Blue Waters User Portal | Science Teams

Removing Barriers to Extreme-Scale Analysis of Digitized Text Archives with Quality Scoring and Error Modeling Strategies

Scott Althaus, University of Illinois at Urbana-Champaign

David Tcheng, Loretta Auvil, Boris Capitanu, Ted Underwood, Scott Althaus

The most important barrier to extreme-scale analysis of unstructured data in digitized text archives is the uncertain quality of the text that Optical Character Recognition (OCR) produces from scanned page images. We will estimate the computational resources required, and develop the optimization strategies needed, to use Blue Waters to detect, score, and correct OCR errors in the HathiTrust Public Use Dataset, the world’s largest corpus of digitized public-domain library volumes. This exploratory project will set the stage for a later project that develops error detection and quality scoring strategies to enhance the volume-level metadata managed by the HathiTrust Research Center (HTRC) with probabilistic quality metrics.

http://faculty.las.illinois.edu/salthaus/