Historical document triaging
Optical character recognition (OCR) of historical texts in the hand-press period (roughly 1475–1800) is a challenging task due to the characteristics of the physical documents and the quality of their scanned images. Early printing processes (printing presses, mass paper production, hand-made typefaces) produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink, among many other irregularities. As shown below, OCR results on a typical historical document include a number of false positives (bounding boxes where text does not exist) and false negatives (text that isn’t recognized as such).
To address this issue, we are developing machine-learning algorithms that correct OCR errors by analyzing the distribution and geometry of the bounding boxes returned by the OCR engine. Results are shown below; in this case, the intensity of each colored box represents the confidence our algorithms assign to that bounding box.
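To make the idea concrete, here is a minimal sketch (not the eMOP implementation) of scoring OCR bounding boxes by geometry alone: boxes whose height deviates from the page's dominant line height, or whose aspect ratio is implausible for a word, receive lower confidence. The function name, box format, and thresholds are assumptions for illustration.

```python
# Illustrative sketch: assign confidence scores to OCR bounding boxes
# using simple geometric cues. Boxes are (x, y, width, height) tuples;
# all thresholds below are hypothetical, not taken from the project.
from statistics import median

def box_confidences(boxes):
    """Return a confidence in [0, 1] for each box.

    Heuristic: genuine word boxes tend to share a similar height (the
    dominant line height) and have a word-like aspect ratio; outliers
    such as ink specks or marginal noise deviate on both counts.
    """
    med_h = median(h for (_, _, _, h) in boxes)
    scores = []
    for (x, y, w, h) in boxes:
        # Penalize deviation from the dominant line height.
        height_score = max(0.0, 1.0 - abs(h - med_h) / med_h)
        # Penalize implausible aspect ratios (words are wider than
        # tall, but not extremely elongated).
        ratio = w / h
        aspect_score = 1.0 if 0.5 <= ratio <= 15.0 else 0.3
        scores.append(round(height_score * aspect_score, 3))
    return scores

boxes = [
    (10, 100, 60, 20),   # typical word box
    (80, 101, 45, 19),   # typical word box
    (140, 60, 8, 70),    # tall sliver: likely a false positive
]
print(box_confidences(boxes))  # → [1.0, 0.95, 0.0]
```

A real system would combine many more features (baseline alignment, inter-box spacing, ink density) and learn the weighting from labeled pages rather than hand-tuning thresholds.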
This work is done in collaboration with Professors Laura Mandell and Rick Furuta as part of eMOP (http://emop.tamu.edu), a project that seeks to digitize some 45 million historical document pages from the Eighteenth Century Collections Online (ECCO) and Early English Books Online (EEBO), and create tools (dictionaries, workflows, and databases) to support scholarly research at libraries and museums.
Mass Digitization of Early Modern Texts with Optical Character Recognition (Article)
ACM Journal on Computing and Cultural Heritage, in press, 2017.
Automatic assessment of OCR quality in historical documents (Inproceedings)
Proc. AAAI, 2015.
Diagnosing Page Image Problems with Post-OCR Triage for eMOP (Inproceedings)
Proc. Digital Humanities Conference, 2014.