Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR

Published in the Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), 2017.


Many mainstream OCR techniques involve training a character recognition model on labeled example images of each individual character to be recognized. For modern printed writing, such data can easily be created by automated methods, such as rasterizing appropriate font data to produce clean example images. For historical OCR in printing and writing styles distinct from those embodied in modern fonts, appropriate character images must instead be extracted from actual historical documents to achieve good recognition accuracy. For languages with small character sets it may be feasible to perform this process manually, but for languages with many thousands of characters, such as Chinese, manually collecting this data is often not practical.

This paper presents an unsupervised method to extract this data from two unaligned, unstructured, and noisy inputs: first, a corpus of transcribed documents; second, a corpus of scanned documents in the desired printing or writing style, some fraction of which are editions of texts included in the transcription corpus. The procedure is shown to be capable of using this data, together with an OCR engine trained only on modern printed Chinese, to retrain the same engine to recognize pre-modern Chinese texts with a 43% reduction in overall error rate.
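
The abstract does not spell out the extraction procedure, so the following is only an illustrative sketch of one way such label harvesting could work, not the paper's actual method. It assumes the baseline OCR engine yields a character guess plus a bounding box for each glyph, and it uses Python's standard `difflib` to align the noisy OCR string against a transcription. The function name `harvest_labels`, the `min_run` threshold, and the use of integers as stand-in boxes are all hypothetical.

```python
from difflib import SequenceMatcher


def _anchored(ops, k, min_run):
    """True if opcode k sits between two 'equal' runs of at least min_run characters."""
    def long_equal(op):
        tag, i1, i2, _, _ = op
        return tag == "equal" and i2 - i1 >= min_run
    return 0 < k < len(ops) - 1 and long_equal(ops[k - 1]) and long_equal(ops[k + 1])


def harvest_labels(ocr_chars, transcription, min_run=3):
    """Pair glyph boxes with labels taken from an aligned transcription.

    ocr_chars: list of (guessed_char, box) from a baseline OCR engine (assumed format).
    transcription: string of the same text from the transcription corpus.
    Returns a list of (box, label) pairs usable as training examples.
    """
    ocr_text = "".join(ch for ch, _ in ocr_chars)
    ops = SequenceMatcher(None, ocr_text, transcription, autojunk=False).get_opcodes()
    labeled = []
    for k, (tag, i1, i2, j1, j2) in enumerate(ops):
        if tag == "equal":
            # OCR and transcription agree: keep the glyph with its confirmed label.
            for i in range(i1, i2):
                labeled.append((ocr_chars[i][1], ocr_text[i]))
        elif tag == "replace" and i2 - i1 == j2 - j1 and _anchored(ops, k, min_run):
            # Same-length disagreement flanked by long agreements: treat it as an
            # OCR error and take the label from the transcription instead.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                labeled.append((ocr_chars[i][1], transcription[j]))
    return labeled


# Usage: the OCR guess misreads one character; integers stand in for boxes.
ocr = [(c, i) for i, c in enumerate("天地玄黃字宙洪荒")]
pairs = harvest_labels(ocr, "天地玄黃宇宙洪荒")
```

In this sketch, only disagreements surrounded by long matching runs are relabeled, which is one simple way to avoid harvesting labels from regions where the scan and the transcription are different editions or are misaligned. Insertions and deletions are skipped entirely.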

[Full paper]
