Paper to be presented at “Digital Research in East Asian Studies: Corpora, Methods, and Challenges”, Leiden University, July 10, 2016
As an increasingly large amount of pre-modern Chinese writing is transcribed into digital form, the resulting digitized corpus comes to represent an ever larger fraction of the total body of extant pre-modern material. Additionally, many of the distinct pre-modern writings to which one might wish to apply OCR are either non-identical editions of the same abstract work or commentaries on earlier works, which therefore repeat much or all of the content of those works. As a result, for historical OCR the probability that a text we wish to recognize overlaps extensively with material already transcribed from another document is not only significant but also increases over time as more material is digitized. While general techniques for improving OCR accuracy using language modeling can be applied successfully to historical OCR, more specialized techniques may be able to take greater advantage of our more extensive knowledge of the historical corpus to improve recognition accuracy further. In this paper, I present an initial evaluation of unsupervised techniques that attempt to leverage knowledge extracted from a large existing corpus of pre-modern Chinese to improve OCR accuracy on unseen historical documents.