Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”
Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.
In this talk I present an overview of key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, the public user interface of which is currently used by over 25,000 people every day. Key technologies used fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts, a practical and successful crowdsourcing interface taking advantage of a large base of users, and an open Application Programming Interface allowing both integration with other online tools and projects as well as open-ended use for text mining purposes.
Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage aspects of common writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. These techniques have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results made freely available online.
Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add additional information and metadata, allowing users around the world to meaningfully and immediately contribute to the project and to actively participate in the curation of its contents. Hundreds of corrections are received and immediately applied to the version controlled texts every day by users based around the world.