Paper to be presented at the Open Conference on Digital Infrastructures for Global Philology, Leipzig University, 21 February 2017.
This paper describes the current status and initial results of an ongoing project to create a scalable and sustainable infrastructure for the transcription, curation, use and distribution of pre-modern Chinese textual material. The material created is accessed through a purpose-built web interface (http://ctext.org) by around 25,000 individual users every day; this interface currently ranks among the 3,000 most frequently visited websites on the Internet in both Taiwan and Hong Kong. While it also offers full-text database functionality, from an infrastructural perspective the project is composed of three main components, each designed to be usable individually or in combination to fulfil a diverse set of use cases.
The first of these is a practical Optical Character Recognition (OCR) procedure for historical Chinese documents. OCR for pre-modern Chinese is challenging for a number of technical reasons, including the large number of distinct characters involved. The pre-modern domain also offers potential advantages, however: many features, such as standardized layouts and writing conventions, are relatively constant across pre-modern works and can be exploited, and text reuse can be leveraged to improve OCR performance. Given the large volume of extant material, together with the rate at which libraries and other scanning centers are digitizing pre-modern Chinese works, OCR represents the only practical means of transcribing many of these texts in the short to medium term, particularly when considering the “long tail” of less popular and less mainstream material. So far the procedure described has been applied to over 25 million pages of historical texts, including most recently 5 million pages from the Harvard Yenching Library collection, and the results have been released online.
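To make the idea of leveraging text reuse more concrete, the following is a minimal illustrative sketch and not the procedure actually used by the project. It assumes that an OCR engine yields, for each character position, a list of candidate characters with confidence scores, and that character bigrams attested in already-transcribed texts can be used to prefer candidates that continue a known passage. The function names, the scoring formula and the `reuse_weight` parameter are all hypothetical.

```python
import math
from collections import Counter


def bigram_counts(reference_texts):
    """Count character bigrams in already-transcribed reference texts."""
    counts = Counter()
    for text in reference_texts:
        for i in range(len(text) - 1):
            counts[text[i:i + 2]] += 1
    return counts


def rerank(candidates, counts, reuse_weight=1.0):
    """Pick one character per position, rewarding attested bigrams.

    candidates: non-empty list of positions; each position is a non-empty
    list of (character, ocr_score) pairs. This scoring scheme is an
    assumption made for illustration only.
    """
    # best[ch] = (total score, text so far) for the best path ending in ch
    best = {ch: (score, ch) for ch, score in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for ch, score in position:
            options = []
            for prev, (prev_score, prev_text) in best.items():
                # Bonus grows with how often this bigram occurs in the references
                bonus = reuse_weight * math.log1p(counts[prev + ch])
                options.append((prev_score + score + bonus, prev_text + ch))
            new_best[ch] = max(options)
        best = new_best
    return max(best.values())[1]


# Hypothetical OCR output in which a visually similar wrong character
# narrowly outscores the correct one at the third position.
reference = bigram_counts(["學而時習之"])
ocr_candidates = [
    [("學", 0.9)],
    [("而", 0.8)],
    [("時", 0.4), ("晴", 0.5)],
    [("習", 0.9)],
]
print(rerank(ocr_candidates, reference))
```

With the reference text available, the attested bigrams outweigh the small difference in OCR score; setting `reuse_weight=0.0` reduces the function to choosing the highest-scoring candidate at each position independently.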
The second component is an open, online crowdsourcing interface allowing the ongoing correction of such textual transcriptions. Transcriptions created using OCR are imported into this system, which immediately enables their use for full-text image search, and at the same time encourages users to correct mistakes in OCR output as they encounter them. Submitted corrections are applied immediately, and logged in a version control system providing appropriate visualizations of changes made; the system currently receives hundreds of user-generated corrections of this type each day. Users are able to correct errors introduced by the OCR procedure, as well as supplement these results with additional data such as punctuation (typically not recorded in the scanned texts) and markup describing logical structure. Metadata curation is also integrated into the crowdsourcing system.
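The correction workflow described above can be pictured with a small sketch. It is not the project's implementation: it simply assumes that each submitted correction is stored as a new revision of a transcription, and that the changes between consecutive revisions are derived with Python's standard `difflib` module. The class and method names are hypothetical.

```python
import difflib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Revision:
    user: str
    text: str
    timestamp: datetime


@dataclass
class Transcription:
    """A transcription whose every correction is kept as a revision."""
    revisions: list = field(default_factory=list)

    @property
    def current(self):
        return self.revisions[-1].text

    def submit(self, user, new_text):
        # Corrections take effect immediately and are logged with user and time
        self.revisions.append(
            Revision(user, new_text, datetime.now(timezone.utc))
        )

    def changes(self, index):
        """Return (operation, old, new) tuples for revision `index`
        compared with the revision before it."""
        if index < 1 or index >= len(self.revisions):
            raise ValueError("index must refer to a revision with a predecessor")
        old = self.revisions[index - 1].text
        new = self.revisions[index].text
        matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
        return [
            (tag, old[i1:i2], new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]


page = Transcription()
page.submit("ocr", "學而晴習之")       # raw OCR output containing one error
page.submit("user1", "學而時習之")     # a user corrects the wrong character
page.submit("user2", "學而時習之，")   # another user adds punctuation
print(page.changes(1))
print(page.changes(2))
```

Keeping every revision rather than overwriting the text is what makes it possible to display or revert individual changes later.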
The third component is an open Application Programming Interface (API) allowing access to full-text data and metadata created and curated through OCR and crowdsourcing as well as by other means. This provides flexible access to machine-readable data about texts and their contents. To encourage use of the API and allow better integration with other online projects, an open plugin system has been developed in addition to the API itself. This allows users to extend the user interface of the system and link it to external projects without requiring central coordination or approval, and to freely share these extensions with other users. Both the API and plugin system are already in active use, enabling concrete collaboration and decentralized integration with projects based at Leiden University, Academia Sinica, and many others. As the API allows machine-readable access to what is now the world’s largest database of pre-modern Chinese writing, it also has obvious applications in the fields of text mining and digital humanities. To further facilitate such use of the data in research and teaching, a Python library is also available; the API together with this library is currently used to facilitate digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.
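As a rough illustration of machine-readable access, the sketch below builds a request URL for a text identified by a URN and extracts paragraphs from a JSON response. Several details are assumptions rather than statements from this paper and should be checked against the project's API documentation: the base URL `https://api.ctext.org`, the `gettext` function with its `urn` parameter, and the `fulltext` and `error` fields of the response. The helper names are hypothetical, and the sample response is hand-written rather than real API output, so the example runs without network access.

```python
import json
from urllib.parse import urlencode

# Assumed base URL; verify against the project's API documentation.
API_BASE = "https://api.ctext.org"


def build_request_url(function, **params):
    """Build a request URL for an API function and its parameters."""
    query = urlencode(params)
    url = f"{API_BASE}/{function}"
    return f"{url}?{query}" if query else url


def paragraphs_from_response(body):
    """Extract the paragraphs of a text from a JSON response body.

    Assumes the response holds a list of paragraphs under "fulltext"
    and reports problems under "error"; both field names are assumptions.
    """
    data = json.loads(body)
    if "error" in data:
        raise RuntimeError(f"API error: {data['error']}")
    return data.get("fulltext", [])


url = build_request_url("gettext", urn="ctp:analects/xue-er")
print(url)

# Hand-written sample standing in for a real response.
sample_body = json.dumps({"title": "學而", "fulltext": ["學而時習之，不亦說乎？"]})
for paragraph in paragraphs_from_response(sample_body):
    print(paragraph)
```

In practice the URL would be fetched with an HTTP client, or the project's Python library would be used instead of hand-built requests.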