Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”
Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.
数字人文与数字图书馆:中国历代文献的文字识别、群众外包及文本挖掘
本次演讲介绍中国哲学书电子化计划中的主要技术。中国哲学书电子化计划是全球最大规模的前现代中文传世文献电子图书馆之一,目前,每日有25,000多用户使用其公开操作界面。主要原创技术可归类为三种:(一)前现代中文资料的文字识别技术(OCR)、(二)借用大量用户劳力的群众外包界面、(三)既实现与其它线上工具之间的整合、又提供文本挖掘途径的开放式应用程式界面(API)。
第一个原创技术是专门为中国前现代文献设计的文字识别技术。此技术利用前现代文献常见的写作、印刷特征以及已数字化的大量文献来实现具有高精确性以及扩充性的文字识别系统。该系统已处理2,500多万页资料,其结果已在网络上公开。
第二,通过独特的群众外包界面,世界各地的用户可纠正文字识别错误,补充后设资料,从而能够及时参与数字化过程并积极协助内容的扩展。全球用户每日提供上百次的校勘,系统将此及时储存到具有版本控制功能的数据库。
第三,系统的应用程式界面可用于文本挖掘,亦可用于扩充一般使用界面的功能,从而有效地借用日益增长的资料库文本内容来达到数字人文研究和教学的目的。通过此应用程式界面,为Python等程式语言所开发的专门组件可用于数字人文教学;JavaScript组件便于他人开发易用的线上工具,使他人所开发的应用工具能够直接读取和操作电子图书馆中的各种内容。
In this talk I present an overview of the key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, whose public user interface is currently used by over 25,000 people every day. These technologies fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts; a practical and successful crowdsourcing interface that takes advantage of a large user base; and an open Application Programming Interface (API) that allows both integration with other online tools and projects and open-ended use for text mining.
Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage common features of pre-modern writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. They have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results have been made freely available online.
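As a rough illustration of how existing transcriptions can inform character recognition, the sketch below re-ranks per-character OCR candidates using bigram counts gathered from already-transcribed text. This is an assumed, simplified technique chosen for illustration, not a description of the project's actual pipeline; the function names and the toy corpus are hypothetical.

```python
from collections import Counter
from typing import List


def bigram_counts(corpus: List[str]) -> Counter:
    """Count adjacent character pairs in already-transcribed texts."""
    counts = Counter()
    for text in corpus:
        counts.update(text[i:i + 2] for i in range(len(text) - 1))
    return counts


def pick_characters(candidates: List[List[str]], counts: Counter) -> str:
    """Greedy left-to-right choice among OCR candidates (hypothetical helper).

    candidates[i] lists the recogniser's guesses for position i, best first.
    The first position simply takes the top guess; each later position takes
    the guess forming the most frequent bigram with the previous choice.
    Ties fall back to the recogniser's own ordering.
    """
    if not candidates:
        return ""
    result = [candidates[0][0]]
    for options in candidates[1:]:
        previous = result[-1]
        result.append(max(options, key=lambda ch: counts[previous + ch]))
    return "".join(result)


# Toy data: the recogniser confuses the visually similar 面 and 而.
corpus = ["學而時習之", "學而不思則罔"]
ocr_candidates = [["學"], ["面", "而"], ["時"]]
corrected = pick_characters(ocr_candidates, bigram_counts(corpus))
```

A production system would more plausibly combine such contextual evidence with the recogniser's confidence scores rather than overriding them outright.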
Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add further information and metadata. This allows users around the world to contribute to the project meaningfully and immediately, and to participate actively in the curation of its contents. Every day, users around the world submit hundreds of corrections, which are applied immediately to the version-controlled texts.
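The idea of applying corrections immediately while keeping them under version control can be sketched as a text object with an append-only revision history. This is a minimal sketch under assumed names (`Revision`, `VersionedText`); the project's real data model is not described in this abstract.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Revision:
    """One user-submitted correction (hypothetical record layout)."""
    user: str
    position: int
    old: str
    new: str


@dataclass
class VersionedText:
    text: str
    history: List[Revision] = field(default_factory=list)

    def correct(self, user: str, position: int, old: str, new: str) -> None:
        """Apply a correction at once, recording it so it can be reverted."""
        end = position + len(old)
        if self.text[position:end] != old:
            raise ValueError("text at this position does not match 'old'")
        self.text = self.text[:position] + new + self.text[end:]
        self.history.append(Revision(user, position, old, new))

    def undo_last(self) -> Revision:
        """Revert the most recent correction and return it."""
        rev = self.history.pop()
        end = rev.position + len(rev.new)
        self.text = self.text[:rev.position] + rev.old + self.text[end:]
        return rev


# Example: a user fixes a single mis-recognised character.
page = VersionedText("學而時習之,不亦悦乎")
page.correct(user="reader1", position=8, old="悦", new="說")
```

Checking the `old` value before applying a change guards against two users editing the same span from stale copies of the page.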
Thirdly, a specialized API, created both for text mining and for extending the primary user interface, enables efficient access to the ever-growing data set for use in digital humanities research and teaching. Specialized modules for programming languages such as Python allow for intuitive use in digital humanities teaching contexts, while simple access via JavaScript enables the creation of easy-to-use online tools that can directly access and operate on textual materials stored in the library.
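A simple text mining task over such an API might look like the following sketch, which fetches a text by identifier and counts character frequencies. The base URL, the `gettext` endpoint name, the `urn` parameter, the example URN, and the `fulltext` response field are assumptions made for illustration and should be checked against the project's API documentation.

```python
import json
from collections import Counter
from typing import List
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.ctext.org"  # assumed base URL


def build_gettext_url(urn: str) -> str:
    """Build a request URL for one text (endpoint and parameter assumed)."""
    return f"{API_BASE}/gettext?{urlencode({'urn': urn})}"


def fetch_passages(urn: str, timeout: float = 10.0) -> List[str]:
    """Fetch a text's passages; assumes a JSON body with a 'fulltext' list."""
    with urlopen(build_gettext_url(urn), timeout=timeout) as response:
        payload = json.load(response)
    return payload.get("fulltext", [])


def character_frequencies(passages: List[str]) -> Counter:
    """Count CJK Unified Ideographs (U+4E00 to U+9FFF), skipping punctuation."""
    counts = Counter()
    for passage in passages:
        counts.update(ch for ch in passage if "\u4e00" <= ch <= "\u9fff")
    return counts


# Offline example using sample passages instead of a network call.
sample = ["子曰:學而時習之", "子曰:巧言令色"]
frequencies = character_frequencies(sample)
```

To run against live data, one would call `character_frequencies(fetch_passages("ctp:analects/xue-er"))`, where the URN shown is an assumed example identifier.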