Chinese Text Project: A Digital Library of Pre-Modern Chinese Literature

Paper presented at Digital Humanities Congress 2016, University of Sheffield

Since its creation in 2005 as an online search tool for a handful of classical Chinese texts, the Chinese Text Project has gradually grown to become the largest and most widely used digital library of pre-modern Chinese texts, as well as a platform for exploring the application of new digital methods to the study of pre-modern Chinese literature. This paper discusses how several unique aspects of the project have contributed to its success. First, it demonstrates how simplifying assumptions that hold for domain-specific OCR (Optical Character Recognition) of historical works have reduced the complexity of the task and thus increased recognition accuracy. Second, it shows how crowd-sourced proofreading and editing using a publicly accessible version-controlled wiki system has made it possible to leverage a large and distributed audience and user base, including many volunteers located outside of traditional academia, to improve the quality of digital content and enable the creation of accurate transcriptions of previously untranscribed texts and editions. Finally, it explores how the implementation of open APIs (Application Programming Interfaces) has greatly expanded the utility of the library as a whole, facilitating open and decentralized integration with other projects, as well as leading to entirely new applications in digital humanities teaching and research.

Posted in Chinese, Digital Humanities, Talks and conference papers

Leveraging Corpus Knowledge for Historical Chinese OCR

Paper to be presented at “Digital Research in East Asian Studies: Corpora, Methods, and Challenges”, Leiden University, July 10, 2016

Abstract

As an increasingly large amount of pre-modern Chinese writing is transcribed into digital form, the resulting digitized corpus comes to represent an ever larger fraction of the total body of extant pre-modern material. Additionally, many distinct items from the total set of pre-modern writings to which one might wish to apply OCR are either non-identical editions of the same abstract work, or commentaries on (and thus repeat much or all of the content of) earlier works. As a result, for historical OCR the probability that a text we wish to recognize contains extensive overlaps with what has previously been transcribed in another document is not only significant but also increases over time as more material is digitized. While general techniques for improving OCR accuracy using language modeling can also be applied successfully to historical OCR, it is also possible that more specialized techniques can take greater advantage of our more extensive knowledge of the historical corpus to further improve recognition accuracy. In this paper, I present an initial evaluation of unsupervised techniques that attempt to leverage knowledge extracted from a large existing corpus of pre-modern Chinese to improve OCR recognition accuracy on unseen historical documents.
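As a rough illustration of the general language-modeling idea mentioned above (and not of the specialized techniques evaluated in the paper), the following Python sketch rescores per-position OCR candidate characters using character bigram counts taken from already-transcribed text. The toy corpus, the candidate lists and their scores, and all function names are hypothetical.

```python
import math
from collections import Counter


def train_bigrams(corpus_lines):
    """Count character unigrams and bigrams in already-transcribed text."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus_lines:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams


def bigram_logprob(prev, cur, unigrams, bigrams, vocab_size):
    """Add-one smoothed estimate of log P(cur | prev)."""
    return math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))


def rescore(candidates, unigrams, bigrams, lm_weight=1.0):
    """Viterbi search over per-position OCR candidates.

    candidates: one list per character position, each containing
    (character, ocr_log_score) pairs; higher scores are better.
    """
    vocab_size = max(len(unigrams), 1)
    # best[ch] = (score, text) for the best path ending in ch
    best = {ch: (score, ch) for ch, score in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for ch, score in position:
            new_best[ch] = max(
                (
                    prev_score
                    + score
                    + lm_weight
                    * bigram_logprob(prev, ch, unigrams, bigrams, vocab_size),
                    prev_text + ch,
                )
                for prev, (prev_score, prev_text) in best.items()
            )
        best = new_best
    return max(best.values())[1]


# Toy corpus standing in for the existing transcribed material (assumption).
corpus = ["學而時習之不亦說乎", "有朋自遠方來不亦樂乎"]
unigrams, bigrams = train_bigrams(corpus)

# Hypothetical OCR output: the fourth character is ambiguous, and the
# recognizer slightly prefers the wrong candidate.
candidates = [
    [("學", -0.1)],
    [("而", -0.1)],
    [("時", -0.1)],
    [("翌", -0.6), ("習", -0.9)],
    [("之", -0.1)],
]
result = rescore(candidates, unigrams, bigrams)
```

In this toy case the bigram evidence from the corpus outweighs the recognizer's small preference for the wrong character. A real system would need a far larger corpus and a more careful smoothing scheme than add-one.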

Posted in Chinese, Digital Humanities, Talks and conference papers

Crowdsourcing, APIs, and a Digital Library of Chinese

Guest post published on Nottingham University’s China Policy Institute blog.

Digital methods have revolutionized many aspects of the study of pre-modern Chinese literature, from the simple but transformative ability to perform full-text searches and automated concordancing, through to the application of sophisticated statistical techniques that would be entirely impractical without the aid of a computer. While the methods themselves have evolved significantly – and continue to do so – one of the most fundamental prerequisites to almost all digital studies of Chinese literature remains access to reliable digital editions of these texts themselves.

Since its origins in 2005 as an online search tool for a small number of classical Chinese texts, the Chinese Text Project has grown to become one of the largest and most widely used digital libraries of pre-modern Chinese writing, containing tens of thousands of transmitted texts dating from the Warring States through to the late Qing and republican period, while also serving as a platform for the application of digital methods to the study of pre-modern Chinese literature. Unlike most digital libraries and full-text databases, users of the site are not passive consumers of its materials, but instead active curators through whose work it is maintained and developed – and increasingly, not all users of the library are human.

Digitization piece by piece

As libraries have increasingly come to recognize the value of digitizing historical works in their holdings, many institutions with significant collections of Chinese materials have committed themselves to large-scale scanning projects, often making the resulting images freely available over the internet. While an enormously positive development in itself, for many scholarly use cases this represents only the first step towards adequate digitization of these works. Scanned images of the pages of a book make its contents accessible in seconds rather than requiring a time-consuming visit to a physical library, but without a machine-readable transcription of the contents of each page, the reader must still navigate through the material one page at a time – finding a particular word or phrase in the work, for example, remains a time-consuming task.

While Optical Character Recognition (OCR) – the process of automatically transforming an image containing text into digitally manipulable characters – can produce results of sufficient accuracy to be useful for full-text search, OCR inevitably introduces a significant number of transcription errors that can only be corrected by manual effort, particularly when applied to historical materials that may be handwritten, damaged, or faded. Proofreading the entire body of material potentially available – likely amounting to hundreds of millions of pages – would be prohibitively expensive, but omitting the proofreading step limits the utility of the data.


Variation in instances of the character “書” in texts from the Siku Quanshu. OCR software must correctly identify all of these instances as corresponding to the same abstract character – a challenging task for a computer.

In an attempt to address this problem, the Chinese Text Project has developed a hybrid system, in which uncorrected OCR results are imported directly into a database system providing full-text search of the source images and assembling the contents of the scanned images of pages into complete textual transcriptions, while also providing an integrated mechanism for users to directly correct the data. Like articles in Wikipedia, the contents of any transcription can be edited directly by any user; unlike Wikipedia, there is always a clear standard against which edits can easily be checked for correctness: the images of the source documents themselves. Proofread texts and uncorrected OCR texts are presented and manipulated in an identical manner within the database, with full-text search and image search available for both – the only distinction being that users are alerted to the possibility of errors in those texts still requiring editing. Volunteers located around the world correct mistakes and add modern punctuation to the texts as time allows and according to their own interests – typically hundreds of corrections are made each day.



Left: A scanned page of text with a transcription created using OCR and subsequently corrected by ctext.org users.
Right: The same data automatically assembled into a transcription of the entire text.
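The hybrid workflow described above can be pictured with a small data model. The following Python sketch is purely illustrative and makes no claim about how ctext.org is actually implemented: each page keeps a revision history that starts with the uncorrected OCR output, user edits append new revisions, and search treats corrected and uncorrected pages identically while flagging which results have been proofread. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    text: str
    editor: str
    proofread: bool  # False for raw OCR output, True for a user correction


@dataclass
class PageTranscription:
    image_id: str  # identifies the scanned image that edits are checked against
    revisions: list = field(default_factory=list)

    def current(self):
        """Return the most recent revision."""
        return self.revisions[-1]

    def edit(self, new_text, editor):
        """Record a correction without discarding earlier revisions."""
        self.revisions.append(Revision(new_text, editor, proofread=True))


def search(pages, query):
    """Return (image_id, proofread) for every page whose current text matches."""
    return [
        (page.image_id, page.current().proofread)
        for page in pages
        if query in page.current().text
    ]


def assemble(pages):
    """Join the current text of each page into one transcription."""
    return "".join(page.current().text for page in pages)


# Hypothetical example: raw OCR with one misrecognized character.
page = PageTranscription("page-001", [Revision("學而時習乏", "ocr", False)])
before = search([page], "習之")
page.edit("學而時習之", "volunteer")
after = search([page], "習之")
```

Keeping every revision, rather than overwriting the text, is what allows mistaken or malicious edits to be compared against the source image and undone.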

Library cards for machines: Application Programming Interfaces (APIs)

As digital libraries grow in size and scope, they also present increasingly valuable opportunities for research using novel methods including text mining, distant reading and other techniques that are often grouped under the label “digital humanities”. At the same time, what can in practice be achieved with individual projects and their associated tools and materials is frequently limited by the particular use cases envisioned by their creators when these resources were first designed and implemented. Application Programming Interfaces (APIs) – standardized mechanisms through which independently developed pieces of computer software are able to share data and functionality in real time – provide one approach to greatly increasing the flexibility and thus utility of such projects.

With these goals in mind, the Chinese Text Project has recently published its own API, which provides machine-readable export of data from any of the texts and editions in its collection, together with a mechanism to make external tools and resources directly accessible through its user interface in the form of user-installable “plugins”. While many of these have already been created – such as those for the MARKUS textual markup platform as well as a range of online Chinese dictionaries – the true value of such APIs lies in their flexibility, in particular their ability to be adapted to new resources and new use cases without requiring additional coordination or development work, often leading to their successful application to use cases quite unrelated to those for which they were first created.

While the Chinese Text Project API was developed primarily with the goal of facilitating online collaboration, it is now also being used to facilitate digital humanities teaching and research. In the spring semester of 2016, graduate students at Harvard University’s Department of East Asian Languages and Civilizations made extensive use of the API as part of the course Digital Methods for Chinese Studies, which introduced students with backgrounds in Chinese history and literature to practical programming and digital humanities techniques. By making use of the API, it was possible for students to obtain digital copies of precisely the texts they needed in exactly the format they required without the significant additional effort this would normally entail. Rather than working with set example texts for which data had been pre-compiled into the required format or spending classroom time dealing with uninteresting methods of data preparation, the API made it possible for students to directly access the texts most relevant to their own work in a consistent format with no additional work. For the same reasons of consistency, programs written to perform a given set of operations on one text could immediately be applied to any other text from the tens of thousands available through the API.
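To give a sense of what this looks like in practice, here is a minimal Python sketch of the pattern described above: one function retrieves a text by its identifier, and the analysis code works unchanged on any text returned in the same format. The base URL, the `gettext` endpoint, the `urn` parameter, and the `fulltext` field name are assumptions that should be checked against the API documentation; the sample data below is made up so that the example runs without network access.

```python
import json
import unicodedata
from collections import Counter
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.ctext.org"  # assumed base URL; verify before use


def fetch_text(urn):
    """Request one textual unit by URN and return the decoded JSON.

    Assumes a 'gettext' endpoint taking a 'urn' query parameter.
    """
    url = f"{API_BASE}/gettext?{urlencode({'urn': urn})}"
    with urlopen(url) as response:
        return json.load(response)


def character_counts(data):
    """Count characters in a response, skipping whitespace and punctuation.

    Assumes the response holds a list of paragraph strings under 'fulltext'.
    """
    counts = Counter()
    for paragraph in data.get("fulltext", []):
        counts.update(
            ch
            for ch in paragraph
            if not ch.isspace() and not unicodedata.category(ch).startswith("P")
        )
    return counts


# Made-up response in the assumed format, used instead of a live request.
sample = {"title": "sample", "fulltext": ["學而時習之，不亦說乎？", "學而不思則罔。"]}
counts = character_counts(sample)
```

With network access, something like `character_counts(fetch_text("ctp:analects/xue-er"))` would apply the same analysis to a text from the library, and swapping in a different URN would apply it to any other text.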


Part of a network graph representing single-character explanatory glosses given in the early character dictionary the Shuowen jiezi. Arrows indicate direction of explanation.

Conclusion

The application of digital techniques developed in other domains to humanities questions – in this case, of crowdsourcing and APIs to the simple but fundamental question “What does the text actually say?” – is characteristic of the emerging field of digital humanities. Collaboration – facilitated in this case by these same techniques – often plays an important role in such projects, due to the enormous amounts of data available, the scalability of digital techniques in comparison to individual manual effort, and the power of digital methods to help make sense of a volume of material larger than any individual could plausibly analyze by hand.

Donald Sturgeon is Postdoctoral Fellow in Chinese Digital Humanities and Social Sciences at Harvard University’s Fairbank Center for Chinese Studies, and editor of the Chinese Text Project.

Posted in Chinese, Digital Humanities

Classical Chinese Digital Humanities

Introducing the first in a series of online tutorials covering basic digital humanities techniques using the Python programming language and the Chinese Text Project API. These tutorials are based in part on material covered in the course CHNSHIS 202: Digital Methods for Chinese Studies, which I taught at Harvard University’s Department of East Asian Languages and Civilizations in Spring 2016.

Intended audience: People with some knowledge of Chinese literature and an interest in digital humanities; no programming experience necessary.

Format: Most of these tutorials will consist of a Jupyter Notebook file. These files contain a mixture of explanations and code that can be modified and run from within your web browser. This makes it very easy to modify, play with, and extend all of the example code. You can also read the tutorials online first (you’ll need to download the files in order to run the code and do the exercises though).

http://digitalsinology.org/classical-chinese-dh-getting-started/

Posted in Chinese, Digital Humanities

Text, Data, and Digital Humanities: APIs and the Chinese Text Project

Yale University, 22 April 2016

As databases, digital libraries, and digital tools grow in size and scope, they present increasingly valuable opportunities for research using novel methods including text mining, distant reading and other techniques that can be grouped under the heading “digital humanities”. At the same time, what can in practice be achieved technically using individual projects and their associated tools and materials is frequently limited by the types of use case envisioned by their creators when these resources were first designed and implemented.

Application Programming Interfaces (APIs) – standardized mechanisms through which independently developed pieces of computer software are able to share data and functionality in real time – provide one approach to greatly increasing the flexibility and thus utility of databases, digital libraries, and other tools. Key to the utility of such APIs is the possibility of functionality and content being reused in different ways by different users, without requiring central implementation of a new mechanism for each use case.

In this talk I describe how the implementation of existing third-party APIs as well as the development of a new special-purpose API for the Chinese Text Project, an online database and digital library of pre-modern Chinese texts, has opened up new opportunities for fast, efficient, and easy to use repurposing of data in a variety of contexts. These include user-driven integration with other online tools and resources (including both those already available, and those still to be constructed), statistical textual analysis and natural language processing research, and teaching and research in Chinese digital humanities.

Posted in Chinese, Digital Humanities, Talks and conference papers

Automated Identification of Parallels and Allusions in Classical Chinese Texts

Paper presented at AAS 2016, Seattle, April 1, 2016

The classical Chinese corpus has long been recognized to contain a vast amount of text reuse: closely related textual content that, for a variety of reasons, occurs in multiple works that might otherwise be considered to be quite independent creations ascribed to entirely different authors. Although this reuse occasionally involves explicit citation of a particular work, or acknowledgment that what follows is a widely known saying as opposed to an original invention of the author, far more often no indication is given that a passage may have been borrowed from elsewhere. Identifying such instances of reuse can shed light upon difficult issues of authorship and textual history, as well as highlight textual variations that can provide clues to the interpretation of obscure or disputed passages.

Digital methods make possible the exploration and analysis of text reuse not only in isolated instances, but systematically across a corpus of works as a whole. In this paper I propose methods of identifying two distinct types of text reuse in the classical Chinese corpus and provide an evaluation of the degrees of accuracy achieved. The first is overtly similar or “parallel” passages, which can be reliably located by defining and maximizing appropriate similarity metrics over regions of text. The second is less direct allusion to the content of earlier works, and is considerably more challenging to identify. I propose an approach that makes use of information retrieval and machine learning techniques, while also leveraging statistical data derived from the more easily identified “parallel” passages.
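As a very simplified illustration of the first type of reuse, the following Python sketch finds runs of identical characters shared by two texts. It uses the matching-block search in Python's standard `difflib` module as a stand-in and is not the similarity metric proposed in the paper: it detects only exact matches that occur in the same relative order in both texts. The two strings are toy examples, and the function name and length threshold are hypothetical.

```python
from difflib import SequenceMatcher


def shared_passages(text_a, text_b, min_length=4):
    """Return (start_in_a, start_in_b, passage) for exact shared runs.

    Only runs of at least min_length characters are reported.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent elements, which would otherwise discard common characters.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        (m.a, m.b, text_a[m.a : m.a + m.size])
        for m in matcher.get_matching_blocks()
        if m.size >= min_length
    ]


# Toy strings: the second loosely reuses a phrase from the first.
text_a = "子曰學而時習之不亦說乎有朋自遠方來"
text_b = "古人云學而時習之不亦樂乎"
parallels = shared_passages(text_a, text_b)
```

Real parallel passages usually contain variant characters, insertions, and omissions, so a practical method would need an approximate similarity measure rather than exact matching.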

Posted in Chinese, Digital Humanities, Talks and conference papers

CHNSHIS 202: Digital Methods for Chinese Studies

I currently (Spring 2016 and 2017; Fall 2017) teach the course CHNSHIS 202: Digital Methods for Chinese Studies at Harvard’s EALC. Below is the syllabus from the 2016 course.

Course Description

This course introduces graduate students in Chinese studies to programming skills and digital humanities techniques of direct practical relevance to research in their discipline. It will consist of weekly lectures, each introducing a specific type of technique, followed by an interactive lab session during which students practice applying the technique to data appropriate to their own research. No background in digital methods or programming is assumed, but students are expected to have basic computing skills and are required to bring a suitable laptop to use during the lab sessions. The techniques covered in this course all have broad applicability to topics in Chinese studies, and students will be expected to apply them to their own research topics and relevant texts as arranged during the first few sessions. The course will end with student presentations in which students apply an appropriate selection of the techniques studied to their own research questions. While examples and coursework will draw upon Chinese language source materials, students primarily working with other East Asian languages are also encouraged to take this course.

Schedule

Week 1 – Introduction

  • Background and basic concepts
  • Representing text on a computer
  • Setting up the Python environment

Week 2 – Introduction to programming

  • Variables, functions, loops, and files

Week 3 – Regular expressions

  • String manipulation and data extraction

Week 4 – Working with structured data

  • Associative arrays, tables, CSV files

Week 5 – Practical data manipulation

  • Automated extraction of data from the web

Week 6 – Textual similarity

  • Introduction to information retrieval

Week 7 – Topic modeling

  • Generating and interpreting data using Mallet

Week 8 – Network visualization with Gephi

  • Representing data as a network graph

Week 9 – Principal component analysis

  • Exploratory data analysis in Python

Week 10 – Machine learning

  • Features, classification, regression

Week 11 – Review and discussion

  • What worked, what didn’t, and why
  • Debugging of issues arising during project work

Week 12 – Student presentations and discussion
 

Coursework and Assessment

  • Class participation (30%)
    Students are expected to attend and actively participate in the practical sessions, completing short assigned problem sets, and applying techniques introduced to their own data.
  • Homework assignments (30%)
    Four short homework assignments will be set based upon the application of digital techniques covered.
  • Final presentations (40%)
    Each student will give one presentation in which techniques introduced during the course are applied to a research topic in Chinese studies.

Learning Outcomes

Having completed this course, students will:

  • Have an understanding of how to apply digital techniques to their own projects.
  • Be able to apply basic programming techniques to extract data from Chinese texts for analysis, and perform various kinds of digital analysis on the resultant data in the context of their research.
  • Possess the basic skills needed to make use of the growing number of open-source Python libraries relevant to textual analysis.

Posted in Chinese, Courses, Digital Humanities

Automation and Collaboration: Exploiting the Digital Medium

6th International Conference of Digital Archives and Digital Humanities,
30 November 2015, National Taiwan University
New Perspectives on Digital Sinology Resources panel

The digital medium presents unique opportunities and challenges for the development of new kinds of resources for the study of Chinese literature. Using examples from the Chinese Text Project, I suggest ways in which digital libraries can leverage the advantages of the digital realm to offer new functionality and services at relatively low cost. This involves the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation.

Posted in Chinese, Digital Humanities, Talks and conference papers

Towards a Scalable Digital Library of Pre-Modern Chinese: From Static Database to Evolving Platform

Presentation at Harvard University, “Advancing Digital Scholarship in Japanese Studies: Innovations and Challenges” Workshop, 7 November 2015
Belfer Case Study Room, CGIS, 9.00 am

In the ten years since first going online, the Chinese Text Project has gradually expanded from a simple tool for searching and navigating a handful of early Chinese texts to become the largest publicly available full-text database of pre-modern Chinese, containing over 20,000 texts and more than 3 billion characters. In this presentation, I discuss technical and structural changes that have made this expansion possible with only limited resources. These changes involve the exploitation of two primary avenues for scalable development: first, the use of automation to achieve goals realistically attainable by computational methods, and second, the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation. Specific examples include the application of optical character recognition both to enable full-text search of scanned early editions and to create draft transcriptions of the same texts that can be proofread by crowd-sourcing, and of natural language processing techniques to the identification of text reuse and automated compilation of dictionary data. I also introduce ongoing work including the development of Application Programming Interfaces (APIs) and related mechanisms that will allow other projects to integrate with and build upon the resources of this digital library in a decentralized way while at the same time avoiding duplication of effort.

Posted in Chinese, Digital Humanities, Talks and conference papers

Textual Relationships in the Pre-Qin and Han Corpus: A Digital Approach

Seminar at Harvard University, Fairbank Center for Chinese Studies, 26 October 2015
Room S153, CGIS South Building, 12.00

Textual parallels among early Chinese transmitted texts are extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving contributions from multiple authors and editors. Partly as a consequence of this complexity, establishing with certainty even approximate dates of authorship for texts and parts of texts is a challenging task. In this presentation, I demonstrate how digital methods grounded in textual and statistical evidence can help us better understand and visualize some of these complex relationships, and how digital methods may offer additional clues as to the likely provenance of disputed texts.

Posted in Chinese, Digital Humanities, Talks and conference papers