Practical introduction to ctext.org

This tutorial briefly summarizes some of the most common tasks performed on the Chinese Text Project database and digital library from a user perspective, with suggested examples intended to introduce the system’s core functionality.

Initial setup

  • Create an account: Scroll down to the bottom of the left-hand pane and click “Log in”, then fill out the “If you do not have an account…” section.
  • Check font support: Under “About the site” near the top-left corner, click on “Font test page”.

Finding texts

  • Use the “Title search” function on the left-hand pane.
  • Texts that display the “” icon in title search results are linked to scanned sources.
  • Key to icons used in title search results:
    Transcription in the textual database (not user editable).
    User-editable transcription, not generated using OCR.
    User-editable transcription generated using OCR.
    Scanned copy of a particular edition of a text.
  • Examples:
    • Locate a transcription of the 資暇集.
    • Locate a pre-Qin or Han dynasty text in the textual database.

Full-text search

  • First locate and open the transcription of the text (or chapter/juan) you wish to search in, then use the box labeled “Search” near the bottom of the left-hand pane.
  • Examples:
    • Locate the passage in the Analects where Confucius says “君子不器”.
    • Locate all passages in the Zhuangzi where “道” is mentioned.
  • When you search a text in the textual database and get many results, you can use the “Show statistics” link at the top-right to display an interactive summary of where the results appear.

Locating text in scanned primary sources

  • On ctext, scanned representations of texts are searched by means of linked transcriptions. If a transcription is linked to a scan, the “” icon will be displayed in the title search results.
  • Where a text is linked to a scan, clicking the “” icon to the left of any paragraph of text opens the corresponding page of the scan.
  • To search for a specific term or phrase in a scanned text, search the associated transcription for the term or phrase, then click the “” icon to the left of the search result.
  • Particularly in cases where transcriptions have been created using OCR, errors in the transcription may mean that longer phrases are not matched. Try searching for a shorter phrase, or using words that should appear nearby the text you are looking for.
  • Examples:
    • Locate a text with scan links, and try searching and viewing the results within the scanned representation.
    • Try doing this with an OCR-derived transcription.
    • Texts can also be searched from the “Library” section of the site, which contains the scanned texts – this produces exactly the same result as searching the linked transcription.
    • You can also navigate through the scanned representation page by page using the links provided.

Finding parallels to a passage of text

Available for pre-Qin and Han texts and Leishu in the textual database.

  • Locate a passage of text, then click the “” icon to display the parallel summary.
  • Within the results, click the “” icon beside a heading to display each result along with the context in which it occurs.

Finding parallels between two particular texts

Available for pre-Qin and Han texts and Leishu in the textual database.

  • Click the “Advanced search” link towards the bottom of the left-hand pane.
  • In the section labeled “1. Scope”, select the first category, text, or textual unit you wish to search in. For example, to search within the Zhuangzi, you would choose “Pre-Qin and Han”, “Daoism”, then “Zhuangzi” (and leave the fourth box with “[All]” in it).
  • In the section labeled “3. Search parameters”, tick the box under “Parallel passage search”, and select in the category, text, or textual unit you wish to locate parallels to in the same way as the previous step.
  • Click “Search”. The results will list all passages containing parallels.
  • Examples:
    • Try locating all parallels between the Analects and texts in the “Daoism” category.
    • When you have the results, try clicking the “Show statistics” link.
    • Perform the same search again, but with the “Scope” and “Search parameters” reversed, and again try using “Show statistics”.

Locating text by concordance number

Available for texts with concordance data.

  • First open the contents page for the transcription of the text. On the right-hand side of the page, one search box will be displayed for each type of concordance number supported for that text.
  • Example:
    • In this paper by Eric Hutton, the author uses concordance numbers from both the ICS series and the Harvard Yenching series to identify textual references without quoting the Chinese text.

      Use the concordance lookup function on ctext to locate the original Chinese in the Xunzi corresponding to the passage the author translates and references in footnote 17 of that paper.

Getting the concordance number for a piece of text

Available for texts with concordance data.

  • Click the “” icon to the left of any passage with concordance data available.
  • Moving the mouse over the displayed passage will display all concordance lines to which the segment of text under the cursor belongs.
  • To display only those concordance lines relevant to a particular part of the passage, click and drag with the mouse to highlight part of the text (which will appear in green). All concordance lines containing any of the highlighted section of text will be listed.
  • Example:
    • Following on from the previous example, identify the concordance lines which correspond to the line “人之性惡,其善者偽也。” in the Xunzi.

Viewing text and translation side by side

Available for texts with an English translation.

  • Normally when viewing texts with translations, one (potentially quite long) paragraph of Chinese is displayed, followed by the corresponding paragraph of English. It is also possible to display the text and translation aligned much more closely (usually phrase by phrase). To do this, click the “” icon to the left of a passage of text. This function also displays available dictionary information when the mouse cursor is moved over the Chinese text.
  • For long passages, it may be useful to jump straight to the part of the translation corresponding to a particular part of the Chinese text. To do this, search the text for a Chinese phrase, then click the “” icon as before.
  • Example:
    • Experiment with the text of the Zhuangzi.
    • Use this to quickly see how James Legge translated the line “每至於族,吾見其難為,怵然為戒,視為止,行為遲” in that same text.

Viewing commentary

Available for selected texts including the Analects, Mengzi, Mozi, Dao De Jing

  • Click the “” icon to the left of a passage of text. Note that commentaries are also independent texts, so you can use the links in the displayed commentary to switch to reading the commentary instead of the uncommented text.
  • Example:
    • Experiment with the Analects, Mengzi, Mozi, or Dao De Jing.

Locating/inputting obscure and variant characters

  • Open the “Dictionary” section of the site.
  • Depending on the character, you may want to use:
    • Direct input, i.e. just type the character in
    • Component lookup – refer to the brief instructions on the main dictionary page.
    • Radical lookup – first select the radical, then use the additional stroke count to locate the character. You can increase the size of the displayed characters by clicking on the “n strokes” label.
  • Some examples: 䊫, 𥼺, 𧤴, … – try locating these on ctext (without copying and pasting from this page!)
  • Hint: Where you do not have an easy method of inputting a component, you can instead input any other character containing that component, use its decomposition to locate the component, and then search for other characters which also contain it.
  • Support for variant characters which do not exist in Unicode is being added to ctext. These can currently be looked up by components only.
    • Non-Unicode characters can be copied and pasted for use within ctext. When a non-Unicode character is copied, it becomes a “ctext:nnnn” identifier (e.g. ctext:1591). Pasting this into other software (e.g. Microsoft Word) will paste this identifier, not a character or image.
    • You can, however, copy the image of the character from ctext by right-clicking on the character and choosing “Copy image”, and paste this into a Word document. Please remember to cite ctext as the source of the image (e.g. by referencing the “ctext:nnnn” identifier, or providing its URL).
    • Examples: ctext:4543 ctext:8668 ctext:3000 ctext:335
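Because the identifier format is plain text, it is also easy to handle programmatically. A small sketch (the sample string is invented; the pattern simply follows the “ctext:nnnn” form described above):

```python
import re

# Text pasted from ctext in which two non-Unicode characters have been
# replaced by their "ctext:nnnn" identifiers (sample string is invented).
pasted = "此字作ctext:1591，或作ctext:4543。"

# Extract all ctext identifiers from the pasted text.
ids = re.findall(r"ctext:\d+", pasted)
print(ids)
```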

Editing a transcription

The easiest way to correct transcriptions for texts which are linked to scans is to use the “Quick edit” function. To do this:

  • Locate the page of the scan on which the transcription error occurs.
  • Click the “Quick edit” link.
  • An editable transcription of the content of that page will appear. In general, each line of the transcription should correspond to one column of the scanned text.
  • Carefully modify the transcription to agree with the scan, and click “Save changes” when done.
  • If spaces are necessary, be sure to use full-width Chinese spaces and not half-width English spaces.
  • Example:
    • Choose a transcription which has been created automatically with OCR, and make a correction to it.
  • “Versioning” – recording every change made to each text, and maintaining the option to revert to a previous state – is fundamental to the operation of any wiki system. After you have saved your edit:
    • Open the textual transcription that you edited by clicking the “View” link.
    • Scroll up to the top of that page and click “History” to display the list of recent revisions; your recent edit should be listed at the top.
    • Each row represents the state of the transcription after a particular edit was made. Two edits can be compared by selecting them using the two sets of radio buttons at the left of the table and clicking “Compare”; the default selections compare the most recent edit with the state of the text prior to that edit. Click “Compare” to display a visualization of the edit you just made.
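Conceptually, the “Compare” view is a diff between two revisions. As a rough illustration (not ctext’s actual implementation), Python’s difflib can render the same kind of comparison between two saved states of a line of transcription:

```python
import difflib

# Two hypothetical states of one line of a transcription: the later
# revision corrects a single misrecognized character.
before = ["學而時習之不亦說乎"]
after = ["學而時習之不亦説乎"]

# Produce a unified diff between the two revisions.
diff = list(difflib.unified_diff(before, after,
                                 fromfile="revision 1",
                                 tofile="revision 2", lineterm=""))
for line in diff:
    print(line)
```

Lines prefixed with “-” show the earlier state, and lines prefixed with “+” the later one, much as the “Compare” view highlights removed and added text.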

Install and use a plugin

Plugins allow customization of the site’s user interface to support additional functionality. Common use cases include downloading textual data, and connecting to third-party character dictionaries. In order to use a plugin, you must first install it into your account (you only need to do this once for each plugin). To install a plugin:

  • Open “About the site” > “Tools” > “Plugins”.
  • Locate the plugin you wish to add and click “Install”.
  • Click “Install” on the confirmation page which appears.

Once installed, when you open any supported object on ctext (e.g. a chapter of text for a “book” or “chapter” plugin, or a character in the dictionary for a “character” or “word” type plugin), a corresponding link will be displayed in a bar near the top of the screen.

  • Install the “Plain text” plugin and use it to export a chapter of a text.
  • Install the “Frequencies” plugin and use it to view character frequencies in a chapter of a text.
  • Install any plugin with the “character” or “word” type, look up a character in the dictionary, and use the plugin to access the external dictionary.

Advanced topics

The following sections introduce more advanced topics, which require additional effort and/or technical skills beyond the scope of this tutorial.

Creating a new plugin

Plugins are programmatic descriptions in XML of a method of connecting the user interface to an external resource (typically another website). New plugins can be created directly from within your account by modifying the code for an existing plugin. To see what your installed plugins look like, click on “Settings” in the left-hand pane, then click the “editing your XML plugin file” link.

You can create a new plugin by duplicating the code between the “<Plugin>…</Plugin>” tags and editing the duplicate. You should remove the “<Update>” tag from your new plugin, as this may otherwise cause it to be overwritten in future. You can also create a standalone XML file (refer to the many existing examples), host it on your own server, and then install it into your account.

If you encounter issues with your new plugin or the code is not accepted by the ctext interface, you can use the W3C Markup Validator to confirm that your plugin file is valid. A valid plugin file should give a green “This document was successfully checked as CTPPlugin!” result.

Programmatic access

Textual material from the site can also be accessed directly from a programming language such as Python to use for text mining and digital humanities purposes. This requires some additional setup and investment of time to achieve, particularly if you have not programmed before, but step-by-step instructions are available online.

Programmatic access is facilitated by the Application Programming Interface (API), and can be achieved from any language or environment capable of sending HTTP requests. Python is particularly recommended, because a ctext Python module exists which can be used to access the API with very little work. In addition to the general documentation for the API, documentation for all API functions is available and includes working live examples of each.
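As a minimal sketch of what an API call looks like: the function name “gettext” and its “urn” parameter follow the public API documentation, but the sample response below is a simplified assumption of the JSON shape, not captured output:

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.ctext.org"

def api_url(function, **params):
    """Build a request URL for a ctext API function call."""
    return f"{API_BASE}/{function}?{urlencode(params)}"

# Request URL for the full text of a chapter identified by its URN:
url = api_url("gettext", urn="ctp:analects/xue-er")
print(url)

# Responses are JSON; a simplified (assumed) shape might look like:
sample = json.loads('{"title": "學而", "fulltext": ["子曰：學而時習之，不亦說乎？"]}')
passages = sample["fulltext"]
```

In practice the URL would be fetched with urllib.request or the requests library; the ctext Python module wraps these calls so that no manual URL construction is needed.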

Posted in Chinese, Digital Humanities

Towards a sustainable digital infrastructure for historical Chinese texts

Paper to be presented at the Open Conference on Digital Infrastructures for Global Philology, Leipzig University, 21 February 2017.
[Download slides]

This paper describes the current status and initial results of an ongoing project to create a scalable and sustainable infrastructure for the transcription, curation, use and distribution of pre-modern Chinese textual material. The material created is accessed through a purpose-built web interface (ctext.org) by around 25,000 individual users every day; this interface currently ranks as one of the 3000 most frequently visited websites on the Internet in both Taiwan and Hong Kong. While also offering full-text database functionality, from an infrastructural perspective the project is composed of three main components, each designed to be usable individually or in combination to fulfil a diverse set of use cases.

The first of these is a practical Optical Character Recognition (OCR) procedure for historical Chinese documents. OCR for pre-modern Chinese is challenging for a number of technical reasons, including the large numbers of distinct characters involved, but the pre-modern domain also offers potential advantages, including opportunities for taking advantage of features relatively constant across such pre-modern works, such as standardized layouts and writing conventions, and the possibility of leveraging text reuse to improve OCR performance. Given the large volume of extant material together with the rate at which libraries and other scanning centers are scanning pre-modern Chinese works, OCR represents the only practical means by which to transcribe many of these texts in the short to medium term – particularly when considering the “long tail” of less popular and less mainstream material. So far the procedure described has been applied to over 25 million pages of historical texts, including most recently 5 million pages from the Harvard Yenching Library collection, and the results released online.

The second component is an open, online crowdsourcing interface allowing the ongoing correction of such textual transcriptions. Transcriptions created using OCR are imported into this system, which immediately enables their use for full-text image search, and at the same time encourages users to correct mistakes in OCR output as they encounter them. Submitted corrections are applied immediately, and logged in a version control system providing appropriate visualizations of changes made; the system currently receives hundreds of user-generated corrections of this type each day. Users are able to correct errors introduced by the OCR procedure, as well as supplement these results with additional data such as punctuation (typically not recorded in the scanned texts) and markup describing logical structure. Metadata curation is also integrated into the crowdsourcing system.

The third component is an open Application Programming Interface (API) allowing access to full-text data and metadata created and curated through OCR and crowdsourcing as well as by other means. This provides access to machine-readable data about texts and their contents in a flexible way. In order to encourage use of the API to allow better integration with other online projects, in addition to the API itself an open plugin system has been developed, allowing users to extend the user interface of the system in flexible ways and link it to external projects without requiring central coordination or approval, as well as to freely share these extensions with other users. Both the API and plugin system are already in active use, enabling concrete collaboration and decentralized integration with projects based at Leiden University, Academia Sinica, and many others. As the API also allows machine-readable access to what is now the world’s largest database of pre-modern Chinese writing, it also has obvious applications in the fields of text mining and digital humanities. In order to further facilitate such use of the data in research and teaching, a Python library is also available; the API together with this library are currently used to facilitate digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.

Posted in Chinese, Digital Humanities, Talks and conference papers

Deep Dive into Digital and Data Methods for Chinese Studies

I’m really looking forward to taking part in the University of Michigan’s “Deep Dive into Digital and Data Methods for Chinese Studies” series later this month, where I’ll be leading the following sessions:

Text Reuse in Early Chinese Texts: A Digital Approach (Lecture)
Monday, Feb. 13, 2017 12:00 pm – 1:00 pm
Location: Clark Library Instructional Space (240 Hatcher Graduate Library)

Chinese Text Project: Historical Texts in a Digital Age (Workshop)
Monday, Feb. 13, 2017 3:30 pm – 5:00 pm
Location: Hatcher Gallery Lab (100 Hatcher Library)

Practical Large-Scale OCR of Historical Chinese Documents (Presentation and Roundtable Discussion)
Tuesday, Feb. 14, 2017 2:30-3:30 pm
Location: Asia Library Conference Room (421A Hatcher Library North)

Posted in Chinese, Digital Humanities, Talks and conference papers

Classical Chinese Literature in a Digital Age

I’m very excited to be visiting Tsukuba University in Japan next week, where I will be giving a talk titled “Classical Chinese Literature in a Digital Age” (December 15), and also presenting a paper on “Optical Character Recognition for pre-modern Chinese Texts” at a Digital Humanities workshop (December 16).

Posted in Chinese, Digital Humanities, Talks and conference papers

Towards a dynamic, scalable digital library of pre-modern Chinese

Paper to be presented at the 7th International Conference of Digital Archives and Digital Humanities, December 2016, National Taiwan University

This paper contrasts two radically different approaches to full-text digital library design and implementation: firstly, the “static database approach”, in which materials are firstly created, edited, and manually reviewed before being added to a generally static database system; secondly, dynamic approaches in which incompletely reviewed materials are imported into a dynamic system providing similar functionality, but within which significant further editing is intended to take place. To illustrate the technical challenges, benefits, and practical consequences of these two design approaches as reflected in a large-scale digital system, specific examples are drawn from the Chinese Text Project digital library, which initially began as a primarily static database system, and has over time evolved into a primarily dynamic platform. This change has been motivated in particular by a desire to achieve a scalable, sustainable platform for the curation of textual data and metadata, to which new material can be easily added as well as improved over time, while requiring minimal administrative overhead. This paper argues that while there are technical challenges to a dynamic approach, the increase in scalability dynamic approaches offer can have significant advantages, including potential access to a “long tail” of data which might otherwise in practice be overlooked.

Posted in Chinese, Digital Humanities, Talks and conference papers

Harvard Yenching Library Chinese materials added to ctext

Update to the CTP:

Thanks to the support of Harvard Yenching Library, over 5 million pages of scanned materials from the Yenching Library collection have been added to the Library section of the site, including high quality images from the Chinese Rare Books Collection. Approximate transcriptions created using the OCR procedure have also been added to the Wiki, making these materials full-text searchable. In future we hope to collaborate with other libraries to include materials from their Chinese language collections.

Posted in Chinese, Digital Humanities

Stanford DHAsia 2017

I’m delighted to be taking part in Stanford’s exciting DHAsia Digital Humanities initiative in the coming year.

I will be giving a talk titled “Parallels and Allusions in Early Chinese Texts: A Digital Approach” (April 25), as well as leading a workshop session “Chinese Text Project: Historical Texts in a Digital Age” (April 27).

Posted in Chinese, Digital Humanities, Talks and conference papers

Chinese Text Project: A Digital Library of Pre-Modern Chinese Literature

Paper presented at Digital Humanities Congress 2016, University of Sheffield

Since its creation in 2005 as an online search tool for a handful of classical Chinese texts, the Chinese Text Project has gradually grown to become the largest and most widely used digital library of pre-modern Chinese texts, as well as a platform for exploring the application of new digital methods to the study of pre-modern Chinese literature. This paper discusses how several unique aspects of the project have contributed to its success. Firstly it demonstrates how simplifying assumptions holding for domain-specific OCR (Optical Character Recognition) of historical works have made possible reductions in complexity of the task and thus led to increased recognition accuracy. Secondly it shows how crowd-sourced proofreading and editing using a publicly accessible version-controlled wiki system has made it possible to leverage a large and distributed audience and user base, including many volunteers located outside of traditional academia, to improve the quality of digital content and enable the creation of accurate transcriptions of previously untranscribed texts and editions. Finally, it explores how the implementation of open APIs (Application Programming Interfaces) has greatly expanded the utility of the library as a whole, facilitating open and decentralized integration with other projects, as well as leading to entirely new applications in digital humanities teaching and research.

Posted in Chinese, Digital Humanities, Talks and conference papers

Leveraging Corpus Knowledge for Historical Chinese OCR

Paper to be presented at “Digital Research in East Asian Studies: Corpora, Methods, and Challenges”, Leiden University, July 10 2016


As an increasingly large amount of pre-modern Chinese writing is transcribed into digital form, the resulting digitized corpus comes to represent an ever larger fraction of the total body of extant pre-modern material. Additionally, many distinct items from the total set of pre-modern writings to which one might wish to apply OCR are either non-identical editions of the same abstract work, or commentaries on (and thus repeat much or all of the content of) earlier works. As a result, for historical OCR the probability that a text we wish to recognize contains extensive overlaps with what has previously been transcribed in another document is not only significant but also increases over time as more material is digitized. While general techniques for improving OCR accuracy using language modeling can also be applied successfully to historical OCR, it is also possible that more specialized techniques can take greater advantage of our more extensive knowledge of the historical corpus to further improve recognition accuracy. In this paper, I present an initial evaluation of unsupervised techniques that attempt to leverage knowledge extracted from a large existing corpus of pre-modern Chinese to improve OCR recognition accuracy on unseen historical documents.
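One very simple way to exploit text reuse of this kind (purely illustrative, and not the technique evaluated in the paper) is to match a noisy OCR line against previously transcribed lines and treat a sufficiently close match as a correction candidate:

```python
import difflib

# A miniature "corpus" of previously transcribed lines (illustrative).
corpus = [
    "道可道非常道",
    "名可名非常名",
    "天下皆知美之為美",
]

# A hypothetical OCR output containing one misrecognized character.
ocr_line = "道可道非常遣"

# A corpus line sufficiently similar to the OCR output suggests a
# likely correction, exploiting text reuse across the corpus.
matches = difflib.get_close_matches(ocr_line, corpus, n=1, cutoff=0.8)
candidate = matches[0] if matches else ocr_line
print(candidate)
```

Real systems would need to handle partial overlaps, genuine textual variation between editions, and lines with no parallel at all, but the underlying intuition is the same: prior transcriptions constrain what an OCR line is likely to say.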

Posted in Chinese, Digital Humanities, Talks and conference papers

Crowdsourcing, APIs, and a Digital Library of Chinese

Guest post published on Nottingham University’s China Policy Institute blog.

Digital methods have revolutionized many aspects of the study of pre-modern Chinese literature, from the simple but transformative ability to perform full-text searches and automated concordancing, through to the application of sophisticated statistical techniques that would be entirely impractical without the aid of a computer. While the methods themselves have evolved significantly – and continue to do so – one of the most fundamental prerequisites to almost all digital studies of Chinese literature remains access to reliable digital editions of these texts themselves.

Since its origins in 2005 as an online search tool for a small number of classical Chinese texts, the Chinese Text Project has grown to become one of the largest and most widely used digital libraries of pre-modern Chinese writing, containing tens of thousands of transmitted texts dating from the Warring States through to the late Qing and republican period, while also serving as a platform for the application of digital methods to the study of pre-modern Chinese literature. Unlike most digital libraries and full-text databases, users of the site are not passive consumers of its materials, but instead active curators through whose work it is maintained and developed – and increasingly, not all users of the library are human.

Digitization piece by piece

As libraries have increasingly come to recognize the value of digitizing historical works in their holdings, many institutions with significant collections of Chinese materials have committed themselves to large-scale scanning projects, often making the resulting images freely available over the internet. While an enormously positive development in itself, for many scholarly use cases this represents only the first step towards adequate digitization of these works. Scanned images of the pages of a book make its contents accessible in seconds rather than requiring a time-consuming visit to a physical library, but without a machine-readable transcription of the contents of each page, the reader must still navigate through the material one page at a time – finding a particular word or phrase in the work, for example, remains a time-consuming task.

While Optical Character Recognition (OCR) – the process of automatically transforming an image containing text into digitally manipulable characters – can produce results of sufficient accuracy to be useful for full-text search, OCR inevitably introduces a significant number of transcription errors which can only be corrected by manual effort, particularly when applied to historical materials which may be handwritten, damaged, and faded. Proofreading the entire body of material potentially available – likely amounting to hundreds of millions of pages – would be prohibitively expensive, but omitting the proofreading step limits the utility of the data.

Variation in instances of the character “書” in texts from the Siku Quanshu. OCR software must correctly identify all of these instances as corresponding to the same abstract character – a challenging task for a computer.

In an attempt to address this problem, the Chinese Text Project has developed a hybrid system, in which uncorrected OCR results are imported directly into a database system providing full-text search of the source images and assembling the contents of the scanned images of pages into complete textual transcriptions, while also providing an integrated mechanism for users to directly correct the data. Like articles in Wikipedia, the contents of any transcription can be edited directly by any user; unlike Wikipedia, there is always a clear standard against which edits can easily be checked for correctness: the images of the source documents themselves. Proofread texts and uncorrected OCR texts are presented and manipulated in an identical manner within the database, with full-text search and image search available for both – the only distinction being that users are alerted to the possibility of errors in those texts still requiring editing. Volunteers located around the world correct mistakes and add modern punctuation to the texts as time allows and according to their own interests – typically hundreds of corrections are made each day.

Left: A scanned page of text with a transcription created using OCR and subsequently corrected by users.
Right: The same data automatically assembled into a transcription of the entire text.

Library cards for machines: Application Programming Interfaces (APIs)

As digital libraries grow in size and scope, they also present increasingly valuable opportunities for research using novel methods including text mining, distant reading and other techniques that are often grouped under the label “digital humanities”. At the same time, what can in practice be achieved with individual projects and their associated tools and materials is frequently limited by the particular use cases envisioned by their creators when these resources were first designed and implemented. Application Programming Interfaces (APIs) – standardized mechanisms through which independently developed pieces of computer software are able to share data and functionality in real time – provide one approach to greatly increasing the flexibility and thus utility of such projects.

With these goals in mind, the Chinese Text Project has recently published its own API, which provides machine-readable export of data from any of the texts and editions in its collection, together with a mechanism to make external tools and resources directly accessible through its user interface in the form of user-installable “plugins”. While many of these have already been created – such as those for the MARKUS textual markup platform as well as a range of online Chinese dictionaries – the true value of such APIs lies in their flexibility, in particular their ability to be adapted to new resources and new use cases without requiring additional coordination or development work, often leading to their successful application to use cases quite unrelated to those for which they were first created.

While the Chinese Text Project API was developed primarily with the goal of facilitating online collaboration, it is now also being used to facilitate digital humanities teaching and research. In the spring semester of 2016, graduate students at Harvard University’s Department of East Asian Languages and Civilizations made extensive use of the API as part of the course Digital Methods for Chinese Studies, which introduced students with backgrounds in Chinese history and literature to practical programming and digital humanities techniques. By making use of the API, it was possible for students to obtain digital copies of precisely the texts they needed in exactly the format they required without the significant additional effort this would normally entail. Rather than working with set example texts for which data had been pre-compiled into the required format or spending classroom time dealing with uninteresting methods of data preparation, the API made it possible for students to directly access the texts most relevant to their own work in a consistent format with no additional work. For the same reasons of consistency, programs written to perform a given set of operations on one text could immediately be applied to any other text from the tens of thousands available through the API.
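As a concrete illustration of the kind of first exercise such a course might include (the text is the opening line of the Analects with punctuation removed; the analysis itself is a generic sketch, not course material):

```python
from collections import Counter

# Opening line of the Analects, punctuation removed for counting.
text = "子曰學而時習之不亦說乎有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"

# Per-character frequency counts: a common first step when analyzing
# classical Chinese, where most words are a single character.
freq = Counter(text)
for char, count in freq.most_common(3):
    print(char, count)
```

The same few lines apply unchanged to any text retrieved through the API, which is precisely the consistency advantage described above.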

Part of a network graph representing single-character explanatory glosses given in the early character dictionary the Shuowen jiezi. Arrows indicate direction of explanation.


The application of digital techniques developed in other domains to humanities questions – in this case, of crowdsourcing and APIs to the simple but fundamental question “What does the text actually say?” – is characteristic of the emerging field of digital humanities. Collaboration – facilitated in this case by these same techniques – often plays an important role in such projects, due to the enormous amounts of data available, the scalability of digital techniques in comparison to individual manual effort, and the power of digital methods to help make sense of a volume of material larger than any individual could plausibly analyze by hand.

Donald Sturgeon is Postdoctoral Fellow in Chinese Digital Humanities and Social Sciences at Harvard University’s Fairbank Center for Chinese Studies, and editor of the Chinese Text Project.

Posted in Chinese, Digital Humanities