Constructing a crowdsourced linked open knowledge base of Chinese history

Paper presented at Pacific Neighborhood Consortium Conference 2021


This paper introduces a crowdsourced approach to knowledge base construction for historical data based upon annotation of historical source materials. Building on an existing digital library of premodern Chinese texts and adapting techniques from other annotation and knowledge base projects, this approach lays the groundwork for a scalable, sustainable, linked open repository of data covering around 3000 years of recorded Chinese history.

Posted in Chinese, Digital Humanities, Talks and conference papers

AAS 2021

Zoom link:


  • Recommended web browsers: Firefox or Chrome; Safari and Edge should also work for most tasks, but have not been fully tested.
  • Create a ctext account and log in.
  • Check your e-mail (and spam folder) for an e-mail sent from the system, and click the link in the e-mail to validate your account.
  • Go to “Settings” at the bottom left, enter the API key specified in the live session in the box under “API key”, and click “Save”.
  • Install the Text Tools plugin by opening this link, and then clicking “Install”.
  • Install the Annotation plugin by opening this link, and then clicking “Install”.

Some parts of the material that will be covered in the session are available in step-by-step tutorials, which also include other details and examples and might be useful if you want to come back to the material later:

These parts of the instructions should also be useful:

Lastly, some of these papers may be of interest:

Posted in Chinese, Digital Humanities, Talks and conference papers

Crowdsourcing Chinese history: distributed transcription, annotation, and datafication

Poster presented at Linked Pasts 2020

For more details, please see the extended abstract “Crowdsourcing the historical record: knowledge base construction for Chinese history at scale” presented at Taiwan’s DADH 2020 conference.


The tasks of semantically annotating historical primary source materials and systematically recording knowledge about historical entities have closely connected conceptual relationships. Annotations can be leveraged to extract knowledge about entities, and knowledge about entities can be leveraged to aid in the efficient annotation of texts. Both tasks also rely in practice upon accurate transcriptions of primary source materials, which – like annotations and structured knowledge – are costly to produce in the first instance. This poster describes a crowdsourced approach in which all three tasks are carried out in parallel, involving not only the distributed transcription of texts, but also the creation of a knowledge base continually expanded through user contributions made partly through annotation, immediately and automatically contributing to improved automated assistance with ongoing and future crowdsourced annotation.

Posted in Chinese, Digital Humanities, Talks and conference papers

Digitizing Premodern Text with the Chinese Text Project

Paper published in Journal of Chinese History


The widespread availability of digitized premodern textual sources – together with increasingly sophisticated means for their manipulation – has brought enormous practical benefits to scholars whose work relies upon reference to their contents. While great progress has been made with the construction of ever more comprehensive database systems and archives, far more remains not only possible but also realistically achievable in the near future. This paper discusses some of the key challenges faced, and progress made towards solving them, in the context of a widely used open digital platform attempting to expand the range of digitized sources available while simultaneously increasing the scope of meaningful tasks that can be performed with them computationally. This paper aims to suggest how seemingly simple human-mediated additions to the digitized historical record – when combined with the power of digital systems to repeatedly perform mechanical tasks at enormous scales – quickly lead to transformative changes in the feasible scope of computational analysis of premodern writing.

Full text through publisher site (PDF, paywall) / Free online version (full text but no PDF)

Part of a JCH Special Issue on Digital Humanities.

Posted in Chinese, Digital Humanities

MARAAS workshop

Materials from a workshop held as part of the MARAAS Conference: Asian Studies in the Digital Age at Dickinson College, Carlisle, PA. [Download slides]


  • Create a free ctext account and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if you have not, the link above will display a warning/reminder in red to do so).
  • Enter the API key in the box labeled “API key”, and click “Save”.

We will follow parts of the “Practical introduction to” and “Text Tools for” tutorials with a few changes and a few new features not yet included in the tutorials.

Link to Text Tools:

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
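
As a rough illustration of what this setting changes, the following Python sketch contrasts the two modes. It is not Text Tools' own implementation: the `tokenize` function and its `by_character` parameter are hypothetical names, and treating a token as a run of word characters is an assumption made for the example.

```python
import re


def tokenize(text, by_character):
    """Split text into tokens (hypothetical helper, not part of Text Tools).

    by_character=True  -> each word character is its own token (Chinese).
    by_character=False -> each run of word characters is one token (English).
    """
    pattern = r"\w" if by_character else r"\w+"
    return re.findall(pattern, text)


# Punctuation is dropped in both modes because it never matches \w.
chinese_tokens = tokenize("學而時習之,不亦說乎?", by_character=True)
english_tokens = tokenize("The road to the City of Emeralds", by_character=False)

print(chinese_tokens)
print(english_tokens)
```

With the option on, a Chinese sentence becomes a list of single characters, which suits writing without spaces between words; with it off, an English sentence becomes a list of whole words.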

Overviews of functionality

The following give some basic illustrations of what can be done in Text Tools through concrete examples:

Other suggested examples

As well as the examples shown in the tutorials:

  • To see how the tool works with tokenized materials, download the following English text files (e.g. right-click each link and choose “Save as”):
  • Try some regexes on the English examples. A useful expression is likely to be “\w+”, which matches any run of word characters (letters, digits, and underscore; intuitively, a word). Try as an example “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
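
To preview what the regex exercises above should return, the two example expressions can be tried in Python's `re` module, which treats `\w` as matching Chinese characters as well as Latin letters. This is only a sketch of the matching step, not of the Text Tools interface; the sample sentences are written in the shape the biography regex expects and are not quoted from the ctext transcription of the 宋史.

```python
import re

# Regex from the exercise: name, courtesy name (字), place of origin.
BIOGRAPHY = re.compile(r"(\w+),字(\w+),(\w+)人。")

# Illustrative input only (an assumption for this example, not ctext data).
sample = "趙普,字則平,幽州薊人。曹彬,字國華,真定靈壽人。"

# With capture groups, findall returns one tuple per match, which
# corresponds to one row per match with one column per group.
rows = BIOGRAPHY.findall(sample)
for name, courtesy_name, origin in rows:
    print(name, courtesy_name, origin)

# The simpler English example: "the" followed by one word.
english = "Dorothy followed the road to the city"
the_phrases = re.findall(r"the \w+", english)
print(the_phrases)
```

Each parenthesized group becomes one extracted field, which is why checking “Extract groups” turns running text into tabular data.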

Related research

Posted in Digital Humanities, Video

Durham Institute for Data Science (IDAS) launch

It was a pleasure to take part in the Durham Institute for Data Science (IDAS) launch event.

The slides from my talk, Interactive text mining and visualization in the humanities, are available online.

Posted in Digital Humanities, Talks and conference papers

Old texts in a new world: Meaning production in the digital medium

Paper presented at Materiality of Knowledge in Chinese Thought: Past and Present, Yuelu Academy


Throughout history, technical innovations in the production and transmission of written materials have often had far-reaching long-term consequences for knowledge production – from the standardization of writing forms, to the development of dictionaries and encyclopedias, to the availability and spread of printing and copying technologies. In this paper, I focus on the ongoing impact of the most recent such development: digitization and increasing use of digital modes of interaction with premodern textual materials.

Since premodern Chinese documents first became available to scholars in digital form, the existence of digital texts has caused gradual but significant changes in mainstream scholarly workflows and expectations. Full-text repositories and digital libraries now make available in seconds to anyone on the planet premodern materials on a scale once impossible for anyone other than a determined emperor to obtain, while making similarly fantastic reductions in time and effort required to retrieve certain types of information. At the same time, even more dramatic changes have begun to take place as a consequence of digitization together with the ever-increasing sophistication and power of digital systems. Faced with larger volumes of material than any individual could ever expect to read – let alone claim detailed knowledge of – text mining and distant reading approaches offer the promise of gleaning useful information from exhaustive statistical analyses at scales not achievable through traditional means. Data-driven approaches – already well developed in other disciplines – similarly enable digital approaches to historical studies in which evidence can be systematically assembled at large enough scales to solidly ground statistical claims about broad historical and societal changes over time. This paper explores the development of these approaches, and the consequences for knowledge production in the digital age.

Posted in Chinese, Talks and conference papers

Chinese Text Project: a dynamic digital library of premodern Chinese

Paper published in Digital Scholarship in the Humanities


This article presents technical approaches and innovations in digital library design developed during the design and implementation of the Chinese Text Project, a widely used, large-scale full-text digital library of premodern Chinese writing. By leveraging a combination of domain-optimized Optical Character Recognition, a purpose-designed crowdsourcing system, and an Application Programming Interface (API), this project simultaneously provides a sustainable transcription system, search interface and reading environment, as well as an extensible platform for transcribing and working with premodern Chinese textual materials. By means of the API, intentionally loosely integrated text mining tools are used to extend the platform, while also being reusable independently with materials from other sources and in other languages.

Full text [preprint]
Version of record

Posted in Chinese, Digital Humanities

Digital Approaches to Text Reuse in the Early Chinese Corpus

Published in Journal of Chinese Literature and Culture 2018, 5(2) [Full paper]

Observed textual similarities between different pieces of writing are frequently cited by textual scholars as grounds for interpretative stances about the meaning of a passage and its authorship, authenticity, and accuracy. Historically, identifying occurrences of such similarities has been a matter of extensive knowledge and recall of the content and locations of passages contained within certain texts, together with painstaking manual comparison by examining printed copies, use of concordances, or more recently, appropriate use of full-text searchable database systems. The development of increasingly comprehensive and accurate digital corpora of early Chinese transmitted writing raises many opportunities to study these phenomena using more systematic digital techniques. These offer the promise of not only vast savings in time and labor but also new insights made possible only through exhaustive comparisons of types that would be entirely impractical without the use of computational methods.

This article investigates and contrasts unsupervised techniques for the identification of textual similarities in premodern Chinese works in general, and the classical corpus in particular, taking the text of the Mozi 墨子 as a concrete example. While specific examples are presented in detail to concretely demonstrate the utility and potential of the techniques discussed, all of the methods described are generally applicable to a wide range of materials. With this in mind, this article also introduces an open-access platform designed to help researchers quickly and easily explore these phenomena within those materials most relevant to their own work.

Posted in Chinese, Digital Humanities

Accessible Text Mining with Text Tools and the Chinese Text Project


  • Create a free ctext account and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if you have not, the link above will display a warning/reminder in red to do so).
  • Enter the API key “aas2019” (without quotes) in the box labeled “API key”, and click “Save”.
  • [Optional] Install the “Text Tools” plugin into your ctext account.

Some parts of the “Practical introduction to” and “Text Tools for” tutorials will be demonstrated – please refer to the tutorials for step-by-step instructions.

Direct link to Text Tools:

Other suggested examples

In addition to the examples shown in the tutorials:

  • Try comparing the aggregate vocabulary of two texts (e.g. the 墨子 and 呂氏春秋) using the “Vectors” tab. Click “Toggle values” to display the heatmap, and try inspecting some of the comparisons.
  • Try the “Run PCA” link with these or other texts.
  • Try creating vectors that model only a specifically selected subset of vocabulary use. To do this, start by entering multiple search terms in the Regex tool (one per line) – one example would be grammatical particles such as 而, 也, 以, 乎, 之, 矣, 亦. From the “Summary” tab, click “Create vectors”, and then from the output choose “Run PCA”.
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. 列傳第二十一 (ctp:ws281485). Example regex: (\w+),字(\w+),(\w+)人。
  • A few additional examples and instructions for using materials not written in classical Chinese are available on the SUTD workshop page.
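
The idea behind the vector exercises above can be sketched in a few lines of Python: each text becomes a list of counts, one per search term, and the lists are then compared. The function names `term_vector` and `cosine_similarity` are hypothetical, the two passages are short well-known openings typed in for illustration rather than fetched from ctext, and cosine similarity is used here as a simple stand-in; how Text Tools computes its own comparison values and PCA is not shown.

```python
import math
import re

# Grammatical particles from the exercise, each used as a one-term regex.
PARTICLES = ["而", "也", "以", "乎", "之", "矣", "亦"]


def term_vector(text, patterns):
    """Count non-overlapping matches of each regex in the text."""
    return [len(re.findall(p, text)) for p in patterns]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Illustrative passages only (assumed sample input).
analects = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
mencius = "王何必曰利?亦有仁義而已矣。"

vec_a = term_vector(analects, PARTICLES)
vec_b = term_vector(mencius, PARTICLES)
similarity = cosine_similarity(vec_a, vec_b)

print(dict(zip(PARTICLES, vec_a)))
print(dict(zip(PARTICLES, vec_b)))
print(round(similarity, 3))
```

Restricting the vectors to a chosen set of terms means the comparison reflects only that aspect of usage (here, particle frequency) rather than overall vocabulary. With passages this short the numbers are not meaningful; the approach becomes informative with full chapters or texts.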
Posted in Chinese, Digital Humanities, Talks and conference papers