Digital Approaches to Text Reuse in the Early Chinese Corpus

Published in Journal of Chinese Literature and Culture 2018, 5(2) [Full paper]

Observed textual similarities between different pieces of writing are frequently cited by textual scholars as grounds for interpretative stances about the meaning of a passage and its authorship, authenticity, and accuracy. Historically, identifying occurrences of such similarities has been a matter of extensive knowledge and recall of the content and locations of passages contained within certain texts, together with painstaking manual comparison by examining printed copies, use of concordances, or more recently, appropriate use of full-text searchable database systems. The development of increasingly comprehensive and accurate digital corpora of early Chinese transmitted writing raises many opportunities to study these phenomena using more systematic digital techniques. These offer the promise of not only vast savings in time and labor but also new insights made possible only through exhaustive comparisons of types that would be entirely impractical without the use of computational methods.

This article investigates and contrasts unsupervised techniques for the identification of textual similarities in premodern Chinese works in general, and the classical corpus in particular, taking the text of the Mozi 墨子 as a concrete example. While specific examples are presented in detail to concretely demonstrate the utility and potential of the techniques discussed, all of the methods described are generally applicable to a wide range of materials. With this in mind, this article also introduces an open-access platform designed to help researchers quickly and easily explore these phenomena within those materials most relevant to their own work.

Posted in Chinese, Digital Humanities | Comments Off

Accessible Text Mining with Text Tools and the Chinese Text Project


  • Create a free account on ctext.org and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if you have not, the page linked above will display a red warning/reminder to do so).
  • Enter the API key “aas2019” (without quotes) in the box labeled “API key”, and click “Save”.
  • [Optional] Install the “Text Tools” plugin into your ctext account.

Some parts of the “Practical introduction to…” and “Text Tools for…” tutorials will be demonstrated – please refer to the tutorials for step-by-step instructions.

Direct link to Text Tools:

Other suggested examples

In addition to the examples shown in the tutorials:

  • Try comparing the aggregate vocabulary of two texts (e.g. the 墨子 and 呂氏春秋) using the “Vectors” tab. Click “Toggle values” to display the heatmap, and try inspecting some of the comparisons.
  • Try the “Run PCA” link with these or other texts.
  • Try creating vectors that model only a specifically selected subset of vocabulary use. To do this, start by entering multiple search terms in the Regex tool (one per line) – one example would be grammatical particles such as 而, 也, 以, 乎, 之, 矣, 亦. From the “Summary” tab, click “Create vectors”, and then from the output choose “Run PCA”.
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. 列傳第二十一 (ctp:ws281485). Example regex: (\w+),字(\w+),(\w+)人。
  • A few additional examples and instructions for using materials not written in classical Chinese are available on the SUTD workshop page.
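 
The biography-extraction regex shown above works the same way in Python, where \w likewise matches CJK characters in Unicode mode. The sample line below is a typical 宋史-style biography opening (surname and given name, courtesy name after 字, then place of origin), used purely for illustration:

```python
import re

# Capture groups: (name), 字 (courtesy name), (place of origin) 人。
# Note the full-width comma , and period 。 as in the original text.
pattern = re.compile(r"(\w+),字(\w+),(\w+)人。")

sample = "歐陽修,字永叔,廬陵人。"  # illustrative biography opening
m = pattern.search(sample)
print(m.groups())  # ('歐陽修', '永叔', '廬陵')
```

The third `\w+` is greedy but backtracks before the literal 人, so the place name is captured without the trailing 人 character.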
Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Text Transformation API

Draft – This is a preliminary draft specification. Please note that some implementation details will change before publication. Last updated: 22 March 2019.


Transformations of textual data are important processes in many natural language processing and text analysis workflows. Examples include tokenization, lemmatization, and appending of part of speech tags, as well as many other (often language-specific) procedures. In this specification, a text transformation is any operation which takes as input a sequence of Unicode characters, and produces as output a sequence of Unicode characters. The Text Transformation API (TTA) defines a simple specification for how to negotiate, request, and deliver text transformations over HTTP.

A TTA server is a system which both: 1) publishes a TTA service manifest, and 2) provides or references at least one TTA transformation service endpoint.

Service manifest

A service manifest is a valid JSON file containing a list of transformation services. Each service is described using the following key-value pairs:

  • endpoint – The URL of the transformation service endpoint described by this entry.
  • languages – A list of ISO 639-1 language codes to which the endpoint is relevant or recommended.
  • title – A human-readable description of the service the endpoint provides.
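 
For illustration, a minimal manifest using these keys might look as follows (the URLs and titles here are invented for the example):

```json
[
  {
    "endpoint": "https://example.org/tta/tokenize-en",
    "languages": ["en"],
    "title": "English word tokenization"
  },
  {
    "endpoint": "https://example.org/tta/segment-zh",
    "languages": ["zh"],
    "title": "Chinese word segmentation"
  }
]
```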

Transformation service endpoint

A transformation endpoint is an HTTP or HTTPS URL which accepts a string of text sent to it via the HTTP POST method using the “application/x-www-form-urlencoded” content type. The content of the string must be supplied in the “data” parameter of the request in UTF-8 encoding.

The response to any valid request must be a JSON file containing exactly one of the following key-value pairs:

  • output – The contents of the “data” parameter transformed according to the service provided by the requested endpoint.
  • error – A string explaining why the request failed.

Transformation client

A transformation client is any software which: 1) requests TTA service manifests, specified by their URLs; 2) provides a user with a means of viewing the “title” descriptions of the endpoints from any conformant TTA manifest; and 3) provides a user with a means of transforming texts using any conformant endpoint.
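 
A minimal client-side sketch of the request/response cycle, using only the Python standard library (the endpoint URL in the usage comment is hypothetical):

```python
import json
import urllib.parse
import urllib.request

def encode_request(text: str) -> bytes:
    # Per the spec: a form-encoded body with the text supplied,
    # UTF-8 encoded, in the "data" parameter.
    return urllib.parse.urlencode({"data": text}).encode("utf-8")

def parse_response(raw: bytes) -> str:
    # A valid response contains exactly one of "output" or "error".
    result = json.loads(raw.decode("utf-8"))
    if "error" in result:
        raise RuntimeError(result["error"])
    return result["output"]

def transform(endpoint_url: str, text: str) -> str:
    req = urllib.request.Request(
        endpoint_url,
        data=encode_request(text),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(req) as resp:
        return parse_response(resp.read())

# Usage (hypothetical endpoint):
# print(transform("https://example.org/tta/tokenize-en", "some text"))
```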


A non-normative example of a TTA service manifest (containing references to example TTA service endpoints) is:

A non-normative example of a TTA client is accessible here.

Posted in Digital Humanities | Comments Off

SUTD Workshop

Materials from a workshop held as part of Working with different kinds of ‘text’ in the Digital Humanities at the Singapore University of Technology and Design.


We will follow parts of the “Practical introduction to…” and “Text Tools for…” tutorials with a few changes to use English language texts as well as Chinese ones, and a few new features of the beta version not yet included in the tutorials.

Link to Text Tools (beta version):

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
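 
The effect of this setting can be seen in a few lines of Python: character tokenization splits Chinese text into single characters, while English text is instead split on whitespace:

```python
# "Tokenize by character" (for Chinese) versus whitespace
# tokenization (for English), illustrated directly:
chinese = "天地玄黃"
english = "in the beginning"
print(list(chinese))    # ['天', '地', '玄', '黃']
print(english.split())  # ['in', 'the', 'beginning']
```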

Other suggested examples

As well as the examples shown in the tutorials:

  • Try some regexes on the English examples. A useful expression is likely to be “\w+” – any sequence of “word” characters, i.e. letters and digits (intuitively, a word). Try as an example “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
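 
The “the \w+” expression above behaves the same way in Python; lowercasing first ensures that “The” and “the” are matched uniformly (the sample sentence is invented):

```python
import re

# findall returns non-overlapping matches, scanning left to right.
text = "The Scarecrow watched the Woodman while the Lion slept."
matches = re.findall(r"the \w+", text.lower())
print(matches)  # ['the scarecrow', 'the woodman', 'the lion']
```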
Posted in Chinese, Digital Humanities | Comments Off

Large-scale Optical Character Recognition of Pre-modern Chinese Texts

This paper appears in International Journal of Buddhist Thought and Culture 28(2) (December 2018). [Full paper]


Optical character recognition (OCR) – the fully automated transcription of text appearing in a digitized image – offers transformative opportunities for the scholarly study of written materials produced prior to the digital age. Digitization, in the sense of photographic reproduction, is a largely straightforward, mechanical process, and one with significant value in its own right for purposes of preservation as well as access to rare materials. As a result, hundreds of millions of pages of pre-modern Chinese works have been digitized by libraries and academic institutions around the world – a significant portion of which is increasingly being made freely available online.

To make use of this material efficiently, transcriptions of the textual content of these images are needed. Given the enormous volume of image data in existence – and its continual production as digitization continues – this task is only feasible if it can be fully automated: performed by software without manual intervention. Individually, reliable transcriptions produced by OCR offer enormous time savings to researchers, making it possible to efficiently navigate materials in ways not possible without digital transcription. In aggregate, however, these transcriptions make possible entirely new ways of exploring historical materials – making it possible to rapidly identify material that one suspects may exist somewhere, without knowing in advance where that might actually be. It is also a prerequisite to virtually any type of statistical analysis of these materials – the potential utility of which continues to increase as a larger and larger proportion of the extant corpus is transcribed.

This paper introduces a procedure for OCR of pre-modern Chinese written materials, both printed and handwritten, describing the complete process from digitized image through to automated transcription and manual correction of remaining errors, with particular attention to issues arising in this domain. The process described has been applied to over 25 million pages of pre-modern Chinese works, and the paper also introduces the Chinese Text Project platform, used both to make these results available to scholars and to provide a distributed, crowdsourced mechanism facilitating manual correction at scale as well as further analysis of these materials.

Illustrated in the paper (figures): noise removal, character pitch identification, and seal isolation.

Posted in Chinese, Digital Humanities | Comments Off

EASTD 135: Text and Data in the Humanities

This course introduces students to key concepts and techniques fundamental to applying digital methods to the study of textual materials and other types of data in humanities subjects. The core topics covered are digital representations of data, ways of structuring and managing data, extracting data from textual materials, and data visualization and analysis. Concepts introduced in lecture sessions will be reinforced and applied concretely in particular contexts during corresponding practical sessions and take-home assignments.

No background in digital methods is assumed; however, students are expected to have basic computing skills and access to a suitable laptop. Examples will be selected from a variety of subject domains within the humanities, with the primary focus being on textual materials.


Week 1 (Jan 28, 30) – Introduction and motivation

  • Data and digital techniques in the humanities
  • Examples of data-driven approaches in humanities scholarship

Week 2 (Feb 4, 6) – Representation I

  • Fundamentals of digital representation of information
  • Basic types of data and their digital representations

Week 3 (Feb 11, 13) – Data and ontologies I

Week 4 (Feb 20) – Representation II

  • Research data management

Week 5 (Feb 25, 27) – Data and ontologies II

  • Databases and structured data

Week 6 (Mar 4, 6) – Data and ontologies III

  • Linked Open Data in the humanities

Week 7 (Mar 11, 13) – From text to data I

Week 8 (Mar 18, 20) – Visualization I

  • Charts and diagrams

Week 9 (Mar 25, 27) – Visualization II

  • Graphs, maps, and trees

Week 10 (Apr 1, 3) – From text to data II

  • Topic modeling

Week 11 (Apr 8, 10) – From text to data III

  • Part of speech tagging and parsing of natural languages

Week 12 (Apr 15, 17) – From text to data IV

  • Markup and annotation systems

Week 13 (Apr 22, 24) – Review

  • Review and discussion of project work

Week 14 (Apr 29, May 1) – Project presentations

  • Student projects presented in class

Posted in Courses, Digital Humanities | Comments Off

Networks of Text Reuse in Early Chinese Literature

Poster presented at Connected Past 2018.


The phenomenon of text reuse – syntactically and semantically similar fragments of text repeated apparently independently in multiple pieces of writing, and often in works purporting to be composed by entirely different authors – is extremely widespread in early Chinese literature. Such reuse is typically unattributed, and its existence is often revealed only through painstaking comparison with other pieces of potentially related writing. Computational methods have for the first time made feasible the comprehensive identification of such reuse throughout large corpora of material, and have thus made practical studies based on patterns of reuse which emerge at much larger scales than had previously been possible to consider.

This work uses network analysis to investigate patterns of text reuse in the early Chinese corpus and the relationship between these patterns and difficult questions of authorship attribution within these texts. Using detailed data on individual instances of text reuse created through an exhaustive automated study of the entire transmitted corpus of Chinese from the earliest transmitted works through to those dating prior to the end of the Han dynasty (220 AD), this study demonstrates the utility of network visualization and analysis in identifying and exploring patterns of text reuse which shed light on the authorship of these early materials.
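 
To illustrate the general idea of automated reuse detection (though not the specific method used in this study), comparing shared character n-grams is one simple approach. The two short strings below are constructed for illustration only:

```python
# Toy sketch: shared character n-grams between two passages as a
# crude signal of text reuse.
def ngrams(text, n=4):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a = "子曰學而時習之不亦說乎"
b = "學而時習之可謂好學也已"
shared = ngrams(a) & ngrams(b)
print(sorted(shared))  # ['學而時習', '而時習之']
```

Real studies of reuse at corpus scale require considerably more sophistication (handling variant characters, transpositions, and scoring), but the underlying comparison is of this general kind.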

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Accessible digital text analysis for classical Chinese

Paper presented at Future Philologies: Digital Directions in Ancient World Text, Institute for the Study of the Ancient World, New York University, April 20 2018.


Despite a growing interest in digital humanities as a field of study and focus of specialization, significant barriers to the adoption of digital techniques remain within research and teaching in practice in many humanities disciplines. While an increasing number of humanities scholars have demonstrated willingness to invest time and effort in cultivating necessary technical skills, in practice many more are prevented from experimenting with digital methods due to perceived high barriers to entry. One approach to accelerate the adoption of digital techniques is to attempt to reduce the prerequisite technical skills required to apply techniques to research data in practice through the creation of platforms and tools able to bridge technical gaps for some of the most powerful and generally applicable use cases.

With this goal in mind, this talk introduces a suite of browser-based text analysis tools designed for pre-modern Chinese materials and intended to integrate easily into scholarly workflows, including in particular those common in Chinese literature, philosophy, and history departments. Major goals include accessibility of the tools themselves, as well as transparency of their operation and the ability to inspect the mechanisms underlying the results and visualizations produced. By enabling rapid exploration of arbitrarily chosen textual materials while also providing insight into the algorithms used, these tools have pedagogical applications in addition to research uses, and are already in use for teaching at several institutions.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Cyberinfrastructure for historical China studies

It was a pleasure to host on behalf of the Chinese Text Project, together with Professor Peter Bol on behalf of the China Biographical Database (CBDB), the International conference on cyberinfrastructure for historical China studies, held at the Harvard China Center in Shanghai, March 14-16, 2018.

The full program and additional information are available online; the slides I used during the conference are also available for download below:


中国哲学书电子化计划 / Chinese Text Project
[English] [Chinese]

文本分析工具 / Text Tools
[English] [Chinese]

线上数据库的可扩展数据分享 / Scalable data sharing for online database systems
[English] [Chinese]

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off





  • Create an account: in the left-hand column, scroll down and click “Log in”, then enter your details in the “If you do not yet have an account” form and click “Create account”.
  • Check that the necessary fonts are installed on your computer: click “About the site” in the top left, then “Font test page”.


  • Use the “Title search” function in the left-hand column.
  • In the search results, the “” icon indicates that the text links directly to the corresponding scanned facsimile material.
  • You may also see the following icons in search results:
  • Exercises:
    • Find the full electronic text of the 资暇集.
    • Find a classic from the pre-Qin and Han periods (e.g. the 庄子 or 荀子) in the original texts database.


  • First locate and open the transcribed text (chapter or juan) you wish to search, then click in the “Search” box near the bottom of the left-hand column.
  • Exercises:
    • Find the passage in the 论语 containing Confucius's saying “君子不器”.
    • Find all passages in the 庄子 that mention “道”.
  • When a full-text search of the textual database returns multiple results, you can click the “Show statistics” link near the top right of the page to open an interactive summary of the results.


  • On ctext, scanned facsimiles can be searched through links from transcribed editions. When a transcription is linked to a facsimile, the “” icon is shown in title search results.
  • When a text is linked to a scan, click the “” icon next to any paragraph of the transcription on the left to open the corresponding scanned page.
  • To search a scanned text for a particular word or phrase, search for it in the transcription, then click the “” icon next to a result on the left.
  • Errors in the transcription (especially in OCR-derived transcriptions) mean that the longer the phrase, the more likely it is to mismatch. If this happens, try searching for a shorter phrase, or for words occurring near the text you are looking for.
  • Exercises:
    • Find a text linked to a scanned edition, search it, and view the results in the scanned version.
    • Repeat with an OCR-derived transcription.
    • You can also locate scanned texts from the “Library”; searching these gives exactly the same results as searching through the linked transcription.
    • Alternatively, you can use the link to read each page of the scan.



  • Find a phrase in a text and click the “” icon to open a summary of similar passages.
  • In the results column, click the “” icon next to a title to display each result in its context.
  • Exercises:
    • Find passages similar to the “庖丁解牛” story in the 庄子.



  • Click the “Advanced search” link at the bottom of the left-hand column.
  • In the “1. Search scope” section, select the first category, then the text or part of a text you want to search. For example, to search within the 庄子, you could select “Pre-Qin and Han”, then “Daoism”, then “庄子” (leaving “[All]” in the fourth box).
  • In the “3. Search criteria” section, check the box under “Similar passage search”, then set the scope of the text to compare against in the same way (the scope can likewise be a category of texts, an entire text, or part of a text).
  • Click “Search”. The results display all texts containing similar passages.
  • Exercises:
    • Find all similar passages between the 论语 and texts in the “Daoism” category.
    • Once you have results, click the “Show statistics” link.
    • Run the same search again, this time with the “Search scope” and “Search criteria” reversed, then use “Show statistics” again.



  • First open the text's main page. On the right-hand side of the page there is a search box for each concordance numbering system supported for the text.
  • Exercises:
    • In this paper by Eric Hutton, the author cites passages using ICS series and Harvard-Yenching series concordance numbers rather than quoting the Chinese text directly, for example:




  • Click the “” icon to the left of a passage for which concordance data is available.
  • Moving the mouse over the passage displays all concordance numbers corresponding to the position of the mouse.
  • To display the concordance numbers for one specific part of a passage, click and drag the mouse over the text, which is highlighted in green. All concordance numbers intersecting the highlighted text are then displayed.
  • Exercises:
    • Following on from the exercise above, find the concordance numbers corresponding to the passage “人之性恶,其善者伪也。” in the 荀子.



  • Normally, when viewing a text with a translation, each (possibly long) Chinese paragraph is followed by the corresponding English paragraph. To display text and translation in closer alignment (usually sentence by sentence), click the “” icon to the left of a phrase. You can also move the mouse pointer over the Chinese text to display dictionary information.
  • For longer passages, you can first search for a Chinese sentence, then click the “” icon as above to jump directly to the translation of that specific piece of Chinese text.
  • Exercises:
    • Experiment with the text of the 庄子.
    • Use this feature to see how James Legge translates “每至于族,吾见其难为,怵然为戒,视为止,行为迟” within the same text.



  • Click the “” icon to the left of the text. Note that commentaries are themselves independent texts, so you can follow links within a displayed commentary to jump to the commentary text itself.
  • Exercises:
    • Experiment with the 论语, 孟子, 墨子, or 道德经.


  • Open the “Dictionary” section of the site.
  • Depending on the character you want to look up, you can choose between:
    • Direct input (typing)
    • Component lookup: see the brief instructions on the dictionary main page.
    • Radical lookup: first select the radical, then browse by number of additional strokes. You can enlarge the characters displayed by clicking the “n strokes” label.
  • Exercises:
    • 䊫, 𥼺, 𧤴, …: look these characters up in the ctext.org dictionary (use the methods described above; do not copy and paste them directly from this page).
  • Tip: if neither component is easy to type, you can instead enter any other character containing the component; by decomposing that character you can locate the component, and from it find other characters containing that particular component.
  • Some texts on ctext contain characters which do not exist in Unicode. At present these can only be searched for by component.
    • You can copy and paste non-Unicode characters for use within ctext. When such a character is copied, it becomes a “ctext:nnnn” identifier (for example, ctext:1591). Pasting it into other software (such as Word) pastes the identifier rather than the character or an image of it.
    • However, within ctext you can click to the right of the glyph and choose “Copy image” to copy an image of the character, which can then be pasted into a Word document (for example, alongside a reference to the “ctext:nnnn” identifier or a link to the site).
    • Examples: ctext:4543 ctext:8668 ctext:3000 ctext:335



  • Find a scanned page containing a transcription error.
  • Click the “Simple edit mode” link.
  • The system displays input boxes in which the text can be edited directly. In general, each line of text corresponds to one column of text in the scanned image.
  • Carefully correct the text so that it matches the scan, then click “Save changes”.
  • If you need to enter a space, be sure to use a full-width Chinese space rather than a half-width English one.
  • Exercises:
    • Choose a text created by OCR and correct its errors.
  • Keeping a detailed record of every change made to a text, together with a way of reverting to earlier versions, is fundamental to any wiki system – so-called “version control”. After saving your changes:
    • Click the “Transcription” link to open the full text you edited.
    • Scroll up and click “View history” to display the recent edit history; your most recent change appears at the top.
    • Each row represents the state of the text after one change. You can compare the state of the text at any two points in time by selecting two of the radio buttons on the left of the table and clicking “Show differences”. The default selection compares the current state of the text with the version preceding the most recent change, so clicking “Show differences” visualizes the change you just made.
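 
The “Show differences” comparison is the familiar diff operation between two saved versions of a text. A minimal sketch of the same idea, using Python's difflib with an invented one-character correction:

```python
import difflib

# Two versions of a short text; one character has been corrected
# (traditional 黃 replaced by simplified 黄 in the first line).
before = ["天地玄黃", "宇宙洪荒"]
after = ["天地玄黄", "宇宙洪荒"]
diff = list(difflib.unified_diff(before, after, lineterm=""))
print("\n".join(diff))
```

Lines prefixed with “-” belong only to the older version and lines prefixed with “+” only to the newer one, just as in the wiki's difference view.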



  • Open “About the site” > “Tools” > “Plugins”.
  • Find the plugin you want to install and click “Install”.
  • Click “Install” again on the confirmation page.

After installing a plugin, whenever you open a supported type of object on ctext.org (for example, “book” and “chapter” plugins correspond to textual chapters, while “character” and “word” plugins correspond to Chinese characters or words in the dictionary), the corresponding link is displayed in a bar near the top of the screen.

  • Install the “Full-text export” plugin and use it to export the contents of one chapter of a text.
  • Install the “Frequencies” plugin and use it to examine the frequencies of the characters appearing in a chapter of a text.
  • Install any plugin of the “character” or “word” type, look up a character in the dictionary, and use the plugin to jump to an external dictionary.






If a new plugin you have created is not accepted by the ctext interface, you can use the W3C Markup Validator to check whether your plugin file is valid. A valid plugin displays “This document was successfully checked as CTPPlugin!” in green, on a page that looks like this.



Programmatic access is provided through the ctext.org Application Programming Interface (API), which can be used from any programming language or environment capable of sending HTTP requests. Python is particularly recommended, because the API can be used through the existing ctext Python module, reducing the development time required. In addition to the API documentation, you can also consult the list of API functions, which includes concrete examples.
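 
As a sketch of what direct HTTP access looks like without the ctext module, the helper below builds and issues API calls using only the standard library. The “gettext” function name and “urn” parameter follow the API documentation; check the API function list for the calls and parameters currently available:

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.ctext.org/"

def api_url(function: str, **params: str) -> str:
    # Build the URL for an API call; function names and parameters
    # are documented in the list of API functions.
    return API_BASE + function + "?" + urllib.parse.urlencode(params)

def api_call(function: str, **params: str) -> dict:
    # Issue the request and decode the JSON response.
    with urllib.request.urlopen(api_url(function, **params)) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Example (requires network access; URN shown for illustration):
# print(api_call("gettext", urn="ctp:analects"))
```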

Creative Commons License
Posted in Uncategorized | Comments Off