Digitizing Premodern Text with the Chinese Text Project

Paper published in Journal of Chinese History

Abstract

The widespread availability of digitized premodern textual sources – together with increasingly sophisticated means for their manipulation – has brought enormous practical benefits to scholars whose work relies upon reference to their contents. While great progress has been made with the construction of ever more comprehensive database systems and archives, far more remains not only possible but also realistically achievable in the near future. This paper discusses some of the key challenges faced, and progress made towards solving them, in the context of a widely used open digital platform attempting to expand the range of digitized sources available while simultaneously increasing the scope of meaningful tasks that can be performed with them computationally. This paper aims to suggest how seemingly simple human-mediated additions to the digitized historical record – when combined with the power of digital systems to repeatedly perform mechanical tasks at enormous scales – quickly lead to transformative changes in the feasible scope of computational analysis of premodern writing.

Full text through publisher site (PDF, paywall) / Free online version (full text but no PDF)

Part of a JCH Special Issue on Digital Humanities.

Posted in Chinese, Digital Humanities | Comments Off

MARAAS workshop

Materials from a workshop held as part of the MARAAS Conference: Asian Studies in the Digital Age at Dickinson College, Carlisle, PA. [Download slides]

Setup

  • Create a free account on ctext.org and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if not, the link above will display a warning/reminder in red to do so).
  • Enter the API key in the box labeled “API key”, and click “Save”.

We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials with a few changes and a few new features not yet included in the tutorials.

Link to Text Tools: http://ctext.org/plugins/texttools/#help

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
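
The difference between the two settings can be sketched in Python. This is an illustration only, not how Text Tools itself is implemented: the function name `tokenize` is hypothetical, and the real tool may treat punctuation differently.

```python
import re

def tokenize(text: str, by_character: bool) -> list[str]:
    """Split text into tokens, one per character or one per word."""
    if by_character:
        # One token per word character; punctuation and whitespace are dropped.
        return re.findall(r"\w", text)
    # One token per run of word characters (roughly, a word).
    return re.findall(r"\w+", text)

chinese = "學而時習之,不亦說乎?"
english = "The Wizard of Oz"

print(tokenize(chinese, by_character=True))
print(tokenize(english, by_character=False))
```

With the setting "Off", the Chinese sample would be split only at punctuation, which is why character tokenization is the more useful default for unsegmented Chinese text.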

Overviews of functionality

The following give some basic illustrations of what can be done in Text Tools through concrete examples:

Other suggested examples

As well as the examples shown in the tutorials:

  • To see how the tool works with tokenized materials, download the following English text files (e.g. right-click each link and choose “Save as”):
  • Try some regexes on the English examples. A useful expression is likely to be “\w+” – any sequence of characters that are neither punctuation nor whitespace (intuitively, a word). Try as an example “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
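
Outside Text Tools, the same kind of group extraction can be tried with Python's `re` module. The sketch below uses a variant of the example regex that accepts either ASCII or full-width commas, since the punctuation in a given transcription may differ. The sample string is hand-typed in the format the regex targets rather than fetched from ctext.org, and the function name `extract_biographies` is hypothetical.

```python
import re

# Variant of the example regex that accepts ASCII or full-width commas.
PATTERN = re.compile(r"(\w+)[,,]字(\w+)[,,](\w+)人。")

# Hand-typed sample in the format the regex targets (assumption: the
# punctuation in the online transcription may differ).
sample = "蘇軾,字子瞻,眉州眉山人。歐陽修,字永叔,廬陵人。"

def extract_biographies(text: str) -> list[tuple[str, ...]]:
    """Return (name, courtesy name, place of origin) for each match."""
    return [match.groups() for match in PATTERN.finditer(text)]

for name, courtesy_name, origin in extract_biographies(sample):
    print(name, courtesy_name, origin)
```

Each pair of parentheses in the regex becomes one column of output, which corresponds to what the “Extract groups” option does in the regex tool.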

Related research

Posted in Digital Humanities, Video | Comments Off

Durham Institute for Data Science (IDAS) launch

It was a pleasure to take part in the Durham Institute for Data Science (IDAS) launch event.

The slides from my talk, Interactive text mining and visualization in the humanities, are available online.

Posted in Digital Humanities, Talks and conference papers | Comments Off

Old texts in a new world: Meaning production in the digital medium

Paper presented at Materiality of Knowledge in Chinese Thought: Past and Present, Yuelu Academy

Abstract

Throughout history, technical innovations in the production and transmission of written materials have often had far-reaching long-term consequences for knowledge production – from the standardization of writing forms, to the development of dictionaries and encyclopedias, to the availability and spread of printing and copying technologies. In this paper, I focus on the ongoing impact of the most recent such development: digitization and increasing use of digital modes of interaction with premodern textual materials.

Since premodern Chinese documents first became available to scholars in digital form, the existence of digital texts has caused gradual but significant changes in mainstream scholarly workflows and expectations. Full-text repositories and digital libraries now make available in seconds to anyone on the planet premodern materials on a scale once impossible for anyone other than a determined emperor to obtain, while making similarly fantastic reductions in time and effort required to retrieve certain types of information. At the same time, even more dramatic changes have begun to take place as a consequence of digitization together with the ever-increasing sophistication and power of digital systems. Faced with larger volumes of material than any individual could ever expect to read – let alone claim detailed knowledge of – text mining and distant reading approaches offer the promise of gleaning useful information from exhaustive statistical analyses at scales not achievable through traditional means. Data-driven approaches – already well developed in other disciplines – similarly enable digital approaches to historical studies in which evidence can be systematically assembled at large enough scales to solidly ground statistical claims about broad historical and societal changes over time. This paper explores the development of these approaches, and the consequences for knowledge production in the digital age.

Posted in Chinese, Talks and conference papers | Comments Off

Chinese Text Project: a dynamic digital library of premodern Chinese

Paper published in Digital Scholarship in the Humanities

Abstract

This article presents technical approaches and innovations in digital library design developed during the design and implementation of the Chinese Text Project, a widely-used, large-scale full-text digital library of premodern Chinese writing. By leveraging a combination of domain-optimized Optical Character Recognition, a purpose-designed crowdsourcing system, and an Application Programming Interface (API), this project simultaneously provides a sustainable transcription system, search interface and reading environment, as well as an extensible platform for transcribing and working with premodern Chinese textual materials. By means of the API, intentionally loosely integrated text mining tools are used to extend the platform, while also being reusable independently with materials from other sources and in other languages.

Full text [preprint]
Version of record

Posted in Chinese, Digital Humanities | Comments Off

Digital Approaches to Text Reuse in the Early Chinese Corpus

Published in Journal of Chinese Literature and Culture 2018, 5(2) [Full paper]

Observed textual similarities between different pieces of writing are frequently cited by textual scholars as grounds for interpretative stances about the meaning of a passage and its authorship, authenticity, and accuracy. Historically, identifying occurrences of such similarities has been a matter of extensive knowledge and recall of the content and locations of passages contained within certain texts, together with painstaking manual comparison by examining printed copies, use of concordances, or more recently, appropriate use of full-text searchable database systems. The development of increasingly comprehensive and accurate digital corpora of early Chinese transmitted writing raises many opportunities to study these phenomena using more systematic digital techniques. These offer the promise of not only vast savings in time and labor but also new insights made possible only through exhaustive comparisons of types that would be entirely impractical without the use of computational methods.

This article investigates and contrasts unsupervised techniques for the identification of textual similarities in premodern Chinese works in general, and the classical corpus in particular, taking the text of the Mozi 墨子 as a concrete example. While specific examples are presented in detail to concretely demonstrate the utility and potential of the techniques discussed, all of the methods described are generally applicable to a wide range of materials. With this in mind, this article also introduces an open-access platform designed to help researchers quickly and easily explore these phenomena within those materials most relevant to their own work.

Posted in Chinese, Digital Humanities | Comments Off

Accessible Text Mining with Text Tools and the Chinese Text Project

Setup

  • Create a free account on ctext.org and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if not, the link above will display a warning/reminder in red to do so).
  • Enter the API key “aas2019” (without quotes) in the box labeled “API key”, and click “Save”.
  • [Optional] Install the “Text Tools” plugin into your ctext account.

Some parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” will be demonstrated – please refer to the tutorials for step-by-step instructions.

Direct link to Text Tools: http://ctext.org/plugins/texttools/#help

Other suggested examples

In addition to the examples shown in the tutorials:

  • Try comparing the aggregate vocabulary of two texts (e.g. the 墨子 and 呂氏春秋) using the “Vectors” tab. Click “Toggle values” to display the heatmap, and try inspecting some of the comparisons.
  • Try the “Run PCA” link with these or other texts.
  • Try creating vectors that model only a specifically selected subset of vocabulary use. To do this, start by entering multiple search terms in the Regex tool (one per line) – one example would be grammatical particles such as 而, 也, 以, 乎, 之, 矣, 亦. From the “Summary” tab, click “Create vectors”, and then from the output choose “Run PCA”.
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. 列傳第二十一 (ctp:ws281485). Example regex: (\w+),字(\w+),(\w+)人。
  • A few additional examples and instructions for using materials not written in classical Chinese are available on the SUTD workshop page.
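
The idea of a vector restricted to selected vocabulary can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the normalization (count divided by text length) may not match what Text Tools uses, the two short strings merely stand in for full texts, and cosine similarity is substituted here as one simple way to compare two vectors; PCA itself is not shown.

```python
import math

PARTICLES = ["而", "也", "以", "乎", "之", "矣", "亦"]

def particle_vector(text: str) -> list[float]:
    """Relative frequency of each particle (count / total characters)."""
    total = len(text) or 1  # avoid division by zero on empty input
    return [text.count(p) / total for p in PARTICLES]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Short sample strings standing in for full texts (assumption).
text_a = "學而時習之不亦說乎"
text_b = "知之為知之不知為不知是知也"

vec_a = particle_vector(text_a)
vec_b = particle_vector(text_b)
print(vec_a)
print(vec_b)
print(cosine_similarity(vec_a, vec_b))
```

With real texts, each chapter or document would contribute one such vector, and a dimensionality reduction such as PCA would then be applied to the whole set.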

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Text Transformation API

Draft – This is a preliminary draft specification. Please note that some implementation details will change before publication. Last updated: 22 March 2019.

Overview

Transformations of textual data are important processes in many natural language processing and text analysis workflows. Examples include tokenization, lemmatization, and appending of part of speech tags, as well as many other (often language-specific) procedures. In this specification, a text transformation is any operation which takes as input a sequence of Unicode characters, and produces as output a sequence of Unicode characters. The Text Transformation API (TTA) defines a simple specification for how to negotiate, request, and deliver text transformations over HTTP.
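
Under this definition, a transformation can be modeled as a function from string to string. The following Python sketch illustrates that reading; the names `Transformation`, `compose`, `lowercase`, and `collapse_whitespace` are hypothetical and are not part of the specification.

```python
from typing import Callable

# A text transformation in the sense above: Unicode string in, Unicode string out.
Transformation = Callable[[str], str]

def lowercase(text: str) -> str:
    return text.lower()

def collapse_whitespace(text: str) -> str:
    # Replace every run of whitespace with a single space and trim the ends.
    return " ".join(text.split())

def compose(*steps: Transformation) -> Transformation:
    """Chain transformations left to right into a single transformation."""
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run

normalize = compose(lowercase, collapse_whitespace)
print(normalize("  The   Wizard\nof Oz "))
```

Because input and output have the same type, transformations chain freely, which is what makes it practical to offer them as interchangeable services.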

A TTA server is a system which both: 1) publishes a TTA service manifest, and 2) provides or references at least one TTA transformation service endpoint.

Service manifest

A service manifest is a valid JSON file containing a list of transformation services. Each service is described using the following key-value pairs:

Key          Value
endpoint     The URL of the transformation service endpoint described by this entry.
languages    A list of ISO 639-1 language codes to which the endpoint is relevant or recommended.
title        A human-readable description of the service the endpoint describes.
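
As an illustration, a manifest with two services might look like the string in the sketch below, which also checks that each entry carries the three keys. Several things here are assumptions rather than part of the draft: that the top level of the file is a JSON array, the example.org URLs and titles, and the function name `parse_manifest`.

```python
import json

REQUIRED_KEYS = {"endpoint", "languages", "title"}

# Hypothetical manifest; the URLs and titles are made up for illustration.
EXAMPLE_MANIFEST = """
[
  {"endpoint": "https://example.org/tta/lowercase",
   "languages": ["en"],
   "title": "Convert text to lowercase"},
  {"endpoint": "https://example.org/tta/tokenize-zh",
   "languages": ["zh"],
   "title": "Tokenize Chinese text"}
]
"""

def parse_manifest(manifest_json: str) -> list[dict]:
    """Parse a manifest and check each service has the three described keys."""
    services = json.loads(manifest_json)
    if not isinstance(services, list):
        raise ValueError("expected a JSON array of services")
    for service in services:
        if not isinstance(service, dict):
            raise ValueError("each service must be a JSON object")
        missing = REQUIRED_KEYS - set(service)
        if missing:
            raise ValueError(f"service is missing keys: {sorted(missing)}")
    return services

for service in parse_manifest(EXAMPLE_MANIFEST):
    print(service["title"], service["languages"])
```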

Transformation service endpoint

A transformation endpoint is an HTTP or HTTPS URL which accepts a string of text sent to it via the HTTP POST method using the “application/x-www-form-urlencoded” content type. The content of the string must be supplied in the “data” parameter of the request in UTF-8 encoding.
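
A request of this shape can be constructed with Python's standard library, as in the sketch below. The request is built but not sent; the endpoint URL is a made-up placeholder and the function name `build_request` is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_request(endpoint: str, text: str) -> Request:
    """Build (but do not send) a POST request carrying text in "data"."""
    # urlencode percent-encodes str values as UTF-8 by default.
    body = urlencode({"data": text}).encode("ascii")
    return Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )

# Hypothetical endpoint URL, for illustration only.
request = build_request("https://example.org/tta/lowercase", "墨子 Mozi")
print(request.get_method(), request.data)
```

Passing the resulting object to `urllib.request.urlopen` would send it to a live endpoint.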

The response to any valid request must be a JSON file containing exactly one of the following key-value pairs:

Key       Value
output    The contents of the “data” parameter transformed according to the service provided by the requested endpoint.
error     A string explaining why the request failed.
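
On the server side, the core of an endpoint can be sketched as a function from the raw request body to a JSON response. Lowercasing is used purely as an example transformation. The function name `handle_post_body` is hypothetical, and answering a request that lacks “data” with an “error” response is an assumption, since the draft does not say how such requests should be handled.

```python
import json
from urllib.parse import parse_qs

def handle_post_body(body: bytes) -> str:
    """Turn a form-encoded request body into a TTA-style JSON response.

    The transformation here is lowercasing, chosen only as an example.
    """
    try:
        form = parse_qs(body.decode("utf-8"), keep_blank_values=True)
    except UnicodeDecodeError:
        return json.dumps({"error": "Request body is not valid UTF-8."})
    if "data" not in form:
        return json.dumps({"error": 'Missing "data" parameter.'})
    text = form["data"][0]
    # Exactly one key is returned: "output" on success, "error" on failure.
    return json.dumps({"output": text.lower()}, ensure_ascii=False)

print(handle_post_body(b"data=Hello+%E5%A2%A8%E5%AD%90"))
print(handle_post_body(b"text=abc"))
```

A real service would wrap this function in an HTTP handler that reads the POST body and writes the returned string with a JSON content type.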

Transformation client

A transformation client is any software which: 1) requests TTA service manifests, specified by their URL; 2) provides a user with a means of viewing the “title” descriptions of the endpoints from any conformant TTA manifest; and 3) provides a user with a means of transforming texts using any conformant endpoint.
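
The client-side logic can be sketched as two small helpers: one that selects services for a language so their titles can be shown to the user, and one that interprets an endpoint's response. The function names, the example.org URLs, and the decision to treat a response containing both keys or neither as invalid are assumptions made for this sketch.

```python
import json

def services_for_language(manifest: list[dict], language: str) -> list[dict]:
    """Select the services whose "languages" list includes the given code."""
    return [s for s in manifest if language in s.get("languages", [])]

def read_response(response_json: str) -> str:
    """Return the transformed text, or raise if the endpoint reported an error."""
    response = json.loads(response_json)
    has_output = "output" in response
    has_error = "error" in response
    if has_output == has_error:
        raise ValueError('expected exactly one of "output" or "error"')
    if has_error:
        raise RuntimeError(response["error"])
    return response["output"]

# Hypothetical, already-parsed manifest.
manifest = [
    {"endpoint": "https://example.org/tta/lowercase",
     "languages": ["en"], "title": "Convert text to lowercase"},
    {"endpoint": "https://example.org/tta/tokenize-zh",
     "languages": ["zh"], "title": "Tokenize Chinese text"},
]

for service in services_for_language(manifest, "zh"):
    print(service["title"])
print(read_response('{"output": "the wizard of oz"}'))
```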

Examples

A non-normative example of a TTA service manifest (containing references to example TTA service endpoints) is: https://txt.ctext.org/services.pl

A non-normative example of a TTA client is accessible here.

Posted in Digital Humanities | Comments Off

SUTD Workshop

Materials from a workshop held as part of Working with different kinds of ‘text’ in the Digital Humanities at the Singapore University of Technology and Design.

Setup

We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials with a few changes to use English language texts as well as Chinese ones, and a few new features of the beta version not yet included in the tutorials.

Link to Text Tools (beta version): http://ctext.org/plugins/texttoolsbeta/#help

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.

Other suggested examples

As well as the examples shown in the tutorials:

  • Try some regexes on the English examples. A useful expression is likely to be “\w+” – any sequence of characters that are neither punctuation nor whitespace (intuitively, a word). Try as an example “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
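
The effect of lowercasing before running a regex such as “the \w+” can be seen in a short Python sketch. The sentence is a placeholder written for this example, not taken from the workshop files, and the function name `count_matches` is hypothetical.

```python
import re
from collections import Counter

# Placeholder sentence (assumption), not taken from the workshop files.
text = "The Scarecrow and the Tin Woodman followed the road. The road was long."

def count_matches(pattern: str, text: str, lowercase: bool = False) -> Counter:
    """Count each distinct match of pattern, optionally lowercasing first."""
    if lowercase:
        text = text.lower()
    return Counter(re.findall(pattern, text))

print(count_matches(r"the \w+", text))
print(count_matches(r"the \w+", text, lowercase=True))
```

Without lowercasing, sentence-initial “The” is missed; with it, both occurrences of “the road” are counted together, which is why preprocessing can noticeably change the resulting vectors.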

Posted in Chinese, Digital Humanities | Comments Off

Large-scale Optical Character Recognition of Pre-modern Chinese Texts

This paper appears in International Journal of Buddhist Thought and Culture 28(2) (December 2018). [Full paper]

Abstract

Optical character recognition (OCR) – the fully automated transcription of text appearing in a digitized image – offers transformative opportunities for the scholarly study of written materials produced prior to the digital age. Digitization, in the sense of photographic reproduction, is a largely straightforward, mechanical process, and one with significant value in its own right for purposes of preservation as well as access to rare materials. As a result, hundreds of millions of pages of pre-modern Chinese works have been digitized by libraries and academic institutions around the world – a significant portion of this increasingly being made freely available online.

To make use of this material efficiently, transcriptions of the textual content of these images are needed. Given the enormous volume of image data in existence – and its continual production as digitization continues – this task is only feasible if it can be fully automated: performed by software without manual intervention. Individually, reliable transcriptions produced by OCR offer enormous time savings to researchers, making it possible to efficiently navigate materials in ways not possible without digital transcription. In aggregate, however, these transcriptions make possible entirely new ways of exploring historical materials – making it possible to rapidly identify material that one suspects may exist somewhere, without knowing in advance where that might actually be. It is also a prerequisite to virtually any type of statistical analysis of these materials – the potential utility of which continues to increase as a larger and larger proportion of the extant corpus is transcribed.

This paper introduces a procedure for OCR of pre-modern Chinese written materials, both printed and handwritten, describing the complete process from digitized image through to automated transcription and manual correction of remaining errors, with particular attention to issues arising in this domain. The process described has been applied to over 25 million pages of pre-modern Chinese works, and the paper also introduces the Chinese Text Project platform, used both to make these results available to scholars and to provide a distributed, crowdsourced mechanism for facilitating manual corrections at scale, as well as further analysis of these materials.


[Figures: Noise removal; Character pitch identification; Seal isolation]

Posted in Chinese, Digital Humanities | Comments Off