CHNSHIS 202: Digital Methods for Chinese Studies

In Spring 2016, I will be teaching this course on digital methods at Harvard’s EALC.


Week 1 – Introduction

  • Background and basic concepts
  • Representing text on a computer
  • Setting up the Python environment

Week 2 – Introduction to programming

  • Variables, functions, loops, and files

Week 3 – Regular expressions

  • String manipulation and data extraction

Week 4 – Working with structured data

  • Associative arrays, tables, CSV files

Week 5 – Practical data manipulation

  • Automated extraction of data from the web

Week 6 – Textual similarity

  • Introduction to information retrieval

Week 7 – Topic modeling

  • Generating and interpreting data using Mallet

Week 8 – Network visualization with Gephi

  • Representing data as a network graph

Week 9 – Principal component analysis

  • Exploratory data analysis in Python

Week 10 – Machine learning

  • Features, classification, regression

Week 11 – Review and discussion

  • What worked, what didn’t, and why
  • Debugging of issues arising during project work

Week 12 – Student presentations and discussion

Coursework and Assessment

  • Class participation (30%)
    Students are expected to attend and actively participate in the practical sessions, complete short assigned problem sets, and apply the techniques introduced to their own data.
  • Homework assignments (30%)
    Four short homework assignments will be set, each requiring students to apply the digital techniques covered.
  • Final presentations (40%)
    Each student will give one presentation in which techniques introduced during the course are applied to a research topic in Chinese studies.

Learning Outcomes

Having completed this course, students will:

  • Have an understanding of how to apply digital techniques to their own projects.
  • Be able to apply basic programming techniques to extract data from Chinese texts for analysis, and perform various kinds of digital analysis on the resultant data in the context of their research.
  • Possess the basic skills needed to make use of the growing number of open-source Python libraries relevant to textual analysis.

Posted in Courses, Digital Humanities

Automation and Collaboration: Exploiting the Digital Medium

6th International Conference of Digital Archives and Digital Humanities,
30 November 2015, National Taiwan University
New Perspectives on Digital Sinology Resources panel

The digital medium presents unique opportunities and challenges for the development of new kinds of resources for the study of Chinese literature. Using examples from the Chinese Text Project, I suggest ways in which digital libraries can leverage the advantages of the digital realm to offer new functionality and services at relatively low cost. This involves the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation.

Posted in Digital Humanities, Talks and conference papers

Towards a Scalable Digital Library of Pre-Modern Chinese: From Static Database to Evolving Platform

Presentation at Harvard University, “Advancing Digital Scholarship in Japanese Studies: Innovations and Challenges” Workshop, 7 November 2015
Belfer Case Study Room, CGIS, 9.00 am

In the ten years since first going online, the Chinese Text Project has gradually expanded from a simple tool for searching and navigating a handful of early Chinese texts to become the largest publicly available full-text database of pre-modern Chinese, containing over 20,000 texts and more than 3 billion characters. In this presentation, I discuss technical and structural changes that have made this expansion possible with only limited resources. These changes involve the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation. Specific examples include the application of optical character recognition both to enable full-text search of scanned early editions and to create draft transcriptions of the same texts that can be proofread through crowdsourcing, and the application of natural language processing techniques to the identification of text reuse and the automated compilation of dictionary data. I also introduce ongoing work including the development of Application Programming Interfaces (APIs) and related mechanisms that will allow other projects to integrate with and build upon the resources of this digital library in a decentralized way while at the same time avoiding duplication of effort.

Posted in Digital Humanities, Talks and conference papers

Textual Relationships in the Pre-Qin and Han Corpus: A Digital Approach

Seminar at Harvard University, Fairbank Center for Chinese Studies, 26 October 2015
Room S153, CGIS South Building, 12.00

Textual parallels among early Chinese transmitted texts are extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving contributions from multiple authors and editors. Partly as a consequence of this complexity, establishing with certainty even approximate dates of authorship for texts and parts of texts is a challenging task. In this presentation, I demonstrate how digital methods grounded in textual and statistical evidence can help us better understand and visualize some of these complex relationships, and how digital methods may offer additional clues as to the likely provenance of disputed texts.
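As a rough illustration of how textual parallels can be located computationally, the sketch below indexes character n-grams and reports those that occur in more than one text. It is a minimal example under stated assumptions, not the method used in the presentation: the function names, the choice of n = 4, and the short sample strings are all hypothetical and serve only to show the idea.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the set of all character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(texts, n=4):
    """Map each n-gram found in two or more texts to the sorted titles containing it."""
    index = defaultdict(set)
    for title, body in texts.items():
        for gram in char_ngrams(body, n):
            index[gram].add(title)
    return {gram: sorted(titles) for gram, titles in index.items() if len(titles) > 1}


# Short unpunctuated sample strings, used purely for illustration.
texts = {
    "A": "學而時習之不亦說乎",
    "B": "子曰學而時習之",
    "C": "有朋自遠方來",
}

parallels = shared_ngrams(texts, n=4)
for gram, titles in sorted(parallels.items()):
    print(gram, titles)
```

Exact n-gram matching of this kind misses parallels that differ by variant characters or small rewordings, so a real study would likely need some form of normalization or fuzzy matching on top of it.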

Posted in Digital Humanities, Talks and conference papers

Exploring Text Reuse in the Pre-Qin and Han Corpus

Presentation at Harvard University, Computational Methods for Chinese History: A “Digging into Data Challenge” Training Workshop, 17 October 2015.
Science Center, Room B09, 3.15pm.

Posted in Digital Humanities, Talks and conference papers

Fairbank Center

From September 2015 to July 2016 I will be serving as a Postdoctoral Fellow at Harvard University’s Fairbank Center for Chinese Studies, working on (among other things) a project I’ve titled “Big Data and Early China: Corpus-Assisted Interpretation of Classical Chinese”. It’s really exciting to be here in Cambridge, and I look forward to being able to concentrate a little more on the digital humanities side of my research over the coming year.

Posted in Digital Humanities, Philosophy

Digitizing Early China

Seminar at Leiden University, 5 August 2015

Since its origins as a database of Warring States philosophical texts, the Chinese Text Project has gradually grown to become one of the largest digital libraries of pre-modern Chinese texts in existence, as well as a platform for applying new digital methods to the study of these texts. This seminar will introduce several unique aspects of the site from both sinological and technical perspectives, and discuss ongoing research, development, and future goals.

Posted in Digital Humanities, Philosophy, Talks and conference papers

Chinese Text Project – Support for Unicode 8.0

A new version of the Unicode standard has been released, defining thousands of additional rarely used and variant Chinese characters. Support for these has been added to the dictionary section of the site; to view these characters, please install the latest version of the Hanazono font. Many new characters belong to “CJK Extension E” – you can confirm system support for these from the Font Test Page.
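For readers who want to find out whether a text contains any of these characters, the following Python 3 sketch checks code points against the range of the "CJK Unified Ideographs Extension E" block (U+2B820 to U+2CEAF). The function names are hypothetical, and the check only concerns code points: it says nothing about whether an installed font can actually display them.

```python
# Code point range of the "CJK Unified Ideographs Extension E" block,
# which was added in Unicode 8.0.
EXT_E_START = 0x2B820
EXT_E_END = 0x2CEAF


def in_cjk_extension_e(char):
    """Return True if the single character lies in the CJK Extension E block."""
    return EXT_E_START <= ord(char) <= EXT_E_END


def extension_e_chars(text):
    """Return the characters of text that fall inside CJK Extension E, in order."""
    return [c for c in text if in_cjk_extension_e(c)]


# A sample string mixing a common character, the first Extension E
# code point, and a Latin letter.
sample = "學" + chr(0x2B820) + "A"
found = extension_e_chars(sample)
print([f"U+{ord(c):04X}" for c in found])
```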

Posted in Digital Humanities

Zhuangzi, perspectives, and greater knowledge

This paper, accepted April 2012, has now appeared in Philosophy East and West 65:3 (July 2015).

Posted in Uncategorized

Chinese Text Project: over ten million pages of pre-modern Chinese texts now searchable online

Update to the CTP:

A major update to the site has been made by applying OCR to over ten million pages of transmitted texts stored in the Library, linking scanned texts where possible to digital editions that follow them. Over 3000 existing texts have been successfully linked, allowing side-by-side display and textual searching of scanned texts.
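To give a sense of how an error-prone OCR transcription might be matched to an existing digital edition, here is a minimal sketch using Python's standard difflib module. This is an assumption-laden illustration rather than a description of how the site performs the linking: the function names, the 0.6 threshold, and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a similarity ratio between 0 and 1 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def best_match(ocr_page, passages, threshold=0.6):
    """Return (index, score) of the passage most similar to ocr_page.

    Returns None if there are no passages or the best score is below threshold.
    """
    if not passages:
        return None
    scored = [(similarity(ocr_page, p), i) for i, p in enumerate(passages)]
    score, index = max(scored)
    return (index, score) if score >= threshold else None


# Hypothetical passages from a digital edition, and an OCR result in which
# one character has been recognized as a variant form.
passages = ["學而時習之不亦說乎", "有朋自遠方來不亦樂乎"]
ocr_page = "學而時習之不亦説乎"

match = best_match(ocr_page, passages)
print(match)
```

Comparing every page against every passage in this way becomes slow for large collections, so a practical system would probably narrow down the candidates first, for example with an n-gram index.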

Additionally, around ten thousand new texts and editions have been transcribed for the first time using OCR. While these transcriptions inevitably contain many errors, they make it possible to search the scanned texts and immediately locate information within them. All newly transcribed texts have been added to the Wiki – please help by correcting errors when using these resources.

For further details, please see the OCR instructions.

Posted in Digital Humanities