Introducing the first in a series of online tutorials covering basic digital humanities techniques using the Python programming language and the Chinese Text Project API. These tutorials are based in part on material covered in the course CHNSHIS 202: Digital Methods for Chinese Studies, which I taught at Harvard University’s Department of East Asian Languages and Civilizations in Spring 2016.
Intended audience: People with some knowledge of Chinese literature and an interest in digital humanities; no programming experience necessary.
Format: Most of these tutorials will consist of a Jupyter Notebook file. These files contain a mixture of explanations and code that can be modified and run from within your web browser. This makes it very easy to modify, play with, and extend all of the example code. You can also read the tutorials online first, though you'll need to download the files in order to run the code and do the exercises.
Yale University, 22 April 2016
As databases, digital libraries, and digital tools grow in size and scope, they present increasingly valuable opportunities for research using novel methods including text mining, distant reading, and other techniques that can be grouped under the heading “digital humanities”. At the same time, what individual projects and their associated tools and materials can achieve technically in practice is frequently limited by the use cases their creators envisioned when these resources were first designed and implemented.
Application Programming Interfaces (APIs) – standardized mechanisms through which independently developed pieces of computer software are able to share data and functionality in real time – provide one approach to greatly increasing the flexibility and thus utility of databases, digital libraries, and other tools. Key to the utility of such APIs is the possibility of functionality and content being reused in different ways by different users, without requiring central implementation of a new mechanism for each use case.
In this talk I describe how the implementation of existing third-party APIs, as well as the development of a new special-purpose API for the Chinese Text Project, an online database and digital library of pre-modern Chinese texts, has opened up new opportunities for fast, efficient, and easy-to-use repurposing of data in a variety of contexts. These include user-driven integration with other online tools and resources (including both those already available, and those still to be constructed), statistical textual analysis and natural language processing research, and teaching and research in Chinese digital humanities.
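As a concrete illustration of the kind of API access described above, the sketch below constructs a request URL for the Chinese Text Project API. The function name ("gettext"), the `urn` parameter, and the example URN are illustrative assumptions; consult the API documentation at ctext.org for the current function and parameter names before relying on them.

```python
from urllib.parse import urlencode

# Base URL of the Chinese Text Project API.
API_BASE = "https://api.ctext.org"

def build_request(function, **params):
    """Construct a request URL for a CTP API call.

    `function` is the API function name and keyword arguments become
    query parameters. Names used here are illustrative examples, not a
    definitive list of what the API supports.
    """
    return "{}/{}?{}".format(API_BASE, function, urlencode(params))

# Request the text of a chapter identified by a stable URN (the URN
# shown, "ctp:analects/xue-er", is an assumed example identifier).
url = build_request("gettext", urn="ctp:analects/xue-er")
print(url)
```

In practice the resulting URL would be fetched (e.g. with the `requests` library) and the response parsed as JSON; because every text and chapter has a stable identifier, independently developed tools can share references to the same material without any central coordination.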
Paper presented at AAS 2016, Seattle, April 1, 2016
The classical Chinese corpus has long been recognized to contain a vast amount of text reuse: closely related textual content that, for a variety of reasons, occurs in multiple works that might otherwise be considered to be quite independent creations ascribed to entirely different authors. Although this reuse occasionally involves explicit citation of a particular work, or acknowledgment that what follows is a widely known saying as opposed to an original invention of the author, far more often no indication is given that a passage may have been borrowed from elsewhere. Identifying such instances of reuse can shed light upon difficult issues of authorship and textual history, as well as highlight textual variations that can provide clues to the interpretation of obscure or disputed passages.
Digital methods make possible the exploration and analysis of text reuse not only in isolated instances, but systematically across a corpus of works as a whole. In this paper I propose methods of identifying two distinct types of text reuse in the classical Chinese corpus and provide an evaluation of the degrees of accuracy achieved. The first is overtly similar or “parallel” passages, which can be reliably located by defining and maximizing appropriate similarity metrics over regions of text. The second is less direct allusion to the content of earlier works, and is considerably more challenging to identify. I propose an approach that makes use of information retrieval and machine learning techniques, while also leveraging statistical data derived from the more easily identified “parallel” passages.
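The first, more tractable case can be illustrated with a simple similarity metric. The sketch below scores two passages by the Jaccard similarity of their character n-gram sets, a standard baseline for text reuse detection in unsegmented text; it is an illustrative example, not the specific metric proposed in the paper.

```python
def char_ngrams(text, n=3):
    """Return the set of overlapping character n-grams of a string.

    Classical Chinese has no word spacing, so character n-grams are a
    natural unit for comparing passages."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard_similarity(a, b, n=3):
    """Jaccard similarity of the character n-gram sets of two passages:
    |A ∩ B| / |A ∪ B|, ranging from 0 (nothing shared) to 1 (identical sets)."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# Two closely parallel passages score far higher than unrelated ones.
p1 = "天下皆知美之為美斯惡已"  # Laozi 2 (received text)
p2 = "天下皆知美之為美惡已"    # a near-parallel variant
p3 = "學而時習之不亦說乎"      # an unrelated passage (Analects 1.1)
print(jaccard_similarity(p1, p2))  # high: many shared trigrams
print(jaccard_similarity(p1, p3))  # 0.0: no shared trigrams
```

Locating parallel passages across a corpus then amounts to maximizing a metric of this kind over candidate regions of text, with efficiency gained by indexing n-grams so that only passages sharing at least one n-gram are ever compared.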
In Spring 2016, I will be teaching this course on digital methods at Harvard’s EALC.
Week 1 – Introduction
- Background and basic concepts
- Representing text on a computer
- Setting up the Python environment
Week 2 – Introduction to programming
- Variables, functions, loops, and files
Week 3 – Regular expressions
- String manipulation and data extraction
Week 4 – Working with structured data
- Associative arrays, tables, CSV files
Week 5 – Practical data manipulation
- Automated extraction of data from the web
Week 6 – Textual similarity
- Introduction to information retrieval
Week 7 – Topic modeling
- Generating and interpreting data using Mallet
Week 8 – Network visualization with Gephi
- Representing data as a network graph
Week 9 – Principal component analysis
- Exploratory data analysis in Python
Week 10 – Machine learning
- Features, classification, regression
Week 11 – Review and discussion
- What worked, what didn’t, and why
- Debugging of issues arising during project work
Week 12 – Student presentations and discussion
Coursework and Assessment
Class participation (30%)
Students are expected to attend and actively participate in the practical sessions, complete short assigned problem sets, and apply the techniques introduced to their own data.
Homework assignments (30%)
Four short homework assignments will be set, each based on applying the digital techniques covered in class.
Final presentations (40%)
Each student will give one presentation in which techniques introduced during the course are applied to a research topic in Chinese studies.
Having completed this course, students will:
- Have an understanding of how to apply digital techniques to their own projects.
- Be able to apply basic programming techniques to extract data from Chinese texts for analysis, and perform various kinds of digital analysis on the resultant data in the context of their research.
- Possess the basic skills needed to make use of the growing number of open-source Python libraries relevant to textual analysis.
6th International Conference of Digital Archives and Digital Humanities,
30 November 2015, National Taiwan University
New Perspectives on Digital Sinology Resources panel
The digital medium presents unique opportunities and challenges for the development of new kinds of resources for the study of Chinese literature. Using examples from the Chinese Text Project, I suggest ways in which digital libraries can leverage the advantages of the digital realm to offer new functionality and services at relatively low cost. This involves the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation.
Presentation at Harvard University, “Advancing Digital Scholarship in Japanese Studies: Innovations and Challenges” Workshop, 7 November 2015
Belfer Case Study Room, CGIS, 9.00 am
In the ten years since first going online, the Chinese Text Project has gradually expanded from a simple tool for searching and navigating a handful of early Chinese texts to become the largest publicly available full-text database of pre-modern Chinese, containing over 20,000 texts and more than 3 billion characters. In this presentation, I discuss technical and structural changes that have made this expansion possible with only limited resources. These changes involve the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation. Specific examples include the application of optical character recognition both to enable full-text search of scanned early editions and to create draft transcriptions of the same texts that can be proofread by crowd-sourcing, and of natural language processing techniques to the identification of text reuse and the automated compilation of dictionary data. I also introduce ongoing work including the development of Application Programming Interfaces (APIs) and related mechanisms that will allow other projects to integrate with and build upon the resources of this digital library in a decentralized way while at the same time avoiding duplication of effort.
Seminar at Harvard University, Fairbank Center for Chinese Studies, 26 October 2015
Room S153, CGIS South Building, 12.00
Textual parallels among early Chinese transmitted texts are extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving contributions from multiple authors and editors. Partly as a consequence of this complexity, establishing with certainty even approximate dates of authorship for texts and parts of texts is a challenging task. In this presentation, I demonstrate how digital methods grounded in textual and statistical evidence can help us better understand and visualize some of these complex relationships, and how digital methods may offer additional clues as to the likely provenance of disputed texts.
Presentation at Harvard University, Computational Methods for Chinese History: A “Digging into Data Challenge” Training Workshop, 17 October 2015.
Science Center, Room B09, 3.15pm.
From September 2015 to July 2016 I will be serving as a Postdoctoral Fellow at Harvard University’s Fairbank Center for Chinese Studies, working on (among other things) a project I’ve titled “Big Data and Early China: Corpus-Assisted Interpretation of Classical Chinese”. It’s really exciting to be here in Cambridge, and I look forward to being able to concentrate a little more on the digital humanities side of my research over the coming year.
Seminar at Leiden University, 5 August 2015
Since its origins as a database of Warring States philosophical texts, the Chinese Text Project (http://ctext.org) has gradually grown to become one of the largest digital libraries of pre-modern Chinese texts in existence, as well as a platform for applying new digital methods to the study of these texts. This seminar will introduce several unique aspects of the site from both sinological and technical perspectives, as well as discussing ongoing research, development and future goals.