In Spring 2016, I will be teaching this course on digital methods at Harvard’s EALC.
Week 1 – Introduction
- Background and basic concepts
- Representing text on a computer
- Setting up the Python environment
Week 2 – Introduction to programming
- Variables, functions, loops, and files
Week 3 – Regular expressions
- String manipulation and data extraction
Week 4 – Working with structured data
- Associative arrays, tables, CSV files
Week 5 – Practical data manipulation
- Automated extraction of data from the web
Week 6 – Textual similarity
- Introduction to information retrieval
Week 7 – Topic modeling
- Generating and interpreting data using Mallet
Week 8 – Network visualization with Gephi
- Representing data as a network graph
Week 9 – Principal component analysis
- Exploratory data analysis in Python
Week 10 – Machine learning
- Features, classification, regression
Week 11 – Review and discussion
- What worked, what didn’t, and why
- Debugging of issues arising during project work
Week 12 – Student presentations and discussion
Coursework and Assessment
Class participation (30%)
Students are expected to attend and actively participate in the practical sessions, complete short assigned problem sets, and apply the techniques introduced to their own data.
Homework assignments (30%)
Four short homework assignments will be set, each based on applying the digital techniques covered.
Final presentations (40%)
Each student will give one presentation in which techniques introduced during the course are applied to a research topic in Chinese studies.
Having completed this course, students will:
- Have an understanding of how to apply digital techniques to their own projects.
- Be able to apply basic programming techniques to extract data from Chinese texts for analysis, and perform various kinds of digital analysis on the resultant data in the context of their research.
- Possess the basic skills needed to make use of the growing number of open-source Python libraries relevant to textual analysis.
6th International Conference of Digital Archives and Digital Humanities,
30 November 2015, National Taiwan University
New Perspectives on Digital Sinology Resources panel
The digital medium presents unique opportunities and challenges for the development of new kinds of resources for the study of Chinese literature. Using examples from the Chinese Text Project, I suggest ways in which digital libraries can leverage the advantages of the digital realm to offer new functionality and services at relatively low cost. This involves the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation.
Presentation at Harvard University, “Advancing Digital Scholarship in Japanese Studies: Innovations and Challenges” Workshop, 7 November 2015
Belfer Case Study Room, CGIS, 9.00 am
In the ten years since first going online, the Chinese Text Project has gradually expanded from a simple tool for searching and navigating a handful of early Chinese texts to become the largest publicly available full-text database of pre-modern Chinese, containing over 20,000 texts and more than 3 billion characters. In this presentation, I discuss technical and structural changes that have made this expansion possible with only limited resources. These changes involve the exploitation of two primary avenues for scalable development: firstly the use of automation to achieve goals realistically attainable by computational methods, and secondly the encouragement of open user engagement to recruit human volunteers to assist with tasks less suited to automation. Specific examples include the application of optical character recognition both to enable full-text search of scanned early editions and to create draft transcriptions of the same texts that can be proofread through crowdsourcing, and the application of natural language processing techniques to the identification of text reuse and the automated compilation of dictionary data. I also introduce ongoing work including the development of Application Programming Interfaces (APIs) and related mechanisms that will allow other projects to integrate with and build upon the resources of this digital library in a decentralized way while at the same time avoiding duplication of effort.
Seminar at Harvard University, Fairbank Center for Chinese Studies, 26 October 2015
Room S153, CGIS South Building, 12.00
Textual parallels among early Chinese transmitted texts are extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving contributions from multiple authors and editors. Partly as a consequence of this complexity, establishing with certainty even approximate dates of authorship for texts and parts of texts is a challenging task. In this presentation, I demonstrate how digital methods grounded in textual and statistical evidence can help us better understand and visualize some of these complex relationships, and how they may offer additional clues as to the likely provenance of disputed texts.
Presentation at Harvard University, Computational Methods for Chinese History: A “Digging into Data Challenge” Training Workshop, 17 October 2015.
Science Center, Room B09, 3.15pm.
From September 2015 to July 2016 I will be serving as a Postdoctoral Fellow at Harvard University’s Fairbank Center for Chinese Studies, working on (among other things) a project I’ve titled “Big Data and Early China: Corpus-Assisted Interpretation of Classical Chinese”. It’s really exciting to be here in Cambridge, and I look forward to being able to concentrate a little more on the digital humanities side of my research over the coming year.
Seminar at Leiden University, 5 August 2015
Since its origins as a database of Warring States philosophical texts, the Chinese Text Project (http://ctext.org) has gradually grown to become one of the largest digital libraries of pre-modern Chinese texts in existence, as well as a platform for applying new digital methods to the study of these texts. This seminar will introduce several unique aspects of the site from both sinological and technical perspectives, and will discuss ongoing research, development, and future goals.
A new version of the Unicode standard has been released, defining thousands of additional rarely used and variant Chinese characters. Support for these has been added to the dictionary section of the site; to view these characters, please install the latest version of the Hanazono font. Many new characters belong to “CJK Extension E” – you can confirm system support for these from the Font Test Page.
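For readers who want to check which characters in a passage fall into this block, here is a minimal Python 3 sketch. It is an illustration only, not code used by the site: the function names are hypothetical, and the range U+2B820 to U+2CEAF is the CJK Unified Ideographs Extension E block as defined in Unicode 8.0. Note that it only identifies the characters; whether they display correctly still depends on the installed fonts.

```python
# Code point range of the "CJK Unified Ideographs Extension E" block
# (Unicode 8.0). Assumes Python 3, where ord() returns the full code
# point of a character outside the Basic Multilingual Plane.
EXT_E_START = 0x2B820
EXT_E_END = 0x2CEAF


def in_cjk_extension_e(char):
    """Return True if the single character `char` lies in CJK Extension E."""
    return EXT_E_START <= ord(char) <= EXT_E_END


def extension_e_characters(text):
    """Return the characters of `text` that belong to CJK Extension E, in order."""
    return [c for c in text if in_cjk_extension_e(c)]


# Hypothetical sample: five common characters plus the first code point
# of Extension E, appended purely for demonstration.
sample = "學而時習之" + chr(0x2B820)

for c in extension_e_characters(sample):
    print("U+%X" % ord(c))  # prints "U+2B820"
```

Any character reported here that shows up as an empty box or a question mark on screen is one for which the system lacks a suitable font.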
This paper, accepted April 2012, has now appeared in Philosophy East and West 65:3 (July 2015).