University of Tokyo Hands-on Workshop

Thanks to the help of Professor Nagasaki Kiyonori, I am thrilled to be holding this hands-on workshop covering usage of the Chinese Text Project and Text Tools in Tokyo this December. Details follow:

Digital Research Tools for Pre-modern Chinese Texts

Instructor: Dr. Donald Sturgeon (Harvard University)

Date and time: 13 December 2017, 17:00-20:00






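※ A wireless LAN internet connection will be provided, but the number of devices that can connect is limited, so we would be grateful if you could also arrange your own internet access where possible.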

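We have invited Dr. Donald Sturgeon of Harvard University, developer and operator of the Chinese Text Project, the world's largest crowdsourced transcription site for classical Chinese texts, to introduce the basics of the site and to guide participants through hands-on use of the text analysis tools it provides. The tools used in this session appear to run essentially in a web browser, so there is no need to install any new software.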

Posted in Chinese, Digital Humanities | Comments Off

Digital Research Tools for Pre-modern Chinese Texts

Interactive workshop 9:00am-12:00pm, November 18, 2017, held in B129, Northwest Building, 52 Oxford St., Cambridge, MA 02138
[Download slides]

Digital methods offer increasingly powerful tools to aid in the study and analysis of historical written works, both through exploratory techniques in which previously unnoticed trends and relationships are highlighted, and through computer-assisted assembly of data to refute or confirm particular hypotheses. Applying such techniques in practice often requires first overcoming technical challenges – in particular, gaining access to machine-readable editions of the desired texts, as well as to tools capable of performing such analyses.

This hands-on practical workshop introduces approaches intended to reduce the technical barriers to experimenting with these techniques and evaluating their utility for particular scholarly uses. The first part of this workshop introduces the Chinese Text Project, which has grown to become the largest full-text digital library of pre-modern Chinese. While on the one hand the website offers a simple means to access commonly used functions such as full-text search for a wide range of pre-modern Chinese sources, at the same time it also provides more sophisticated mechanisms allowing for more open-ended use of its contents, as well as the ability to contribute directly to the digitization of entirely new materials.

The second part of the workshop introduces tools for performing digital textual analysis of Chinese-language materials, which may be obtained from the Chinese Text Project or elsewhere. These include identification of text reuse within and between written materials, sophisticated pattern search using regular expressions, and visualization of the results of these and other types of analysis.
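As a small illustration of the kind of pattern search mentioned above, here is a minimal Python sketch using the standard `re` module. The sample sentence and the pattern are chosen purely for illustration and are not taken from the workshop materials.

```python
import re

# Sample input: the opening of the Analects, used here only as illustrative text.
text = "子曰：學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？人不知而不慍，不亦君子乎？"

# Match the rhetorical frame "不亦 ... 乎" and capture whatever fills the gap.
# The character class excludes punctuation so a match cannot span two clauses.
pattern = re.compile(r"不亦([^，。？：]+?)乎")

matches = pattern.findall(text)
print(matches)  # ['說', '樂', '君子']
```

The same pattern could be applied to any other text simply by changing the input string.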

Posted in Chinese, Digital Humanities | Comments Off

Unsupervised identification of text reuse in early Chinese literature

This paper will appear in Digital Scholarship in the Humanities (currently available in “Advance articles”).

Text reuse in early Chinese transmitted texts is extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving the work of multiple authors and editors. In this study, a fully automated method of identifying and representing complex text reuse patterns is presented, and the results evaluated by comparison to a manually compiled reference work. The resultant data is integrated into a widely used and publicly available online database system with browse, search, and visualization functionality. These same results are then aggregated to create a model of text reuse relationships at a corpus level, revealing patterns of systematic reuse among groups of texts. Lastly, the large number of reuse instances identified makes possible the analysis of frequently occurring string substitutions, which are observed to be strongly indicative of partial synonymy between strings.
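To make the idea of counting string substitutions across reuse instances concrete, here is a minimal sketch. It is not the method used in the paper: it assumes that pairs of parallel passages have already been identified, uses Python's standard `difflib` to align them, and runs on toy strings constructed for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitutions(a: str, b: str):
    """Yield (old, new) pairs where a stretch of `a` is replaced in `b`."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield a[i1:i2], b[j1:j2]


# Toy parallel passages (constructed for illustration, not real data).
pairs = [
    ("學而時習之不亦說乎", "學而時習之不亦悅乎"),
    ("人不說而去", "人不悅而去"),
]

# Count how often each substitution occurs across all aligned pairs.
counts = Counter(s for a, b in pairs for s in substitutions(a, b))
print(counts.most_common(1))  # [(('說', '悅'), 2)]
```

Substitutions that recur across many independent pairs are the ones worth inspecting as candidates for partial synonymy or graphic variation.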

Download the full paper. According to Oxford University Press, this link should give you access to the PDF even if you are not accessing it from a subscribing institution.

Posted in Chinese, Digital Humanities | Comments Off

Linking, sharing, merging: sustainable digital infrastructure for complex biographical data

Paper to be presented at Biographical Data in a Digital World, 6 November 2017, Linz.

In modeling complex humanities data, projects working within a particular domain often have overlapping but distinct priorities and goals. One common result of this is that separate systems contain overlapping data: some of the objects modeled are common to more than one system, though how they are represented may be very different in each.

While within a particular domain it can be desirable for projects to standardize their data structures and formats in order to allow for more efficient linking and exchange of data between projects, for complex datasets this can be an ambitious task in itself. An alternative approach is to identify a core set of data which it would be most beneficial to be able to query in aggregate across systems, and provide mechanisms for sharing and maintaining this data as a means through which to link between projects.

For biographical data, the clearest example of this is information about the same individual appearing in multiple systems. Focusing on this particular case, this talk presents one approach to creating and sustaining with minimal maintenance a means for establishing machine-actionable links between datasets maintained and developed by different groups, while also promoting more ambitious data sharing.

This model consists of three components: 1) schema maintainers, who define and publish a format for sharing data; 2) data providers, who make data available according to a published schema; and 3) client systems, which aggregate the data from one or more data providers adhering to a common schema. This can be used to implement a sustainable union catalog of the data, in which the catalog provides a means to directly locate information in any of the connected systems, but is not itself responsible for maintenance of data. The model is designed to be general-purpose and to extend naturally to similar use cases.
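The client side of this three-part model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`shared_id`, `provider`, `local_url`), the identifier format, and the example URLs are all hypothetical and do not come from any published schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PersonRecord:
    """One record as a data provider might publish it (hypothetical schema)."""
    shared_id: str   # identifier agreed on in the shared schema (assumed)
    provider: str    # name of the system publishing the record
    local_url: str   # where the full information lives in that system


def build_union_catalog(feeds):
    """Client system: aggregate records from several providers by shared ID.

    The catalog stores only pointers; each provider keeps maintaining its own data.
    """
    catalog = defaultdict(list)
    for feed in feeds:
        for record in feed:
            catalog[record.shared_id].append(record)
    return dict(catalog)


# Two hypothetical providers publishing data according to the same schema.
provider_a = [PersonRecord("person:0001", "System A", "https://a.example.org/p/17")]
provider_b = [
    PersonRecord("person:0001", "System B", "https://b.example.org/people/x9"),
    PersonRecord("person:0002", "System B", "https://b.example.org/people/y3"),
]

catalog = build_union_catalog([provider_a, provider_b])
for record in catalog["person:0001"]:
    print(record.provider, record.local_url)
```

Because the catalog holds links rather than copies of the data, rebuilding it from the providers' current feeds is all the maintenance it needs in this sketch.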

Posted in Digital Humanities, Talks and conference papers | Comments Off

Pusan National University

I’m very excited to be visiting the Department of Korean Literature in Classical Chinese at Pusan National University next week to give two talks – abstracts follow:

Old Meets New: Digital Opportunities in the Humanities
28th September 2017, 10am-12pm

The application of digital methods has brought enormous benefits to many fields of study, not only by offering more efficient ways of conducting research and teaching along traditional lines, but also by opening up entirely new directions and research questions which would have been impractical or even impossible to pursue prior to the digital age. This digital revolution offers new and exciting opportunities for many humanities subjects – including Chinese studies. Through use of computer software, digital techniques make possible large-scale studies of volumes of material which would once have been entirely impractical to study in depth due to the time and manual effort required to assemble and process the source materials. Even more excitingly, they offer the opportunity to apply sophisticated statistical techniques to give new insight and understanding into important humanities questions. In this talk I introduce examples of how and why computational methods are making possible new types of studies in the humanities in general and the study of Chinese literature and history in particular.

Computational Approaches to Chinese Literature
28th September 2017, 4-6pm

Digital methods and the emerging field of digital humanities are revolutionizing the study of literature and history. In the first part of this talk, I present the results of a computational study of parallel passages in the pre-Qin and Han corpus and use it to demonstrate how digital methods can provide new insights in the field of pre-modern Chinese literature. This study begins by implementing an automated procedure for identifying pairs of parallel passages, which is demonstrated to be more effective than prior work by human experts. The procedure is used to identify hundreds of thousands of parallels within the classical Chinese corpus, and the resulting data aggregated in order to study broader trends. The results of this quantitative study enable not only far more precise evaluation of claims made by traditional scholarship, but also the investigation of patterns of text reuse at a corpus level.

The second part of the talk introduces the Chinese Text Project digital library and associated tools for textual analysis of Chinese literature. Taken together, these provide a uniquely flexible platform for digital textual analysis of pre-modern Chinese writing, which allows for rapid experimentation with a range of digital techniques without requiring specialized technical or programming skills. Methods introduced include automated identification of text reuse, pattern matching using regular expressions, and network visualization.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

JADH Poster: DH research and teaching with digital library APIs

At this year’s Japanese Association for Digital Humanities conference, as well as giving a keynote on digital infrastructure, I also presented this poster on the specific example of full-text digital library APIs being used in and for teaching at Harvard EALC.


As digital libraries continue to grow in size and scope, their contents present ever increasing opportunities for use in data mining as well as digital humanities research and teaching. At the same time, the contents of the largest such libraries tend towards being dynamic rather than static collections of information, changing over time as new materials are added and existing materials augmented in various ways. Application Programming Interfaces (APIs) provide efficient mechanisms by which to access materials from digital libraries for data mining and digital humanities use, as well as by which to enable the distributed development of related tools. Here I present a working example of an API developed for the Chinese Text Project digital library being used to facilitate digital humanities research and teaching, while also enabling distributed development of related tools without requiring centralized administration or coordination.

Firstly, for data-mining, digital humanities teaching and research use, the API facilitates direct access to textual data and metadata in machine-readable format. In the implementation described, the API itself consists of a set of documented HTTP endpoints returning structured data in JSON format. Textual objects are identified and requested by means of stable identifiers, which can be obtained programmatically through the API itself, as well as manually through the digital library’s existing public user interface. To further facilitate use of the API by end users, native modules for several programming environments (currently including Python and JavaScript) are also provided, wrapping API calls in methods adapted to the specific environment. Though not required in order to make use of the API, these native modules greatly simplify the most common use cases, further abstract details of implementation, and make possible the creation of programs performing sophisticated operations on arbitrary textual objects using a few lines of easily understandable code. This has obvious applications in digital humanities teaching, where simple and efficient access to data in consistent formats is of considerable importance when covering complex subjects within a limited amount of classroom or lab time, and also facilitates research use in which the ability to rapidly experiment with different materials as well as prototype and reuse code with minimal effort is also of practical utility.
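The general pattern described here (an HTTP endpoint, a stable identifier, and a JSON response) can be sketched without any network access. Everything specific in the following Python example is a placeholder assumption: the base URL, the endpoint name `text`, the parameter name `id`, the identifier format, and the JSON field names are invented for illustration and are not the actual API of the Chinese Text Project.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.example.org"  # hypothetical placeholder, not a real service


def build_request_url(endpoint: str, identifier: str) -> str:
    """Build the URL for requesting a textual object by its stable identifier."""
    query = urlencode({"id": identifier})
    return f"{API_BASE}/{endpoint}?{query}"


def parse_text_response(payload: str):
    """Decode a JSON response into a (title, list of paragraphs) pair."""
    data = json.loads(payload)
    return data["title"], data["paragraphs"]


# A canned response standing in for what such an endpoint might return.
sample_payload = json.dumps(
    {"title": "Sample chapter", "paragraphs": ["first paragraph", "second paragraph"]}
)

url = build_request_url("text", "text:sample/chapter-1")
title, paragraphs = parse_text_response(sample_payload)
print(url)
print(title, len(paragraphs))
```

A native module of the kind described above would wrap these two steps (plus the HTTP request itself) in a single function call, which is what keeps classroom code down to a few lines.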

Secondly, along with the API itself, the provision of a plugin mechanism allowing the creation of user-definable extensions to the library’s online user interface makes possible augmentation of core library functionality through the use of external tools in ways that are transparent and intuitive to end users while also not requiring centralized coordination or approval to create or modify. Plugins consist of user-defined, sharable XML resource descriptions which can be installed into individual user accounts; the user interface uses information contained in these descriptions – such as link schemas – to send appropriate data such as textual object references to specified external resources, which can then request full-text data, metadata, and other relevant content via API and perform task-specific processing on the requested data. Any user can create a new plugin, share it with others, and take responsibility for future updates to their plugin code, without requiring central approval or coordination.
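How a user interface might turn such a resource description into a link can be sketched as follows. The XML element names (`plugin`, `name`, `link`), the `{ref}` placeholder syntax, and the example URL are assumptions made for this illustration; the actual plugin format is not reproduced here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# A hypothetical plugin description; element names are invented for this sketch.
PLUGIN_XML = """<plugin>
  <name>Example exporter</name>
  <link>https://tools.example.org/export?ref={ref}</link>
</plugin>"""


def plugin_link(description_xml: str, text_ref: str):
    """Return (plugin name, URL) with the textual object reference filled in."""
    root = ET.fromstring(description_xml)
    name = root.findtext("name")
    template = root.findtext("link")
    # URL-encode the reference before substituting it into the link schema.
    return name, template.replace("{ref}", quote(text_ref, safe=""))


name, link = plugin_link(PLUGIN_XML, "text:sample/chapter-1")
print(name)
print(link)
```

The external tool receiving this link would then use the reference to request the text itself via the API, so the plugin description never needs to contain any textual data.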

This technical framework enables a distributed web-based development model in which external projects can be loosely integrated with the digital library and its user interface, from an end user perspective being well integrated with the library, while from a technical standpoint being developed and maintained entirely independently. Currently available applications using this approach include simple plugins for basic functionality such as full-text export, the “Text Tools” plugin for textual analysis, and the “MARKUS” named entity markup interface for historical Chinese texts developed by Brent Ho and Hilde De Weerdt, as well as a large number of external online dictionaries. The “Text Tools” plugin provides a range of common text processing services and visualization methods, such as n-gram statistics, similarity comparisons of textual materials based on n-gram shingling, and regular expression search and replace, along with network graph, word cloud, and chart visualizations; “MARKUS” uses external databases of Chinese named entities together with a custom interface to mark-up texts for further analysis. Because of the standardization of format imposed by the API layer, such plugins have access not only to structured metadata about texts and editions, but also to structural information about the text itself, such as data on divisions of texts into individual chapters and paragraphs. For example, in the case of the “Text Tools” plugin this information can be used by the user to aggregate regular expression results and perform similarity comparisons by text, by chapter or by paragraph, in the latter two cases also making possible visualization of results using the integrated network graphing tool. 
As these tasks are facilitated by API, tools such as these can be developed and maintained without requiring knowledge of or access to the digital library’s code base or internal data structures; from an end user perspective, these plugins do not require technical knowledge to use, and can be accessed as direct extensions to the primary user interface. This distributed model of development has the potential to greatly expand the range of available features and use cases of this and other digital libraries, by providing a practical separation of concerns of data and metadata creation and curation on the one hand, and text mining, markup, visualization, and other tasks on the other, while simultaneously allowing this technical division to remain largely transparent to a user of these separately maintained and developed tools and platforms.
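As a rough illustration of similarity comparison based on n-gram shingling, the following Python sketch compares two strings by the overlap of their character n-grams. The choice of Jaccard similarity as the measure, and the toy strings, are assumptions made here; the measure actually used by "Text Tools" is not specified above.

```python
def shingles(text: str, n: int = 2) -> set:
    """Return the set of overlapping character n-grams in `text`."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_similarity(a: str, b: str, n: int = 2) -> float:
    """Share of n-grams common to both texts, out of all n-grams in either."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


# Toy strings differing by a single character.
a = "學而時習之不亦說乎"
b = "學而時習之不亦悅乎"

print(jaccard_similarity(a, b))  # 0.6 (6 shared bigrams out of 10 distinct)
print(jaccard_similarity(a, a))  # 1.0
```

Computing this score for every pair of chapters or paragraphs gives a weighted edge list, which is the kind of data a network graph visualization can take as input.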

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Collaboration at scale: emerging infrastructures for digital scholarship

Keynote lecture, Japanese Association for Digital Humanities (JADH 2017), Kyoto


Modern technological society is possible only as a result of collaborations constantly taking place between countless individuals and groups working on tasks which at first glance may seem independent from one another yet are ultimately connected through complex interdependencies. Just as technological progress is not merely a story of ever more sophisticated technologies, but also of the evolution of increasingly efficient structures facilitating their development, so too scholarship moves forward not just by the creation of ever more nuanced ideas and theories, but also by increasingly powerful means of identifying, exchanging, and building upon these ideas.

The digital medium presents revolutionary opportunities for facilitating such tasks in humanities scholarship. Most obviously, it offers the ability to perform certain types of analyses on scales larger than would ever have been practical without use of computational methods – for example the examination of trends in word usage across millions of books, or visualizations of social interactions of tens of thousands of historical individuals. But it also presents opportunities for vastly more scalable methods of collaboration between individuals and groups working on distinct yet related projects. Simple examples are readily available: computer scientists develop and publish code through open source platforms, companies further adapt it for use in commercial systems, and humanities scholars apply it to their own research; libraries digitize and share historical works from their collections, which are transcribed by volunteers, searched and read by researchers, and cited in scholarly works.

Much of the infrastructure already in use in digital scholarship is infrastructure developed for more general-purpose use – a natural and desirable development given the obvious economies of scale which result from this. However, as the application of digital methods in humanities scholarship becomes increasingly mainstream, as digitized objects of study become more numerous, and as related digital techniques become more specialized, the value of infrastructure designed specifically to support scholarship in particular fields of study becomes increasingly apparent. This paper will examine types of humanities infrastructure projects which are emerging, and the potential they have to facilitate scalable collaboration within and beyond distributed scholarly communities.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Digital humanities and the digital library

Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”

Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.






In this talk I present an overview of key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, the public user interface of which is currently used by over 25,000 people every day. Key technologies used fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts, a practical and successful crowdsourcing interface taking advantage of a large base of users, and an open Application Programming Interface allowing both integration with other online tools and projects as well as open-ended use for text mining purposes.

Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage aspects of common writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. These techniques have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results made freely available online.

Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add additional information and metadata, allowing users around the world to meaningfully and immediately contribute to the project and to actively participate in the curation of its contents. Every day, hundreds of corrections are received from users around the world and immediately applied to the version-controlled texts.
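The idea of applying corrections immediately while keeping every earlier version can be illustrated with a very small Python sketch. It is a toy model written under my own assumptions (a single string per text, corrections expressed as replacing one unambiguous substring) and says nothing about how the project actually stores its data.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedText:
    """A transcription whose earlier versions are kept so edits can be undone."""
    content: str
    history: list = field(default_factory=list)  # (user, previous content) pairs

    def correct(self, user: str, old: str, new: str) -> None:
        """Apply a correction at once, recording who made it and what came before."""
        if self.content.count(old) != 1:
            raise ValueError("correction must match exactly one location")
        self.history.append((user, self.content))
        self.content = self.content.replace(old, new)

    def revert(self) -> None:
        """Undo the most recent correction."""
        _user, previous = self.history.pop()
        self.content = previous


# Toy OCR output in which 乎 was misread as 平.
page = VersionedText("學而時習之不亦說平")
page.correct("volunteer1", "平", "乎")
print(page.content)       # 學而時習之不亦說乎
print(len(page.history))  # 1
```

Keeping the full history is what makes it safe to publish corrections without prior review: a mistaken edit can always be inspected and rolled back later.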

Thirdly, the creation of a specialized API for text mining use and extension of the primary user interface enables efficient access to the ever-growing data set for use in digital humanities research and teaching. Creation of specialized modules for programming languages such as Python allows for intuitive use in digital humanities teaching contexts, while simple access via JavaScript enables the creation of easy-to-use online tools which can directly access and operate on textual materials stored in the library.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Crowdsourcing a digital library of pre-modern Chinese

Seminar in the Digital Classicist London 2017 series at the Institute of Classical Studies, University of London, 9 June 2017.

Traditional digital libraries, including those in the field of pre-modern Chinese, have typically followed top-down, centralized, and static models of content creation and curation. This is a natural and well-grounded strategy for database design and implementation, with strong roots in traditional academic publishing models, and offering clear technical advantages over alternative approaches. This strategy, however, is unable to adequately meet the challenges of increasingly large-scale digitization and the resulting rapid growth in available corpus size.

In this talk I present a working example of a dynamic alternative to the conventional static model. This alternative leverages a large, distributed community of users, many of whom may not be affiliated with mainstream academia, to curate material in a way that is distributed, scalable, and does not rely upon centralized editing. In the particular case presented, initial transcriptions of scanned pre-modern works are created automatically using specially developed OCR techniques and immediately published in an online open access digital library platform called the Chinese Text Project. The online platform uses this data to implement full-text search, image search, full-text export and other features, while simultaneously facilitating correction of initial OCR results by a geographically distributed group of pseudonymous volunteer users. The online platform described is currently used by around 25,000 individual users each day. User-submitted corrections are immediately applied to the publicly available version-controlled transcriptions without prior review, but are easily validated visually by other users using simple semi-automated mechanisms. This approach allows immediate access to a “long tail” of less popular and less mainstream material which would otherwise likely be overlooked for inclusion in this type of full-text database system. To date the procedure described has been applied to over 25 million pages of historical texts, including 5 million pages from the Harvard-Yenching Library collection, and the complete results published online.

In addition to the online platform, the development of an open plugin system and API allowing customization of the user interface with user-defined extensions and immediate machine-readable access to full-text data and metadata has made possible many further use cases. These include efficient, distributed collaboration and integration with other online web platforms including projects based at Leiden University, Academia Sinica and elsewhere, as well as use in data mining, digital humanities research and teaching, and as a self-service tool for use in projects requiring the creation of proofread transcriptions of particular early texts. A Python library has also been created to further encourage use of the API; in the final part of the talk I explain how the API and this Python library are currently being used to facilitate – and greatly simplify – digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR

Published in the Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), 2017.


Many mainstream OCR techniques involve training a character recognition model using labeled exemplary images of each individual character to be recognized. For modern printed writing, such data can be easily created by automated methods such as rasterizing appropriate font data to produce clean example images. For historical OCR in printing and writing styles distinct from those embodied in modern fonts, appropriate character images must instead be extracted from actual historical documents to achieve good recognition accuracy. For languages with small character sets it may be feasible to perform this process manually, but for languages with many thousands of characters, such as Chinese, manually collecting this data is often not practical.

This paper presents an unsupervised method to extract this data from two unaligned, unstructured, and noisy inputs: firstly, a corpus of transcribed documents; secondly, a corpus of scanned documents of the desired printing or writing style, some fraction of which are editions of texts included in the transcription corpus. The unsupervised procedure described is demonstrated to be capable of using this data, together with an OCR engine trained only on modern printed Chinese, to retrain the same engine to recognize pre-modern Chinese texts with a 43% reduction in overall error rate.
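For readers unfamiliar with how a relative reduction in error rate is calculated, here is a minimal Python sketch. It assumes that error rate means character error rate (edit distance divided by reference length), which is a common convention but may differ from the exact metric used in the paper. The strings are made-up examples, not data from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(reference: str, ocr_output: str) -> float:
    """Errors per reference character."""
    return edit_distance(reference, ocr_output) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which the error rate fell relative to where it started."""
    return (before - after) / before


# Made-up example: two misread characters before retraining, one after.
reference = "學而時習之不亦說乎"
before = char_error_rate(reference, "學而時習之下亦說平")
after = char_error_rate(reference, "學而時習之不亦說平")
print(relative_reduction(before, after))
```

Note that a relative reduction is measured against the starting error rate, so the same percentage can correspond to very different absolute improvements.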

[Full paper]

Posted in Chinese, Digital Humanities | Comments Off