Digital humanities and the digital library

Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”

Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.


数字人文与数字图书馆:中国历代文献的文字识别、群众外包及文本挖掘

本次演讲介绍中国哲学书电子化计划中的主要技术。中国哲学书电子化计划是全球最大规模的前现代中文传世文献电子图书馆之一,目前,每日有25,000多用户使用其公开操作界面。主要原创技术可归类为三种:(一)前现代中文资料的文字识别技术(OCR)、(二)借用大量用户劳力的群众外包界面、(三)既实现与其它线上工具之间的整合、又提供文本挖掘途径的开放式应用程式界面(API)。

第一个原创技术是专门为中国前现代文献设计的文字识别技术。此技术利用前现代文献常见的写作、印刷特征以及已数字化的大量文献来实现具有高精确性以及扩充性的文字识别系统。该系统已处理2,500多万页资料,其结果已在网络上公开。

第二,通过独特的群众外包界面,世界各地的用户可纠正文字识别错误,补充后设资料,从而能够及时参与数字化过程并积极协助内容的扩展。全球用户每日提供上百次的校勘,系统将此及时储存到具有版本控制功能的数据库。

第三,系统的应用程式界面可用于文本挖掘,亦可用于扩充一般使用界面的功能,从而有效地借用日益增长的资料库文本内容来达到数字人文研究和教学的目的。通过此应用程式界面,为Python等程式语言所开发的专门组件可用于数字人文教学;JavaScript组件便于他人开发易用的线上工具,使他人所开发的应用工具能够直接读取和操作电子图书馆中的各种内容。

In this talk I present an overview of key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, the public user interface of which is currently used by over 25,000 people every day. Key technologies used fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts, a practical and successful crowdsourcing interface taking advantage of a large base of users, and an open Application Programming Interface allowing both integration with other online tools and projects as well as open-ended use for text mining purposes.

Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage aspects of common writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. These techniques have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results made freely available online.

Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add further information and metadata, allowing users around the world to meaningfully and immediately contribute to the project and to actively participate in the curation of its contents. Every day, hundreds of corrections from users around the world are received and immediately applied to the version-controlled texts.

Thirdly, the creation of a specialized API for text mining use and extension of the primary user interface enables efficient access to the ever-growing data set for use in digital humanities research and teaching. Creation of specialized modules for programming languages such as Python allows for intuitive use in digital humanities teaching contexts, while simple access via JavaScript enables the creation of easy-to-use online tools which can directly access and operate on textual materials stored in the library.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

Crowdsourcing a digital library of pre-modern Chinese

Seminar in the Digital Classicist London 2017 series at the Institute of Classical Studies, University of London, 9 June 2017.

Traditional digital libraries, including those in the field of pre-modern Chinese, have typically followed top-down, centralized, and static models of content creation and curation. This is a natural and well-grounded strategy for database design and implementation, with strong roots in traditional academic publishing models, and offering clear technical advantages over alternative approaches. This strategy, however, is unable to adequately meet the challenges of increasingly large-scale digitization and the resulting rapid growth in available corpus size.

In this talk I present a working example of a dynamic alternative to the conventional static model. This alternative leverages a large, distributed community of users, many of whom may not be affiliated with mainstream academia, to curate material in a way that is distributed, scalable, and does not rely upon centralized editing. In the particular case presented, initial transcriptions of scanned pre-modern works are created automatically using specially developed OCR techniques and immediately published in an online open access digital library platform called the Chinese Text Project. The online platform uses this data to implement full-text search, image search, full-text export and other features, while simultaneously facilitating correction of initial OCR results by a geographically distributed group of pseudonymous volunteer users. The online platform described is currently used by around 25,000 individual users each day. User-submitted corrections are immediately applied to the publicly available version-controlled transcriptions without prior review, but are easily validated visually by other users using simple semi-automated mechanisms. This approach allows immediate access to a “long tail” of less popular and less mainstream material which would otherwise likely be overlooked for inclusion in this type of full-text database system. To date the procedure described has been applied to over 25 million pages of historical texts, including 5 million pages from the Harvard-Yenching Library collection, and the complete results published online.

In addition to the online platform, the development of an open plugin system and API, allowing customization of the user interface with user-defined extensions and immediate machine-readable access to full-text data and metadata, has made possible many further use cases. These include efficient, distributed collaboration and integration with other online web platforms including projects based at Leiden University, Academia Sinica and elsewhere, as well as use in data mining, digital humanities research and teaching, and as a self-service tool for use in projects requiring the creation of proofread transcriptions of particular early texts. A Python library has also been created to further encourage use of the API; in the final part of the talk I explain how the API together with this Python library is currently being used to facilitate – and greatly simplify – digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.


Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR

Published in the Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), 2017.

Abstract

Many mainstream OCR techniques involve training a character recognition model using labeled exemplary images of each individual character to be recognized. For modern printed writing, such data can be easily created by automated methods such as rasterizing appropriate font data to produce clean example images. For historical OCR in printing and writing styles distinct from those embodied in modern fonts, appropriate character images must instead be extracted from actual historical documents to achieve good recognition accuracy. For languages with small character sets it may be feasible to perform this process manually, but for languages with many thousands of characters, such as Chinese, manually collecting this data is often not practical.

This paper presents an unsupervised method to extract this data from two unaligned, unstructured, and noisy inputs: firstly, a corpus of transcribed documents; secondly, a corpus of scanned documents of the desired printing or writing style, some fraction of which are editions of texts included in the transcription corpus. The unsupervised procedure described is shown to be capable of using this data, together with an OCR engine trained only on modern printed Chinese, to retrain the same engine to recognize pre-modern Chinese texts with a 43% reduction in overall error rate.

[Full paper]

Posted in Chinese, Digital Humanities | Comments Off

Text Tools for ctext.org

This tutorial introduces some of the main functionality of the “Text Tools” plugin for the Chinese Text Project database and digital library along with suggested example tasks and use cases.

[Online version of this tutorial: https://dsturgeon.net/texttools (English); https://dsturgeon.net/texttools-ja (Japanese)]

Initial setup

  • If you haven’t used the Chinese Text Project before, please refer to the tutorial “Practical introduction to ctext.org” for details of how to create a ctext.org account and install a plugin.
  • Make sure you are logged in to your ctext.org account.
  • If you have an API key, save it into your ctext.org account using the settings page. Alternatively, if your institution subscribes to ctext.org and you are not using a computer on your university’s local network, follow your university’s instructions to connect to their VPN.
  • Install the “Text Tools” plugin (installation link) – you only need to do this once.
  • Once these steps have been completed, when you open a text or chapter of text on ctext.org, you should see a link to the Text Tools plugin.

Getting started

The Text Tools program has a number of different pages (titled “N-gram”, “Regex”, etc.) which can be switched between using the links at the top of the page. Each page corresponds to one of the tools described below, except for the Help page, which explains the basic usage and options for each of the tools. These include tools for textual analysis as well as simple data visualization.

The textual analysis tools are designed to operate on textual data which can either be read in directly from ctext.org via API, or copied into the tool from elsewhere. If you open the tool by using the ctext.org plugin, that text will be automatically loaded and displayed. To load additional texts from ctext, copy the URN for the text (or chapter) into the box labeled “Fetch text by URN” in the Text Tools window, and click “Fetch”. When the text has loaded, its contents will be displayed along with its title. To add more texts, click “Save/add another text”, then repeat the procedure. The list of currently selected texts is displayed at the top of the window.
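For readers who want to script the same “fetch by URN” step outside the plugin, the following is a minimal sketch rather than a description of how Text Tools itself works. It assumes that the ctext.org API exposes a JSON endpoint at https://api.ctext.org/gettext taking a “urn” parameter, and that a chapter-level response carries its paragraphs in a “fulltext” list; both should be checked against the API documentation. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint of the ctext.org API; verify against the API documentation.
API_BASE = "https://api.ctext.org/gettext"


def build_gettext_url(urn):
    """Return the request URL for the text or chapter identified by a URN."""
    return API_BASE + "?" + urllib.parse.urlencode({"urn": urn})


def paragraphs_from_response(data):
    """Pull the paragraph list out of a decoded response.

    Assumes a 'fulltext' field holding a list of strings; returns an
    empty list when the field is absent (e.g. for a multi-chapter text).
    """
    return list(data.get("fulltext", []))


def fetch_paragraphs(urn):
    """Download one chapter and return its paragraphs (makes a network request)."""
    with urllib.request.urlopen(build_gettext_url(urn)) as response:
        return paragraphs_from_response(json.load(response))
```

Only fetch_paragraphs touches the network; the other two helpers can be tried offline with a hand-written dictionary.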

N-grams

“N-grams” are sequences of n consecutive textual items, where n is some fixed integer (e.g. n=1, n=3, etc.). The “textual items” are usually either terms (words) or characters; for Chinese in particular, characters are frequently used rather than words because of the difficulty of accurately segmenting Chinese text into a sequence of separate words automatically. For instance, the sentence “學而時習之不亦說乎” contains the following character 3-grams (i.e. unique sequences of exactly three characters): “學而時”, “而時習”, “時習之”, “習之不”, “之不亦”, “不亦說”, “亦說乎”.

The Text Tools “N-gram” function can be used to give a simple overview of various types of word usage in Chinese texts by means of character n-grams. The simplest cases of n-grams are 1-grams, which are simply character occurrence counts or frequencies.
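The definition of character n-grams given above can be sketched in a few lines of Python. This is an illustration of the concept only, not the plugin's own code; the function name char_ngrams is made up for this example.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every run of n consecutive characters, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "學而時習之不亦說乎"

# Character 3-grams: a window of three characters slid along the sentence.
trigrams = char_ngrams(sentence, 3)

# 1-grams are simply per-character occurrence counts.
unigram_counts = Counter(char_ngrams(sentence, 1))

print(trigrams)
print(unigram_counts.most_common(3))
```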

Exercise:
  • Try computing 1-grams for two or three texts from ctext – you will need to set “Value of n” to 1 to do this. To better visualize the trends, use the “Chart” link to plot a bar chart of the raw data. Try this with and without normalization.
  • Repeat with 2- and 3-grams.
  • If you chose texts which ought to be broadly comparable in length, try repeating with two texts of vastly different lengths and/or styles (e.g. 道德經 and 紅樓夢) with and without normalization to demonstrate how this alters the results.
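To see why normalization matters when texts differ greatly in length, here is a small sketch. It assumes that normalization means dividing each count by the total number of n-grams in the text; Text Tools may define it differently. The two sample “texts” are artificial.

```python
from collections import Counter


def ngram_frequencies(text, n=1, normalize=True):
    """Count character n-grams, optionally dividing by the total count."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    if not normalize:
        return dict(counts)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


short_text = "道可道"
long_text = "道可道" * 100  # same content repeated: 100 times longer

raw_short = ngram_frequencies(short_text, normalize=False)
raw_long = ngram_frequencies(long_text, normalize=False)
norm_short = ngram_frequencies(short_text)
norm_long = ngram_frequencies(long_text)

# Raw counts differ by a factor of 100; normalized values are the same.
print(raw_short["道"], raw_long["道"])
print(norm_short["道"], norm_long["道"])
```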

Word clouds are another type of visualization that can be made with this type of data, in which labels are drawn in different sized text proportional to their frequency of occurrence (or, more usually, proportionally to the log of their frequency). Typically word clouds are created from a single text or merged corpus, using either characters or words; however, the same principles extend naturally to n-grams (and regular expressions) generally, as well as to multiple texts. In Text Tools, visualizing data for multiple texts causes the data for each distinct text to be displayed in a different color. Similar comments apply regarding normalization: if counts for different texts are not normalized according to length, longer texts will naturally tend to have larger labels.
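The effect of a log scale on label size can be illustrated with a short sketch. The mapping used here (linear interpolation between a minimum and maximum font size) is just one possible choice, not necessarily the one used by Text Tools, and the counts are invented.

```python
import math


def label_sizes(counts, min_size=10.0, max_size=60.0, use_log=True):
    """Interpolate a font size for each label from its (optionally logged) count."""
    values = {label: math.log(count) if use_log else float(count)
              for label, count in counts.items()}
    low, high = min(values.values()), max(values.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all counts are equal
    return {label: min_size + (value - low) / span * (max_size - min_size)
            for label, value in values.items()}


counts = {"之": 1000, "道": 100, "仁": 10}  # invented counts for illustration

linear_sizes = label_sizes(counts, use_log=False)
log_sizes = label_sizes(counts, use_log=True)

print(linear_sizes)
print(log_sizes)
```

On a linear scale the middle label ends up almost as small as the rarest one; on a log scale it sits halfway between the two extremes.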

Exercise:
  • Create word clouds for a single text, and for two or more texts. Experiment with the “Use log scale” setting in the Word cloud tab – it should quickly become clear why a log scale is usually used for word clouds.

Textual similarity

The Similarity tool uses n-gram shingling to identify and visualize text reuse relationships. To use it, first load one or more texts, select any desired options, and click “Run”.

What this tool identifies are n-grams shared between parts of the specified texts: rather than reporting all n-grams (as the N-gram tool does), it reports only n-grams that are repeated in more than one place, and calculates the total number of shared n-grams between pairs of chapters. Thus, unlike the N-gram tool (when “minimum count” is set to 1), larger values of n will produce fewer results, because shorter n-grams are more likely to occur in multiple places, while longer ones are less common and more strongly indicative of a text reuse relationship between the items being compared.
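The idea of counting shared n-grams between pairs of chapters can be sketched as follows. This is a simplified illustration that counts distinct shared n-grams; how Text Tools actually counts and weights matches is not specified here. The chapter strings are shortened, unpunctuated samples.

```python
from itertools import combinations


def ngram_set(text, n):
    """Return the set of distinct character n-grams in a text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngram_counts(chapters, n):
    """Map each pair of chapter names to the number of n-grams they share."""
    grams = {name: ngram_set(text, n) for name, text in chapters.items()}
    result = {}
    for a, b in combinations(sorted(chapters), 2):
        shared = grams[a] & grams[b]
        if shared:
            result[(a, b)] = len(shared)
    return result


# Shortened sample strings standing in for whole chapters.
chapters = {
    "xue-er": "子曰巧言令色鮮矣仁",
    "yang-huo": "子曰巧言令色鮮矣仁",
    "wei-zheng": "子曰君子不器",
}

print(shared_ngram_counts(chapters, 2))  # short n-grams: every pair shares "子曰"
print(shared_ngram_counts(chapters, 5))  # longer n-grams: only the real parallel remains
```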

There are two tabs within the output for the similarity tool: the “Matched text” tab shows the n-grams which matched, with brighter shades of red corresponding to greater numbers of overlapping n-grams; the “Chapter summary” tab aggregates the counts of matched n-grams between all pairs of chapters.

Exercise:
  • Run the similarity tool on the Analects with n=5.
  • Experiment with the “Constraint” function by clicking on chapter titles to limit the display to passages having parallels with the specified chapter or pair of chapters.
  • Select a few of the matched n-grams by clicking on them; this will result in a different type of constraint showing where exactly that n-gram was matched.
  • Text reuse can be visualized as a weighted network graph. You can do this for your n-gram similarity results by clicking the “Create graph” link in the “Chapter summary” tab, then clicking “Draw”.
  • Which chapters of the Analects have the strongest text reuse relationship according to this metric? You can probably see this straight away from the graph; however, you can also check this numerically by returning to the Chapter summary tab, and sorting the table by similarity – clicking on the titles of columns in any Text Tools table sorts it by that column (click a second time to toggle sort order).
  • Returning to the graph (you can click the “Network” link at the top of the page to switch pages), the edges of the graph have a concrete meaning defined in terms of identified similarities. Double-clicking on an edge will reopen the Similarity tool, with the specific similarities underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges.
  • Experiment with increasing and decreasing the value of n – how does this affect the results?
  • By default, the graph contains edges representing every similarity identified. Particularly for smaller values of n, some of these relationships will not be significant, and this may result in edges being drawn between almost all pairs of nodes in the graph, complicating the picture and obscuring genuine patterns. Experiment with simplifying the graph by setting a threshold (e.g. 0.001) for the “Skip edges with weight less than” setting – this will simplify the graph by removing those edges with relatively small amounts of reuse. Compare this with the results of increasing the value of n in the similarity tool, which will also decrease the number of edges as more trivial similarities are excluded.
  • The Similarity tool also works with multiple texts; if multiple texts are loaded and a graph is created, different colors will be used to distinguish between chapters of different texts. Try this with the Xunzi and the Zhuangzi, two very dissimilar texts which nonetheless do have reuse relationships with one another (this may take a few seconds to run – the similarity tool will take longer for larger amounts of text).

Regular expressions

A regular expression (often shortened to “regex”) is a pattern which can be searched for in a body of text. In the simplest case, a regular expression is simply a string of characters to search for; however by supplementing this simple idea with specially designed syntax, it is possible to express much more complex ways of searching for data.

The regex tool makes it possible to search within one or more texts for one or more regular expressions, listing matched text as well as aggregating counts of results per-text, per-chapter, or per-paragraph.

Exercise:
  • The simplest type of regular expression is simply a character string search – i.e. a list of characters in order which will match (only) that precise sequence of characters – one type of full-text search. Try searching the text of the Analects for something you would expect to appear in it (e.g. “君子”).
  • Examine the contents of the “Matched text” and “Summary” tabs.
  • Add a second search phrase (e.g. “小人”) to your search, and re-run the regex.
  • Re-run the same search again using the same two regular expressions, but changing “Group rows by” from the default “None” to “Paragraph”. When you do this, the “Summary” tab will show one row for every passage in the Analects. Try clicking on a numbered paragraph (these numbers are chosen automatically starting from the beginning of the text) – this will highlight the passage corresponding to that row.
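What “Group rows by Paragraph” computes can be approximated in Python as below. This is a sketch of the idea, not the tool's implementation; the three paragraphs are short quotations from the Analects used as sample data, and the function name is hypothetical.

```python
import re


def count_matches(paragraphs, patterns):
    """Return one (paragraph number, {pattern: match count}) row per paragraph."""
    compiled = [(pattern, re.compile(pattern)) for pattern in patterns]
    rows = []
    for number, paragraph in enumerate(paragraphs, start=1):
        row = {pattern: len(regex.findall(paragraph)) for pattern, regex in compiled}
        rows.append((number, row))
    return rows


paragraphs = [
    "君子周而不比,小人比而不周。",
    "君子不器。",
    "學而時習之,不亦說乎?",
]

rows = count_matches(paragraphs, ["君子", "小人"])
for number, row in rows:
    print(number, row)
```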

Search results like these can be relational when grouped by a unit such as a paragraph or chapter: if two terms appear together in the same paragraph (or chapter), this can indicate some relationship between the two; if they repeatedly occur together in many paragraphs, this may indicate a stronger relationship between the two in that text. It is thus possible to use a network graph to visualize this information; you can do this in Text Tools by running regular expressions and setting “Group rows by” to “Paragraph”.
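The co-occurrence idea can be made concrete with a short sketch: for each paragraph, find which search terms occur in it, and add one to the weight of every pair of terms found together. Counting one per paragraph is an assumption made for this example; Text Tools may weight edges differently. The paragraphs are short quotations from the Analects.

```python
import re
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraphs, terms):
    """Count, for each pair of terms, the paragraphs in which both occur."""
    edges = Counter()
    for paragraph in paragraphs:
        present = sorted(term for term in terms if re.search(term, paragraph))
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


paragraphs = [
    "君子周而不比,小人比而不周。",
    "君子喻於義,小人喻於利。",
    "禮云禮云,玉帛云乎哉?樂云樂云,鐘鼓云乎哉?",
]

edges = cooccurrence_edges(paragraphs, ["君子", "小人", "禮", "樂"])
for (a, b), weight in edges.items():
    print(a, b, weight)
```

Each key of the result is an edge of the network graph, and its count is the edge weight.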

Exercise:
  • Search for the terms 父母, 君子, 小人, 禮, and 樂 in the Analects, and construct a network graph based on their co-occurrence in the same paragraphs of text.
  • Double-clicking on an edge in this graph will reopen the Regex tool, with the specific matches underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges, to see what data they actually represent.
  • Using the same method but specifying a list of character names (寶玉, 黛玉, 寶釵, etc. – you can get a list of more names from Wikipedia), map out how character names co-occur in paragraphs of the Hongloumeng. Note: you will need to make sure that you choose names frequently used in the actual text (e.g. “賈寶玉” is only infrequently used; “寶玉” is far more common – and will also match cases of “賈寶玉”). This is one example of social network analysis.
  • When you set “Group rows by” to “None”, you can temporarily add constraints to the “Matched text” view to show only those paragraphs which matched a particular search string. You can set or remove a constraint by clicking on a matched string in the “Matched text” view; you can also click the text label of an item in the “Summary” view to set that item as the constraint, and so see at a glance which paragraphs contained that particular string. Re-run your search with the same terms but in “None” mode, and use this to quickly see which passages the least-frequently occurring name from your list appeared in.

[A word of caution: when performing this type of search, it is important to examine the matched text to confirm whether “too much” may be matched, as well as whether other things may be missed. In the Hongloumeng example above, for instance, although the vast majority of string matches for “寶玉” in the text do indeed refer to 賈寶玉, another character called “甄寶玉” appears later in the novel, and occurrences of that name will also match a simple search for the string “寶玉”. In this particular example, this can be avoided by constructing a regular expression that excludes these other cases, such as the negative lookbehind “(?<!甄)寶玉”, which will match the string “寶玉” only when it does not come immediately after a “甄”.]
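The effect of excluding “甄寶玉” can be checked on a small example. The sketch below uses Python's negative lookbehind, “(?<!甄)”, which is the construct that tests the character immediately before the match; the sample sentence is invented for illustration and is not a quotation from the novel.

```python
import re

# Invented sample sentence containing both names.
text = "賈寶玉見了甄寶玉,寶玉笑道"

# A plain string search matches every occurrence, including the one in 甄寶玉.
plain_positions = [m.start() for m in re.finditer("寶玉", text)]

# The negative lookbehind rejects a match when the preceding character is 甄.
filtered_positions = [m.start() for m in re.finditer("(?<!甄)寶玉", text)]

print(plain_positions)
print(filtered_positions)
```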

So far we have only used the simplest type of regular expressions. Regular expressions also allow for the specification of more complex patterns to be matched in the same way as the simple string searches we have just done – for example, the ability to specify a search for a pattern like “以[something]為[something]”, which would match things like “以和為量”, “以生為本”, or “以我為隱”. In order to do this, regular expressions are created by building on any fixed characters we want to match with the addition of “special” characters that describe patterns we are looking for.

Some of the most useful types of special syntax available in regular expressions are summarized in the following table:

Syntax      Meaning
.           Matches any one character exactly once
[abcdef]    Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef]   Matches any one character other than a,b,c,d,e,f
(xyz)       Matches xyz, and saves the result as a numbered group
?           After a character/group, makes that character/group optional (i.e. match zero or 1 times)
?           After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
*           After a character/group, makes that character/group match zero or more times
+           After a character/group, makes that character/group match one or more times
{2,5}       After a character/group, makes that character/group match 2,3,4, or 5 times
{2,}        After a character/group, makes that character/group match 2 or more times
{2}         After a character/group, makes that character/group match exactly 2 times
\3          Matches whatever was matched into group number 3 (first group from left is numbered 1)

The syntax may seem complex, but it is quite easy to get started with. For instance, the first special syntax listed in the table above – a dot (“.”) – matches any one character. So the example above of “以[something]為[something]” can be expressed as the regular expression “以.為.”, read as “match the character ‘以’, followed by any one character, followed by ‘為’, followed by any one character”.

Exercise:
  • Try the regex “以.為.” from the example above in the Zhuangzi, using “Group by None”.
  • In the results of this regex search, you will notice that some matches may not correspond to exactly the type of expression we are really looking for. For example, the above regex will also match “以汝為?”, because punctuation characters are also counted as “characters” when matching regular expressions. One way to exclude these matches from the results is to use a negative character class (which matches everything except a specified list of characters) in the regex instead of the “.” operator (which simply matches any character). A corresponding regex for this example is “以[^。?]為[^。?]” – try this and confirm that it excludes these cases.
  • Because there are many possible punctuation characters, within Text Tools you can also use the shorthand “\W” (upper-case) to stand for any commonly used Chinese punctuation character, and “\w” (lower-case) for any character other than commonly used Chinese punctuation. You should get the same result if you try the previous regex written instead as “以\w為\w”. (Although this is a common convention for English regexes, “\w” and “\W” work slightly differently in different regex implementations, and many do not support this for Chinese.)
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Any four characters where the middle is “之不” – i.e. “視之不見”, “聽之不聞”, etc.
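The difference between “.”, a negative character class, and “\w” can also be tried outside Text Tools. The sketch below uses Python's re module on an invented sample string. Note that Python's “\w” is not identical to the Text Tools shorthand: in Python it matches Unicode letters (including Chinese characters), digits and the underscore, which happens to exclude Chinese punctuation as well.

```python
import re

# Invented sample string built from the phrases discussed above.
sample = "以和為量。以汝為?以生為本。"

# "." matches any character, so the trailing "?" is accepted too.
any_char = re.findall("以.為.", sample)

# A negative character class excludes the listed punctuation marks.
no_punct = re.findall("以[^。?]為[^。?]", sample)

# Python's \w: any Unicode word character, which excludes punctuation.
word_class = re.findall(r"以\w為\w", sample)

print(any_char)
print(no_punct)
print(word_class)
```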

Repetition

Repetition can be accomplished using various repetition operators and modifiers listed in the table above.

  • We can ask that any part of our regular expression be repeated some number of times using the “{a,b}” operator. This modifies the immediately preceding item in the regex (e.g. a specification of a character, or a group), requiring it to be repeated at least a times and at most b times (or any number of times, if b is left blank). If we omit the comma and just write “{a}”, this means that the preceding item must be repeated exactly a times.
  • For example, “仁.{0,10}義” will match the character “仁”, followed by anything from 0 to 10 other characters, followed by the character “義” – it will therefore match things like “仁義”, “仁為之而無以為;上義”, “仁,託宿於義”, etc.
  • The same method works with groups, and requires that the pattern specified by the group (not its contents) be repeated the specified number of times. So for instance “(人.){2,}” will match “人來人往”, “人前人後”, and also “人做人配人疼”.
  • The “+”, “*”, and “?” operators work in exactly the same way as this after a character or group: “+” is equivalent to “{1,}”, “*” to “{0,}”, and “?” to “{0,1}”. (They are, however, frequently used because they are shorter to write.)
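The repetition examples above can be verified with a few lines of Python. This sketch uses the standard re module, whose repetition syntax is the same as that described here; the test strings are the ones quoted in the list above.

```python
import re

# "仁", then 0 to 10 arbitrary characters, then "義".
ren_yi = re.compile("仁.{0,10}義")
examples = ["仁義", "仁為之而無以為;上義", "仁,託宿於義"]
ren_yi_results = [bool(ren_yi.fullmatch(s)) for s in examples]

# The group "人" + any character, repeated at least twice.
ren_pairs = re.compile("(人.){2,}")
group_matches = [m.group(0) for m in ren_pairs.finditer("人來人往,人做人配人疼")]

# "*" is shorthand for "{0,}": both forms accept exactly the same strings.
same_as_star = all(
    bool(re.fullmatch("仁.*義", s)) == bool(re.fullmatch("仁.{0,}義", s))
    for s in examples + ["仁", "義仁"]
)

print(ren_yi_results)
print(group_matches)
print(same_as_star)
```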

Exercise:
  • Try the two specific examples described above (i.e. “仁.{0,10}義” and “(人.){2,}”).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Each “phrase” (i.e. punctuated section) of text. In other words, the first match should be “道可道”, the second should be “非常道”, and so on.
    • Match each phrase which contains the term “之” in it.
    • Match each phrase which contains the term “之” in it, but neither as the first character nor as the last.
  • Write and test regular expressions to match the following in the Mozi (ctp:mozi):
    • Any occurrences of the character “君” followed anywhere later in the same sentence by “父” (e.g. “君父”, “…君臣父…”, “君臣上下長幼之節,父…”, etc.).

Groups

Aside from repetition, a lot of the power of regular expressions comes from the ability to divide parts of a match into what are called “groups”, and express further conditions using the matched contents of these groups. This makes it possible to express much more sophisticated patterns.

  • Suppose we want to look for expressions like “君不君”, “臣不臣”, “父不父”, etc. – cases where we have some character, followed by a “不”, then followed by that same character from before (i.e. we aren’t trying to match things like “人不知”).
  • We can do this by “capturing” the first character – whatever it may be – in a group, and then requiring later in our expression that we match the contents of that group again in another place.
  • Capturing something in a group is accomplished by putting parentheses around the part to capture – e.g. “(.)” matches any character and captures it in a group.
  • Groups are automatically numbered starting from 1, beginning with the leftmost opening bracket, and moving through our regex from left to right.
  • We can reference the contents of a matched group using the syntax “\1” to match group 1, “\2” to match group 2, etc.
  • So in our example, “(.).\1” matches any character, followed by any character, followed by the first character again (whatever it was). Try this on the text of the Analects, then try modifying the regex so that it only matches non-punctuation characters (i.e. does not match things like “本,本”).

Another example is a common type of patterned repetition such as “禮云禮云” and “已乎已乎”. In this case, we can use exactly the same approach. One way is to write “(..)\1” – match any two characters, then match those same two characters again; another (equivalent) way is to use two separate groups and write “(.)(.)\1\2” – match any character X, then any character Y, then match X again and then Y again.
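Capturing groups and backreferences behave the same way in Python's re module, so the patterns above can be tested with a short sketch. The test strings are the examples quoted in this section.

```python
import re

# Any character, any character, then the first character again.
x_any_x = re.compile(r"(.).\1")
x_any_x_results = {s: bool(x_any_x.fullmatch(s)) for s in ["君不君", "臣不臣", "人不知"]}

# Two equivalent ways of matching a repeated two-character unit.
one_group = re.compile(r"(..)\1")
two_groups = re.compile(r"(.)(.)\1\2")
doubled_results = {
    s: (bool(one_group.fullmatch(s)), bool(two_groups.fullmatch(s)))
    for s in ["禮云禮云", "已乎已乎", "禮云樂云"]
}

print(x_any_x_results)
print(doubled_results)
```

The last test string, “禮云樂云”, is rejected by both patterns because its first two characters are not repeated exactly.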

Exercise:
  • Write and test a regular expression which matches things like “委委佗佗”, “戰戰兢兢”, etc. in the Book of Poetry (ctp:book-of-poetry).
  • Write and test a regular expression which matches complex repetition of the style “XYZ,ZYX” in the Zhuangzi, where each of X, Y, and Z can be 1-5 characters long. Your regex should match things like “知者不言,言者不知”, “道無以興乎世,世無以興乎道”, and “安其所不安,不安其所安”.

Regex replace

The replace function works in a similar way to the regex search function: it searches within one specified text for a specified regular expression, and replaces all occurrences of it with a specified value. Although the replacement can be a simple string of characters, it can also be designed to vary depending upon what the regular expression matched. Specifically, anything that has been matched as a group within the search regex can be referenced in the replacement by using the syntax “$1” to include the text matched in group 1, “$2” for group 2, etc. One common use case for regex replacements is to “tidy up” data obtained from some external source, or to prepare it for use in some particular procedure.

For example:

  • Replacing “\W” with “” (an empty string) will delete all punctuation and line breaks from a text.
  • Replacing “^(\w{1,20})$” with “*$1” will add title markers to any lines which contain between 1 and 20 characters, none of which are punctuation characters – this can be useful when importing non-ctext texts.
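The same two replacements can be approximated in Python as a rough sketch. Two differences from Text Tools should be kept in mind: Python writes group references in the replacement as “\1” rather than “$1”, and Python's “\w” and “\W” are defined in terms of Unicode word characters rather than a list of Chinese punctuation. The sample text is made up of a title line followed by one punctuated line.

```python
import re

raw = "道德經\n道可道,非常道。"

# Delete everything that is not a word character (punctuation and line breaks).
stripped = re.sub(r"\W", "", raw)

# Prefix "*" to whole lines of 1 to 20 word characters; re.MULTILINE makes
# "^" and "$" match at the start and end of each line.
titled = re.sub(r"^(\w{1,20})$", r"*\1", raw, flags=re.MULTILINE)

print(stripped)
print(titled)
```

Only the first line receives a title marker, because the second line contains punctuation and so cannot be matched by “\w{1,20}” from start to end.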

Identifying differences between versions

The “Diff” tool provides a simple way of performing a character-by-character “Diff” of two similar pieces of text. Unlike the Similarity tool, this tool works best on input texts which are almost (but not quite) identical to one another.

Try using the Diff tool to compare the contents of the 正統道藏 edition of the 太上靈寶天尊說禳災度厄經 (ctp:wb882781) with the 重刊道藏輯要 edition of the same text (ctp:wb524325).
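A character-by-character comparison of this kind can be sketched with Python's standard difflib module. This is an illustration of the general idea, not the algorithm used by the Diff tool, and the two short strings are invented variants rather than readings from the editions named above.

```python
import difflib


def char_diff(a, b):
    """List the non-matching segments between two strings, character by character."""
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two nearly identical strings differing in a single character.
differences = char_diff("道可道非常道", "道可道非恆道")
print(differences)
```

Each tuple gives the kind of change (“replace”, “delete” or “insert”) together with the differing text from each side.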

Network graphs

When you create a graph using the regular expression or similarity tools, the data is exported into the Network tab. For navigation instructions, refer to the “Help” tab. Graphs in the network tab can be entered in a subset of the “GraphViz” format; the graphs created by the other tabs can all be downloaded in this same format. If you would like a more flexible way of creating publication quality graphs, you can download and install Gephi (https://gephi.org/), which is also able to open these files.
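If you have computed edge weights yourself, they can be written out as a simple undirected GraphViz graph, as in the sketch below. The output is valid GraphViz “DOT” syntax, but which subset of the format (and which attributes, such as “weight”) the Network tab accepts is an assumption to be checked against the Help page. The function name and sample edges are hypothetical.

```python
def to_graphviz(edges):
    """Serialize {(node_a, node_b): weight} as an undirected GraphViz graph."""
    lines = ["graph {"]
    for (a, b), weight in sorted(edges.items()):
        lines.append(f'  "{a}" -- "{b}" [weight={weight}];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edge weights, e.g. from a co-occurrence count.
sample_edges = {("君子", "小人"): 2, ("禮", "樂"): 1}
dot_text = to_graphviz(sample_edges)
print(dot_text)
```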

Using other texts

Chinese texts from other sources besides ctext.org can be used with Text Tools. For instructions on how to prepare these, refer to the section on Loading texts on the Help page.

Creative Commons License

Harvard-Yenching Library East Asian Digital Humanities Series

Looking forward to discussing the Chinese Text Project at the second meeting of this exciting new series!

Introducing the Chinese Text Project

The Chinese Text Project is an online open-access digital library that makes pre-modern Chinese texts available to readers and researchers all around the world. The site attempts to make use of the digital medium to explore new ways of interacting with these texts that are not possible in print. With over thirty thousand titles and more than five billion characters, the Chinese Text Project is also the largest database of pre-modern Chinese texts in existence. In our second meeting of the Harvard-Yenching Library Forum, Dr. Donald Sturgeon, the founder and the developer of the Chinese Text Project and now a post-doctoral fellow at the Fairbank Center for Chinese Studies, will give a short introduction to the database and the rationale behind it, followed by an open discussion on the issues of databases and digital scholarship.

Speaker: Dr. Donald Sturgeon (Fairbank Center for Chinese Studies)
Time: 12-1pm, Mar. 22 (Wed)
Location: Common Room, Harvard-Yenching Library (2 Divinity Ave)

Light refreshments provided.
Please RSVP to Feng-en Tu (hyl.eadh@gmail.com)


Practical introduction to ctext.org

This tutorial briefly summarizes some of the most common tasks on the Chinese Text Project database and digital library from a user perspective, with suggested example tasks intended to introduce core functionality of the system.

[Online version of this tutorial: https://dsturgeon.net/ctext (English); https://dsturgeon.net/ctext-ja (Japanese)]

Initial setup

  • Create an account: Scroll down to the bottom of the left-hand pane and click “Log in”, then fill out the “If you do not have an account…” section.
  • Check font support: Under “About the site” near the top-left corner, click on “Font test page”.

Finding texts

  • Use the “Title search” function on the left-hand pane.
  • Texts that display the “” icon in title search results are linked to scanned sources.
  • Key to icons used in title search results:
    Transcription in the textual database (not user editable).
    User-editable transcription, not generated using OCR.
    User-editable transcription generated using OCR.
    Scanned copy of a particular edition of a text.
  • Exercise:
    • Locate a transcription of the 資暇集.
    • Locate a pre-Qin or Han dynasty text in the textual database.

Full-text search

  • First locate and open the transcription of the text (or chapter/juan) you wish to search in, then use the box labeled “Search” near the bottom of the left-hand pane.
  • Exercise:
    • Locate the passage in the Analects where Confucius says “君子不器”.
    • Locate all passages in the Zhuangzi where “道” is mentioned.
  • When you search a text in the textual database and get many results, you can use the “Show statistics” link at the top-right to display an interactive summary of where the results appear.

Locating text in scanned primary sources

  • On ctext, scanned representations of texts are searched by means of linked transcriptions. If a transcription has a linked scan, it will display the “” icon in the title search results.
  • Where a text is linked to a scan, clicking the “” icon to the left of any paragraph of text opens the corresponding page of the scan.
  • To search for a specific term or phrase in a scanned text, search the associated transcription for the term or phrase, then click the “” icon to the left of the search result.
  • Particularly in cases where transcriptions have been created using OCR, errors in the transcription may mean that longer phrases are not matched. Try searching for a shorter phrase, or using words that should appear near the text you are looking for.
  • Exercise:
    • Locate a text with scan links, and try searching and viewing the results within the scanned representation.
    • Try doing this with an OCR-derived transcription.
    • Texts can also be searched from the “Library” section of the site containing the scanned texts – this produces exactly the same result as searching the linked transcription.
    • You can also navigate through the scanned representation page by page using the links provided.
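
The suggestion above to fall back to shorter phrases can be sketched in Python as follows. This is only an illustration of the idea, applied to text you already have locally (for example, text exported with a plugin). The function name `find_with_fallback` and the sample strings are hypothetical, and this is not how ctext.org itself performs searches.

```python
def find_with_fallback(text, phrase, min_length=2):
    """Search for phrase; if absent, try progressively shorter substrings of it.

    Returns (matched_substring, position), or (None, -1) if nothing of at
    least min_length characters is found.
    """
    for length in range(len(phrase), min_length - 1, -1):
        # Try every substring of the current length, from left to right.
        for start in range(len(phrase) - length + 1):
            candidate = phrase[start:start + length]
            position = text.find(candidate)
            if position != -1:
                return candidate, position
    return None, -1


# Hypothetical OCR output in which 器 was misrecognized as 噐.
ocr_text = "子曰君子不噐子貢問君子"
match, position = find_with_fallback(ocr_text, "君子不器")
print(match, position)
```

Here the full four-character phrase is not found because of the misrecognized character, but the three-character prefix still locates the passage.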

Finding parallels to a passage of text

Available for pre-Qin and Han texts and Leishu in the textual database.

  • Locate a passage of text, then click the “” icon to display the parallel summary.
  • Within the results, click the “” icon beside a heading to display each result along with the context in which it occurs.
  • Exercise:
    • Find what parallels there are in the classical corpus to the famous passage in the Zhuangzi describing Pao-ding 庖丁 cutting up an ox.

Finding parallels between two particular texts

Available for pre-Qin and Han texts and Leishu in the textual database.

  • Click the “Advanced search” link towards the bottom of the left-hand pane.
  • In the section labeled “1. Scope”, select the first category, text, or textual unit you wish to search in. For example, to search within the Zhuangzi, you would choose “Pre-Qin and Han”, “Daoism”, then “Zhuangzi” (and leave the fourth box with “[All]” in it).
  • In the section labeled “3. Search parameters”, tick the box under “Parallel passage search”, and select the category, text, or textual unit you wish to locate parallels to, in the same way as in the previous step.
  • Click “Search”. The results will list all passages containing parallels.
  • Exercise:
    • Try locating all parallels between the Analects and texts in the “Daoism” category.
    • When you have the results, try clicking the “Show statistics” link.
    • Perform the same search again, but with the “Scope” and “Search parameters” reversed, and again try using “Show statistics”.
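
To give a feel for what a parallel passage search involves, here is a deliberately naive Python sketch that finds character n-grams shared by two passages. It is an illustration under stated assumptions only: the function names, the choice of n, and the sample strings are invented, and the method ctext.org actually uses is not described in this tutorial.

```python
def ngrams(text, n):
    """Return the set of all character n-grams in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def shared_ngrams(a, b, n=4):
    """Return the sorted list of n-grams occurring in both a and b."""
    return sorted(ngrams(a, n) & ngrams(b, n))


# Unpunctuated sample strings, chosen only for illustration.
passage_a = "學而時習之不亦說乎"
passage_b = "子曰學而時習之"
print(shared_ngrams(passage_a, passage_b))
```

With real texts you would probably want to strip punctuation first, since editorial punctuation differs between editions and would otherwise break up shared sequences.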

Locating text by concordance number

Available for texts with concordance data.

  • First open the contents page for the transcription of the text. On the right hand side of the page, one search box will be displayed for each type of concordance number supported for that text.
  • Exercise:
    • In this paper by Eric Hutton, the author uses concordance numbers from both the ICS series and the Harvard Yenching series to identify textual references without quoting the Chinese text, e.g. here:


      Use the concordance lookup function on ctext.org to locate the original Chinese corresponding to the passage the author translates and references in footnote 17 above in Xunzi.

Getting the concordance number for a piece of text

Available for texts with concordance data.

  • Click the “” icon to the left of any passage with concordance data available.
  • Moving the mouse over the displayed passage will cause all concordance lines to which that segment of text belongs to be displayed.
  • To display only those concordance lines relevant to a particular part of the passage, click and drag with the mouse to highlight part of the text (which will appear in green). All concordance lines containing any of the highlighted section of text will be listed.
  • Exercise:
    • Following on from the previous example, identify the concordance lines which correspond to the line “人之性惡,其善者偽也。” in the Xunzi.

Viewing text and translation side by side

Available for texts with an English translation.

  • Normally when viewing texts with translations, one (potentially quite long) paragraph of Chinese is displayed, followed by the corresponding paragraph of English. It is also possible to display the text and translation aligned much more closely (usually phrase by phrase). To do this, click the “” icon to the left of a passage of text. This function also displays available dictionary information when the mouse cursor is moved over the Chinese text.
  • For long passages, it may be useful to jump straight to the part of the translation corresponding to a particular part of the Chinese text. To do this, search the text for a Chinese phrase, then click the “” icon as before.
  • Exercise:
    • Experiment with the text of the Zhuangzi.
    • Use this to quickly see how James Legge translated the line “每至於族,吾見其難為,怵然為戒,視為止,行為遲” in that same text.

Viewing commentary

Available for selected texts, including the Analects, Mengzi, Mozi, and Dao De Jing.

  • Click the “” icon to the left of a passage of text. Note that commentaries are also independent texts, so you can use the links in the displayed commentary to switch to reading the commentary instead of the uncommented text.
  • Exercise:
    • Experiment with the Analects, Mengzi, Mozi, or Dao De Jing.

Locating/inputting obscure and variant characters

  • Open the “Dictionary” section of the site.
  • Depending on the character, you may want to use:
    • Direct input, i.e. just type the character in
    • Component lookup – refer to the brief instructions on the main dictionary page.
    • Radical lookup – first select the radical, then look at the additional stroke count. You can increase the size of the displayed characters by clicking on the “n strokes” label.
  • Exercise:
    • 䊫, 𥼺, 𧤴, … – try locating these on ctext.org (without copying and pasting from this page!)
  • Hint: Where you do not have an easy method of inputting either component, you can instead input any other character containing that component, use its decomposition to locate the component, and then search for other characters which also contain that component.
  • Support for variant characters which do not exist in Unicode is being added to ctext.org. These can currently be looked up by components only.
    • Non-Unicode characters can be copied and pasted for use within ctext. When a non-Unicode character is copied, it becomes a “ctext:nnnn” identifier (e.g. ctext:1591). Pasting this into other software (e.g. Microsoft Word) will paste this identifier, not a character or image.
    • You can, however, copy the image of the character from ctext by right-clicking on the character and choosing “Copy image”, and paste this into a Word document. Please remember to cite ctext.org as the source of the image (e.g. by referencing the “ctext:nnnn” identifier, or providing its URL).
    • Examples: ctext:4543 ctext:8668 ctext:3000 ctext:335
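
If you paste text containing these identifiers into your own files, a short script can pick them out again. The following Python sketch assumes only the “ctext:nnnn” form described above (the prefix followed by digits); the function name `find_ctext_ids` is made up for this example.

```python
import re

# Matches identifiers of the form "ctext:" followed by one or more digits.
CTEXT_ID = re.compile(r"ctext:(\d+)")


def find_ctext_ids(text):
    """Return the numeric parts of all ctext:nnnn identifiers found in text."""
    return [int(number) for number in CTEXT_ID.findall(text)]


pasted = "Examples: ctext:4543 ctext:8668 ctext:3000 ctext:335"
print(find_ctext_ids(pasted))
```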

Editing a transcription

The easiest way to correct transcriptions for texts which are linked to scans is to use the “Quick edit” function. To do this:

  • Locate the page of the scan on which the transcription error occurs.
  • Click the “Quick edit” link.
  • An editable transcription of the content of that page will appear. In general, each line of the transcription should correspond to one column of the scanned text.
  • Carefully modify the transcription to agree with the scan, and click “Save changes” when done.
  • If spaces are necessary, be sure to use full-width Chinese spaces and not half-width English spaces.
  • Exercise:
    • Choose a transcription which has been created automatically with OCR, and make a correction to it.
  • “Versioning” – recording every change made to each text, and maintaining the option to revert to a previous state – is fundamental to the operation of any wiki system. After you have saved your edit:
    • Open the textual transcription that you edited by clicking the “View” link.
    • Scroll up to the top of that page and click “History” to display the list of recent revisions; your recent edit should be listed at the top.
    • Each row represents the state of the transcription after a particular edit was made. Two edits can be compared by selecting them with the two sets of radio buttons at the left of the table and clicking “Compare”; the default selections always compare the most recent edit with the state of the text prior to that edit. Click “Compare” to see a visualization of the edit you just made.
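
On the point about full-width spaces: if you prepare a correction in another program before pasting it into the “Quick edit” box, a small Python helper can convert any half-width spaces for you. This is a minimal sketch; the function name is hypothetical, and it assumes that the full-width space meant here is U+3000 (IDEOGRAPHIC SPACE).

```python
FULL_WIDTH_SPACE = "\u3000"  # IDEOGRAPHIC SPACE


def to_full_width_spaces(line):
    """Replace each half-width (ASCII) space with a full-width space."""
    return line.replace(" ", FULL_WIDTH_SPACE)


corrected = to_full_width_spaces("子曰 學而時習之")
print(repr(corrected))
```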

Install and use a plugin

Plugins allow customization of the site’s user interface to support additional functionality. Common use cases include downloading textual data, and connecting to third-party character dictionaries. In order to use a plugin, you must first install it into your account (you only need to do this once for each plugin). To install a plugin:

  • Open “About the site” > “Tools” > “Plugins”.
  • Locate the plugin you wish to add and click “Install”.
  • Click “Install” on the confirmation page which appears.

Once installed, when you open any supported object on ctext.org (e.g. a chapter of text for a “book” or “chapter” plugin, or a character in the dictionary for a “character” or “word” type plugin), a corresponding link will be displayed in a bar near the top of the screen.
Exercise:

  • Install the “Plain text” plugin and use it to export a chapter of a text.
  • Install the “Frequencies” plugin and use it to view character frequencies in a chapter of a text.
  • Install any plugin with the “character” or “word” type, look up a character in the dictionary, and use the plugin to access the external dictionary.
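
Once a chapter has been exported as plain text, character frequencies can also be computed locally. The Python sketch below shows one simple way to do so. It is not the implementation of the “Frequencies” plugin, and the function name, the list of ignored punctuation, and the sample sentence are all choices made for this example.

```python
from collections import Counter


def character_frequencies(text, ignore="，。、；：？！「」『』 \n\u3000"):
    """Count characters in text, skipping punctuation and whitespace in ignore."""
    return Counter(ch for ch in text if ch not in ignore)


sample = "學而時習之，不亦說乎？有朋自遠方來，不亦樂乎？"
frequencies = character_frequencies(sample)
print(frequencies.most_common(3))
```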

Advanced topics

The following sections introduce more advanced topics, which require additional effort and/or technical skills beyond the scope of this tutorial.

Creating a new plugin

Plugins are programmatic descriptions in XML of a method of connecting the ctext.org user interface to an external resource (typically another website). New plugins can be created directly from within your ctext.org account by modifying the code for an existing plugin. To see what your installed plugins look like, click on “Settings” in the left hand pane, then click the “editing your XML plugin file” link.

You can create a new plugin by duplicating the code between the “<Plugin>…</Plugin>” tags and editing the duplicate. You should remove the “<Update>” tag from your new plugin, as this may otherwise cause it to be overwritten in future. You can also create a standalone XML file (refer to the many existing examples), host it on your own server, and then install it into your ctext.org account.
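
The duplicate-and-edit step can also be done with a script if you prefer. The Python sketch below is only an illustration. Of the element names used, only “Plugin” and “Update” come from the description above; the root element, the “Name” element, and the URL are placeholders and not the real plugin schema. Real plugin files may also declare an XML namespace, in which case the element names passed to `findall` would need to be namespace-qualified.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical, simplified plugin file: only <Plugin> and <Update> come from
# the tutorial text; the root element and <Name> are placeholders.
SAMPLE = """<Plugins>
  <Plugin>
    <Name>Example</Name>
    <Update>https://example.org/plugin.xml</Update>
  </Plugin>
</Plugins>"""


def duplicate_plugin(xml_text, index=0):
    """Append a copy of the index-th <Plugin>, with any <Update> child removed."""
    root = ET.fromstring(xml_text)
    original = root.findall("Plugin")[index]
    duplicate = copy.deepcopy(original)
    # Drop <Update> so the copy is not overwritten by a future update.
    for update in duplicate.findall("Update"):
        duplicate.remove(update)
    root.append(duplicate)
    return ET.tostring(root, encoding="unicode")


result = duplicate_plugin(SAMPLE)
print(result)
```

The duplicated element would then be edited by hand to describe the new plugin.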

If you encounter issues with your new plugin or the code is not accepted by the ctext interface, you can use the W3C Markup Validator to confirm that your plugin file is valid. A valid plugin file should give a green “This document was successfully checked as CTPPlugin!” result which looks like this.

Programmatic access

Textual material from the site can also be accessed directly from a programming language such as Python to use for text mining and digital humanities purposes. This requires some additional setup and investment of time to achieve, particularly if you have not programmed before, but step-by-step instructions are available online.

Programmatic access is facilitated by the ctext.org Application Programming Interface (API), and can be achieved from any language or environment capable of sending HTTP requests. Python is particularly recommended, because a ctext Python module exists which can be used to access the API with very little work. In addition to the general documentation for the API, documentation for all API functions is available and includes working live examples of each.
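
As a minimal sketch of what “sending HTTP requests” looks like in practice, the following Python code builds a request URL and parses a JSON response using only the standard library. Several details are assumptions to be checked against the API documentation: the base URL `https://api.ctext.org`, the function name `gettext` with its `urn` parameter, and the `fulltext` field in the response. The sample response body is invented so that the example runs without network access.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://api.ctext.org"  # assumed base URL; check the API documentation


def build_gettext_url(urn):
    """Build a request URL for the (assumed) gettext function and a text URN."""
    return f"{API_BASE}/gettext?{urlencode({'urn': urn})}"


def paragraphs_from_response(body):
    """Extract paragraphs from a JSON response body, assuming a 'fulltext' list."""
    data = json.loads(body)
    return data.get("fulltext", [])


url = build_gettext_url("ctp:analects/xue-er")
print(url)

# Hypothetical response body, used here instead of a live network request.
sample_body = '{"title": "學而", "fulltext": ["子曰：「學而時習之，不亦說乎？」"]}'
print(paragraphs_from_response(sample_body))
```

Actually sending the request (for example with `urllib.request.urlopen`) is left out so that the sketch runs offline. In practice the ctext Python module mentioned above is likely to be the more convenient route.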

Creative Commons License
Posted in Chinese, Digital Humanities

Towards a sustainable digital infrastructure for historical Chinese texts

Paper to be presented at the Open Conference on Digital Infrastructures for Global Philology, Leipzig University, 21 February 2017.
[Download slides]

This paper describes the current status and initial results of an ongoing project to create a scalable and sustainable infrastructure for the transcription, curation, use and distribution of pre-modern Chinese textual material. The material created is accessed through a purpose-built web interface (http://ctext.org) by around 25,000 individual users every day; this interface currently ranks as one of the 3000 most frequently visited websites on the Internet in both Taiwan and Hong Kong. While also offering full-text database functionality, from an infrastructural perspective the project is composed of three main components, each designed to be usable individually or in combination to fulfil a diverse set of use cases.

The first of these is a practical Optical Character Recognition (OCR) procedure for historical Chinese documents. OCR for pre-modern Chinese is challenging for a number of technical reasons, including the large numbers of distinct characters involved, but the pre-modern domain also offers potential advantages, including opportunities for taking advantage of features relatively constant across such pre-modern works, such as standardized layouts and writing conventions, and the possibility of leveraging text reuse to improve OCR performance. Given the large volume of extant material together with the rate at which libraries and other scanning centers are scanning pre-modern Chinese works, OCR represents the only practical means by which to transcribe many of these texts in the short to medium term – particularly when considering the “long tail” of less popular and less mainstream material. So far the procedure described has been applied to over 25 million pages of historical texts, including most recently 5 million pages from the Harvard Yenching Library collection, and the results released online.

The second component is an open, online crowdsourcing interface allowing the ongoing correction of such textual transcriptions. Transcriptions created using OCR are imported into this system, which immediately enables their use for full-text image search, and at the same time encourages users to correct mistakes in OCR output as they encounter them. Submitted corrections are applied immediately, and logged in a version control system providing appropriate visualizations of changes made; the system currently receives hundreds of user-generated corrections of this type each day. Users are able to correct errors introduced by the OCR procedure, as well as supplement these results with additional data such as punctuation (typically not recorded in the scanned texts) and markup describing logical structure. Metadata curation is also integrated into the crowdsourcing system.

The third component is an open Application Programming Interface (API) allowing access to full-text data and metadata created and curated through OCR and crowdsourcing as well as by other means. This provides access to machine-readable data about texts and their contents in a flexible way. In order to encourage use of the API to allow better integration with other online projects, in addition to the API itself an open plugin system has been developed, allowing users to extend the user interface of the system in flexible ways and link it to external projects without requiring central coordination or approval, as well as to freely share these extensions with other users. Both the API and plugin system are already in active use, enabling concrete collaboration and decentralized integration with projects based at Leiden University, Academia Sinica, and many others. As the API also allows machine-readable access to what is now the world’s largest database of pre-modern Chinese writing, it also has obvious applications in the fields of text mining and digital humanities. In order to further facilitate such use of the data in research and teaching, a Python library is also available; the API together with this library are currently used to facilitate digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.

Posted in Chinese, Digital Humanities, Talks and conference papers

Deep Dive into Digital and Data Methods for Chinese Studies

I’m really looking forward to taking part in the University of Michigan’s “Deep Dive into Digital and Data Methods for Chinese Studies” series later this month, where I’ll be leading the following sessions:

Text Reuse in Early Chinese Texts: A Digital Approach (Lecture)
Monday, Feb. 13, 2017 12:00 pm – 1:00 pm
Location: Clark Library Instructional Space (240 Hatcher Graduate Library)

Chinese Text Project: Historical Texts in a Digital Age (Workshop)
Monday, Feb. 13, 2017 3:30 pm – 5:00 pm
Location: Hatcher Gallery Lab (100 Hatcher Library)

Practical Large-Scale OCR of Historical Chinese Documents (Presentation and Roundtable Discussion)
Tuesday, Feb. 14, 2017 2:30-3:30 pm
Location: Asia Library Conference Room (421A Hatcher Library North)

Posted in Chinese, Digital Humanities, Talks and conference papers

Classical Chinese Literature in a Digital Age

I’m very excited to be visiting Tsukuba University in Japan next week, where I will be giving a talk titled “Classical Chinese Literature in a Digital Age” (December 15), and also presenting a paper on “Optical Character Recognition for pre-modern Chinese Texts” at a Digital Humanities workshop (December 16).

Posted in Chinese, Digital Humanities, Talks and conference papers

Towards a dynamic, scalable digital library of pre-modern Chinese

Paper to be presented at the 7th International Conference of Digital Archives and Digital Humanities, December 2016, National Taiwan University

This paper contrasts two radically different approaches to full-text digital library design and implementation: firstly, the “static database approach”, in which materials are created, edited, and manually reviewed before being added to a generally static database system; secondly, dynamic approaches in which incompletely reviewed materials are imported into a dynamic system providing similar functionality, but within which significant further editing is intended to take place. To illustrate the technical challenges, benefits, and practical consequences of these two design approaches as reflected in a large-scale digital system, specific examples are drawn from the Chinese Text Project digital library, which began as a primarily static database system and has over time evolved into a primarily dynamic platform. This change has been motivated in particular by a desire to achieve a scalable, sustainable platform for the curation of textual data and metadata, to which new material can be easily added as well as improved over time, while requiring minimal administrative overhead. This paper argues that while there are technical challenges to a dynamic approach, the increase in scalability that dynamic approaches offer can have significant advantages, including potential access to a “long tail” of data which might otherwise in practice be overlooked.

Posted in Chinese, Digital Humanities, Talks and conference papers