Digital Research Tools for Pre-modern Chinese Texts

Interactive workshop 9:00am-12:00pm, November 18, 2017, held in B129, Northwest Building, 52 Oxford St., Cambridge, MA 02138
RSVP at https://goo.gl/ac1K96

Digital methods offer increasingly powerful tools to aid in the study and analysis of historical written works, both through exploratory techniques in which previously unnoticed trends and relationships are highlighted, and through computer-assisted assembly of data to confirm or refute particular hypotheses. Applying such techniques in practice often requires first overcoming technical challenges – in particular, obtaining access to machine-readable editions of the desired texts, as well as to tools capable of performing such analyses.

This hands-on practical workshop introduces approaches intended to reduce the technical barriers to experimenting with these techniques and evaluating their utility for particular scholarly uses. The first part of this workshop introduces the Chinese Text Project, which has grown to become the largest full-text digital library of pre-modern Chinese. While on the one hand the website offers a simple means to access commonly used functions such as full-text search for a wide range of pre-modern Chinese sources, at the same time it also provides more sophisticated mechanisms allowing for more open-ended use of its contents, as well as the ability to contribute directly to the digitization of entirely new materials.

The second part of the workshop introduces tools for performing digital textual analysis of Chinese-language materials, which may be obtained from the Chinese Text Project or elsewhere. These include identification of text reuse within and between written materials, sophisticated pattern search using regular expressions, and visualization of the results of these and other types of analysis.


Unsupervised identification of text reuse in early Chinese literature

This paper will appear in Digital Scholarship in the Humanities (currently available in “Advance articles”).

Text reuse in early Chinese transmitted texts is extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving the work of multiple authors and editors. In this study, a fully automated method of identifying and representing complex text reuse patterns is presented, and the results evaluated by comparison to a manually compiled reference work. The resultant data is integrated into a widely used and publicly available online database system with browse, search, and visualization functionality. These same results are then aggregated to create a model of text reuse relationships at a corpus level, revealing patterns of systematic reuse among groups of texts. Lastly, the large number of reuse instances identified make possible the analysis of frequently observed string substitutions, which are observed to be strongly indicative of partial synonymy between strings.

Download the full paper – this link should give you access to the PDF even if you are not accessing it from a subscribing institution.


Linking, sharing, merging: sustainable digital infrastructure for complex biographical data

Paper to be presented at Biographical Data in a Digital World, 6 November 2017, Linz.

In modeling complex humanities data, projects working within a particular domain often have overlapping but distinct priorities and goals. One common result of this is that separate systems contain overlapping data: some of the objects modeled are common to more than one system, though how they are represented may be very different in each.

While within a particular domain it can be desirable for projects to standardize their data structures and formats in order to allow for more efficient linking and exchange of data between projects, for complex datasets this can be an ambitious task in itself. An alternative approach is to identify a core set of data which it would be most beneficial to be able to query in aggregate across systems, and provide mechanisms for sharing and maintaining this data as a means through which to link between projects.

For biographical data, the clearest example of this is information about the same individual appearing in multiple systems. Focusing on this particular case, this talk presents one approach to creating – and sustaining with minimal maintenance – a means of establishing machine-actionable links between datasets maintained and developed by different groups, while also promoting more ambitious data sharing.

This model consists of three components: 1) schema maintainers, who define and publish a format for sharing data; 2) data providers, who make data available according to a published schema; and 3) client systems, which aggregate the data from one or more data providers adhering to a common schema. This can be used to implement a sustainable union catalog of the data, in which the catalog provides a means to directly locate information in any of the connected systems, but is not itself responsible for maintenance of data. The model is designed to be general-purpose and to extend naturally to similar use cases.
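
As a concrete illustration of this model, the sketch below mocks up all three components in a few lines of Python – the identifiers, field names, URLs, and record format are purely hypothetical illustrations, not a published schema:

    # Minimal sketch of the three-component model (hypothetical schema).
    # 1) The schema maintainer publishes a record format: one object per
    #    person, linking a shared identifier to records in individual systems.
    # 2) Each data provider makes available records following that schema.
    provider_a = [{"person_id": "p:0001", "names": ["王安石"],
                   "records": [{"system": "database-a", "url": "https://example.org/a/123"}]}]
    provider_b = [{"person_id": "p:0001", "names": ["王安石", "王介甫"],
                   "records": [{"system": "database-b", "url": "https://example.org/b/456"}]}]

    # 3) A client system aggregates provider data into a union catalog keyed
    #    by the shared identifier; the catalog locates records in the
    #    connected systems but does not itself own or maintain them.
    def build_union_catalog(*providers):
        catalog = {}
        for records in providers:
            for record in records:
                entry = catalog.setdefault(record["person_id"], {"names": [], "records": []})
                entry["names"] += [n for n in record["names"] if n not in entry["names"]]
                entry["records"] += record["records"]
        return catalog

    print(build_union_catalog(provider_a, provider_b))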


Pusan National University

I’m very excited to be visiting the Department of Korean Literature in Classical Chinese at Pusan National University next week to give two talks – abstracts follow:

Old Meets New: Digital Opportunities in the Humanities
28th September 2017, 10am-12pm

The application of digital methods has brought enormous benefits to many fields of study, not only by offering more efficient ways of conducting research and teaching along traditional lines, but also by opening up entirely new directions and research questions which would have been impractical or even impossible to pursue prior to the digital age. This digital revolution offers new and exciting opportunities for many humanities subjects – including Chinese studies. Through use of computer software, digital techniques make possible large-scale studies of volumes of material which would once have been entirely impractical to study in depth due to the time and manual effort required to assemble and process the source materials. Even more excitingly, they offer the opportunity to apply sophisticated statistical techniques to yield new insight into, and understanding of, important humanities questions. In this talk I introduce examples of how and why computational methods are making possible new types of studies in the humanities in general, and in the study of Chinese literature and history in particular.

Computational Approaches to Chinese Literature
28th September 2017, 4-6pm

Digital methods and the emerging field of digital humanities are revolutionizing the study of literature and history. In the first part of this talk, I present the results of a computational study of parallel passages in the pre-Qin and Han corpus and use it to demonstrate how digital methods can provide new insights in the field of pre-modern Chinese literature. This study begins by implementing an automated procedure for identifying pairs of parallel passages, which is demonstrated to be more effective than prior work by human experts. The procedure is used to identify hundreds of thousands of parallels within the classical Chinese corpus, and the resulting data aggregated in order to study broader trends. The results of this quantitative study not only enable far more precise evaluation of claims made by traditional scholarship, but also the investigation of patterns of text reuse at a corpus level.

The second part of the talk introduces the Chinese Text Project digital library and associated tools for textual analysis of Chinese literature. Taken together, these provide a uniquely flexible platform for digital textual analysis of pre-modern Chinese writing, which allows for rapid experimentation with a range of digital techniques without requiring specialized technical or programming skills. Methods introduced include automated identification of text reuse, pattern matching using regular expressions, and network visualization.


JADH Poster: DH research and teaching with digital library APIs

At this year’s Japanese Association for Digital Humanities conference, as well as giving a keynote on digital infrastructure, I also presented this poster on the specific example of full-text digital library APIs being used in ctext.org and for teaching at Harvard EALC.

Abstract

As digital libraries continue to grow in size and scope, their contents present ever increasing opportunities for use in data mining as well as digital humanities research and teaching. At the same time, the contents of the largest such libraries tend towards being dynamic rather than static collections of information, changing over time as new materials are added and existing materials augmented in various ways. Application Programming Interfaces (APIs) provide efficient mechanisms by which to access materials from digital libraries for data mining and digital humanities use, as well as by which to enable the distributed development of related tools. Here I present a working example of an API developed for the Chinese Text Project digital library being used to facilitate digital humanities research and teaching, while also enabling distributed development of related tools without requiring centralized administration or coordination.

Firstly, for data-mining, digital humanities teaching and research use, the API facilitates direct access to textual data and metadata in machine-readable format. In the implementation described, the API itself consists of a set of documented HTTP endpoints returning structured data in JSON format. Textual objects are identified and requested by means of stable identifiers, which can be obtained programmatically through the API itself, as well as manually through the digital library’s existing public user interface. To further facilitate use of the API by end users, native modules for several programming environments (currently including Python and JavaScript) are also provided, wrapping API calls in methods adapted to the specific environment. Though not required in order to make use of the API, these native modules greatly simplify the most common use cases, further abstract details of implementation, and make possible the creation of programs performing sophisticated operations on arbitrary textual objects using a few lines of easily understandable code. This has obvious applications in digital humanities teaching, where simple and efficient access to data in consistent formats is of considerable importance when covering complex subjects within a limited amount of classroom or lab time; it also facilitates research use, in which the ability to rapidly experiment with different materials, as well as to prototype and reuse code with minimal effort, is of practical utility.
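
As a minimal sketch (assuming a “gettext” endpoint taking a “urn” parameter and returning a JSON object containing a “fulltext” field, per the public API documentation), such a request might look like this using only the Python standard library:

    import json
    from urllib.request import urlopen

    def fetch_fulltext(urn):
        """Request a textual object by its stable URN identifier and return
        any paragraphs of full text included in the JSON response."""
        with urlopen("https://api.ctext.org/gettext?urn=" + urn) as response:
            data = json.load(response)
        return data.get("fulltext", [])

    # Print the first two paragraphs of the requested chapter.
    print(fetch_fulltext("ctp:analects/xue-er")[:2])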

Secondly, along with the API itself, the provision of a plugin mechanism allowing the creation of user-definable extensions to the library’s online user interface makes possible augmentation of core library functionality through the use of external tools in ways that are transparent and intuitive to end users while also not requiring centralized coordination or approval to create or modify. Plugins consist of user-defined, sharable XML resource descriptions which can be installed into individual user accounts; the user interface uses information contained in these descriptions – such as link schemas – to send appropriate data such as textual object references to specified external resources, which can then request full-text data, metadata, and other relevant content via API and perform task-specific processing on the requested data. Any user can create a new plugin, share it with others, and take responsibility for future updates to their plugin code, without requiring central approval or coordination.

This technical framework enables a distributed web-based development model in which external projects can be loosely integrated with the digital library and its user interface, from an end user perspective being well integrated with the library, while from a technical standpoint being developed and maintained entirely independently. Currently available applications using this approach include simple plugins for basic functionality such as full-text export, the “Text Tools” plugin for textual analysis, and the “MARKUS” named entity markup interface for historical Chinese texts developed by Brent Ho and Hilde De Weerdt, as well as a large number of external online dictionaries. The “Text Tools” plugin provides a range of common text processing services and visualization methods, such as n-gram statistics, similarity comparisons of textual materials based on n-gram shingling, and regular expression search and replace, along with network graph, word cloud, and chart visualizations; “MARKUS” uses external databases of Chinese named entities together with a custom interface to mark-up texts for further analysis. Because of the standardization of format imposed by the API layer, such plugins have access not only to structured metadata about texts and editions, but also to structural information about the text itself, such as data on divisions of texts into individual chapters and paragraphs. For example, in the case of the “Text Tools” plugin this information can be used by the user to aggregate regular expression results and perform similarity comparisons by text, by chapter or by paragraph, in the latter two cases also making possible visualization of results using the integrated network graphing tool. As these tasks are facilitated by API, tools such as these can be developed and maintained without requiring knowledge of or access to the digital library’s code base or internal data structures; from an end user perspective, these plugins do not require technical knowledge to use, and can be accessed as direct extensions to the primary user interface. This distributed model of development has the potential to greatly expand the range of available features and use cases of this and other digital libraries, by providing a practical separation of concerns of data and metadata creation and curation on the one hand, and text mining, markup, visualization, and other tasks on the other, while simultaneously allowing this technical division to remain largely transparent to a user of these separately maintained and developed tools and platforms.


Collaboration at scale: emerging infrastructures for digital scholarship

Keynote lecture, Japanese Association for Digital Humanities (JADH 2017), Kyoto

Abstract

Modern technological society is possible only as a result of collaborations constantly taking place between countless individuals and groups working on tasks which at first glance may seem independent from one another yet are ultimately connected through complex interdependencies. Just as technological progress is not merely a story of ever more sophisticated technologies, but also of the evolution of increasingly efficient structures facilitating their development, so too scholarship moves forward not just by the creation of ever more nuanced ideas and theories, but also by increasingly powerful means of identifying, exchanging, and building upon these ideas.

The digital medium presents revolutionary opportunities for facilitating such tasks in humanities scholarship. Most obviously, it offers the ability to perform certain types of analyses on scales larger than would ever have been practical without use of computational methods – for example the examination of trends in word usage across millions of books, or visualizations of the social interactions of tens of thousands of historical individuals. But it also presents opportunities for vastly more scalable methods of collaboration between individuals and groups working on distinct yet related projects. Simple examples are readily available: computer scientists develop and publish code through open source platforms, companies further adapt it for use in commercial systems, and humanities scholars apply it to their own research; libraries digitize and share historical works from their collections, which are transcribed by volunteers, searched and read by researchers, and cited in scholarly works.

Much of the infrastructure already in use in digital scholarship is infrastructure developed for more general-purpose use – a natural and desirable development given the obvious economies of scale which result from this. However, as the application of digital methods in humanities scholarship becomes increasingly mainstream, as digitized objects of study become more numerous, and as related digital techniques become more specialized, the value of infrastructure designed specifically to support scholarship in particular fields of study becomes increasingly apparent. This paper will examine the types of humanities infrastructure projects which are emerging, and the potential they have to facilitate scalable collaboration within and beyond distributed scholarly communities.


Digital humanities and the digital library

Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”

Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.


数字人文与数字图书馆:中国历代文献的文字识别、群众外包及文本挖掘

本次演讲介绍中国哲学书电子化计划中的主要技术。中国哲学书电子化计划是全球最大规模的前现代中文传世文献电子图书馆之一,目前,每日有25,000多用户使用其公开操作界面。主要原创技术可归类为三种:(一)前现代中文资料的文字识别技术(OCR)、(二)借用大量用户劳力的群众外包界面、(三)既实现与其它线上工具之间的整合、又提供文本挖掘途径的开放式应用程式界面(API)。

第一个原创技术是专门为中国前现代文献设计的文字识别技术。此技术利用前现代文献常见的写作、印刷特征以及已数字化的大量文献来实现具有高精确性以及扩充性的文字识别系统。该系统已处理2,500多万页资料,其结果已在网络上公开。

第二,通过独特的群众外包界面,世界各地的用户可纠正文字识别错误,补充后设资料,从而能够及时参与数字化过程并积极协助内容的扩展。全球用户每日提供上百次的校勘,系统将此及时储存到具有版本控制功能的数据库。

第三,系统的应用程式界面可用于文本挖掘,亦可用于扩充一般使用界面的功能,从而有效地借用日益增长的资料库文本内容来达到数字人文研究和教学的目的。通过此应用程式界面,为Python等程式语言所开发的专门组件可用于数字人文教学;JavaScript组件便于他人开发易用的线上工具,使他人所开发的应用工具能够直接读取和操作电子图书馆中的各种内容。

In this talk I present an overview of key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, the public user interface of which is currently used by over 25,000 people every day. Key technologies used fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts, a practical and successful crowdsourcing interface taking advantage of a large base of users, and an open Application Programming Interface allowing both integration with other online tools and projects as well as open-ended use for text mining purposes.

Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage aspects of common writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. These techniques have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results made freely available online.

Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add additional information and metadata, allowing users around the world to meaningfully and immediately contribute to the project and to actively participate in the curation of its contents. Hundreds of corrections are received from users every day and immediately applied to the version-controlled texts.

Thirdly, the creation of a specialized API for text mining use and extension of the primary user interface enables efficient access to the ever-growing data set for use in digital humanities research and teaching. Creation of specialized modules for programming languages such as Python allows for intuitive use in digital humanities teaching contexts, while simple access via JavaScript enables the creation of easy-to-use online tools which can directly access and operate on textual materials stored in the library.


Crowdsourcing a digital library of pre-modern Chinese

Seminar in the Digital Classicist London 2017 series at the Institute of Classical Studies, University of London, 9 June 2017.

Traditional digital libraries, including those in the field of pre-modern Chinese, have typically followed top-down, centralized, and static models of content creation and curation. This is a natural and well-grounded strategy for database design and implementation, with strong roots in traditional academic publishing models, and offering clear technical advantages over alternative approaches. This strategy, however, is unable to adequately meet the challenges of increasingly large-scale digitization and the resulting rapid growth in available corpus size.

In this talk I present a working example of a dynamic alternative to the conventional static model. This alternative leverages a large, distributed community of users, many of whom may not be affiliated with mainstream academia, to curate material in a way that is distributed, scalable, and does not rely upon centralized editing. In the particular case presented, initial transcriptions of scanned pre-modern works are created automatically using specially developed OCR techniques and immediately published in an online open access digital library platform called the Chinese Text Project. The online platform uses this data to implement full-text search, image search, full-text export and other features, while simultaneously facilitating correction of initial OCR results by a geographically distributed group of pseudonymous volunteer users. The online platform described is currently used by around 25,000 individual users each day. User-submitted corrections are immediately applied to the publicly available version-controlled transcriptions without prior review, but are easily validated visually by other users using simple semi-automated mechanisms. This approach allows immediate access to a “long tail” of less popular and less mainstream material which would otherwise likely be overlooked for inclusion in this type of full-text database system. To date the procedure described has been applied to over 25 million pages of historical texts, including 5 million pages from the Harvard-Yenching Library collection, and the complete results published online.

In addition to the online platform, the development of an open plugin system and API – allowing customization of the user interface with user-defined extensions, and immediate machine-readable access to full-text data and metadata – has made possible many further use cases. These include efficient, distributed collaboration and integration with other online web platforms, including projects based at Leiden University, Academia Sinica, and elsewhere, as well as use in data mining, digital humanities research and teaching, and as a self-service tool for projects requiring the creation of proofread transcriptions of particular early texts. A Python library has also been created to further encourage use of the API; in the final part of the talk I explain how the API, together with this Python library, is currently being used to facilitate – and greatly simplify – digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.


Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR

Published in the Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), 2017.

Abstract

Many mainstream OCR techniques involve training a character recognition model using labeled exemplary images of each individual character to be recognized. For modern printed writing, such data can easily be created by automated methods, such as rasterizing appropriate font data to produce clean example images. For historical OCR in printing and writing styles distinct from those embodied in modern fonts, appropriate character images must instead be extracted from actual historical documents to achieve good recognition accuracy. For languages with small character sets it may be feasible to perform this process manually, but for languages with many thousands of characters, such as Chinese, manually collecting this data is often not practical.

This paper presents an unsupervised method to extract this data from two unaligned, unstructured, and noisy inputs: firstly, a corpus of transcribed documents; secondly, a corpus of scanned documents in the desired printing or writing style, some fraction of which are editions of texts included in the transcription corpus. The unsupervised procedure described is demonstrated to be capable of using this data, together with an OCR engine trained only on modern printed Chinese, to retrain the same engine to recognize pre-modern Chinese texts with a 43% reduction in overall error rate.

[Full paper]


Text Tools for ctext.org

This tutorial introduces some of the main functionality of the “Text Tools” plugin for the Chinese Text Project database and digital library along with suggested example tasks and use cases.

[Online version of this tutorial: https://dsturgeon.net/texttools]

Initial setup

  • If you haven’t used the Chinese Text Project before, please refer to the tutorial “Practical introduction to ctext.org” for details of how to create a ctext.org account and install a plugin.
  • Make sure you are logged in to your ctext.org account.
  • If you have an API key, save it into your ctext.org account using the settings page. Alternatively, if your institution subscribes to ctext.org and you are not using a computer on your university’s local network, follow your university’s instructions to connect to its VPN.
  • Install the “Text Tools” plugin (installation link) – you only need to do this once.
  • Once these steps have been completed, when you open a text or chapter of text on ctext.org, you should see a link to the Text Tools plugin.

Getting started

The Text Tools program has a number of different pages (titled “N-gram”, “Regex”, etc.) which can be switched between using the links at the top of the page. Each page corresponds to one of the tools described below, except for the Help page, which explains the basic usage and options for each of the tools. These include tools for textual analysis as well as simple data visualization.

The textual analysis tools are designed to operate on textual data which can either be read in directly from ctext.org via API, or copied into the tool from elsewhere. If you open the tool by using the ctext.org plugin, that text will be automatically loaded and displayed. To load additional texts from ctext, copy the URN for the text (or chapter) into the box labeled “Fetch text by URN” in the Text Tools window, and click “Fetch”. When the text has loaded, its contents will be displayed along with its title. To add more texts, click “Save/add another text”, then repeat the procedure. The list of currently selected texts is displayed at the top of the window.

N-grams

“N-grams” are sequences of n consecutive textual items, where n is some fixed integer (e.g. n=1, n=3, etc.). The “textual items” are usually either terms (words) or characters; for Chinese in particular, characters are frequently used rather than words because of the difficulty of automatically segmenting Chinese text into a sequence of separate words. For instance, the sentence “學而時習之不亦說乎” contains the following character 3-grams (i.e. sequences of exactly three consecutive characters): “學而時”, “而時習”, “時習之”, “習之不”, “之不亦”, “不亦說”, “亦說乎”.
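
The computation itself takes only a few lines of code; a minimal sketch in Python reproducing the example above:

    def char_ngrams(text, n):
        """Return all character n-grams of text, in order of occurrence."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("學而時習之不亦說乎", 3))
    # ['學而時', '而時習', '時習之', '習之不', '之不亦', '不亦說', '亦說乎']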

The Text Tools “N-gram” function can be used to give a simple overview of various types of word usage in Chinese texts by means of character n-grams. The simplest cases of n-grams are 1-grams, which are simply character occurrence counts or frequencies.

Exercise:
  • Try computing 1-grams for two or three texts from ctext – you will need to set “Value of n” to 1 to do this. To better visualize the trends, use the “Chart” link to plot a bar chart of the raw data. Try this with and without normalization.
  • Repeat with 2- and 3-grams.
  • If you chose texts which ought to be broadly comparable in length, try repeating with two texts of vastly different lengths and/or styles (e.g. 道德經 and 紅樓夢) with and without normalization to demonstrate how this alters the results.

Word clouds are another type of visualization that can be made with this type of data, in which labels are drawn at sizes proportional to their frequency of occurrence (or, more usually, to the log of their frequency). Typically word clouds are created from a single text or merged corpus, using either characters or words; however, the same principles extend naturally to n-grams (and regular expressions) generally, as well as to multiple texts. In Text Tools, visualizing data for multiple texts causes the data for each distinct text to be displayed in a different color. Similar comments apply regarding normalization: if counts for different texts are not normalized according to length, longer texts will naturally tend to have larger labels.
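
The effect of the log scale is easy to see numerically. In the following sketch (with invented counts and arbitrary scaling constants), a label occurring 100 times as often as another is drawn only 3 times larger under log scaling, rather than 100 times larger:

    import math

    # Hypothetical frequency counts for three labels (illustrative only).
    counts = {"之": 1000, "君子": 100, "小人": 10}
    for label, count in counts.items():
        linear_size = count                 # size proportional to raw frequency
        log_size = 10 * math.log10(count)   # size proportional to log frequency
        print(label, count, linear_size, log_size)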

Exercise:
  • Create word clouds for a single text, and for two or more texts. Experiment with the “Use log scale” setting in the Word cloud tab – it should quickly become clear why a log scale is usually used for word clouds.

Textual similarity

The Similarity tool uses n-gram shingling to identify and visualize text reuse relationships. To use it, first load one or more texts, select any desired options, and click “Run”.

This tool identifies n-grams shared between parts of the specified texts: rather than reporting all n-grams (as the N-gram tool does), it reports only those n-grams which are repeated in more than one place, and calculates the total number of shared n-grams between each pair of chapters. Thus, unlike with the N-gram tool (when its “minimum count” option is set to 1), larger values of n result in fewer results being reported: shorter n-grams are more likely to occur in multiple places, while longer ones are less common, as well as more strongly indicative of a text reuse relationship existing between the items being compared.

There are two tabs within the output for the similarity tool: the “Matched text” tab shows the n-grams which matched, with brighter shades of red corresponding to greater numbers of overlapping n-grams; the “Chapter summary” tab aggregates the counts of matched n-grams between all pairs of chapters.
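
The core idea of n-gram shingling can be sketched in a few lines of Python: compute the set of n-grams occurring in each passage, then intersect the sets. The example passages below are from the opening of the Analects, with punctuation already removed (which a full implementation would handle explicitly):

    def char_ngrams(text, n):
        """Return the set of distinct character n-grams of text."""
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def shared_ngrams(a, b, n=5):
        """Return the n-grams occurring in both passages."""
        return char_ngrams(a, n) & char_ngrams(b, n)

    a = "學而時習之不亦說乎有朋自遠方來不亦樂乎"
    b = "有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
    matches = shared_ngrams(a, b)
    print(len(matches), sorted(matches))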

Exercise:
  • Run the similarity tool on the Analects with n=5.
  • Experiment with the “Constraint” function by clicking on chapter titles to limit the display to passages having parallels with the specified chapter or pair of chapters.
  • Select a few of the matched n-grams by clicking on them; this will result in a different type of constraint showing where exactly that n-gram was matched.
  • Text reuse can be visualized as a weighted network graph. You can do this for your n-gram similarity results by clicking the “Create graph” link in the “Chapter summary” tab, then clicking “Draw”.
  • Which chapters of the Analects have the strongest text reuse relationship according to this metric? You can probably see this straight away from the graph, however you can also check this numerically by returning to the Chapter summary tab, and sorting the table by similarity – clicking on the titles of columns in any Text Tools table sorts it by that column (click a second time to toggle sort order).
  • Returning to the graph (you can click the “Network” link at the top of the page to switch pages), the edges of the graph have a concrete meaning defined in terms of identified similarities. Double-clicking on an edge will reopen the Similarity tool, with the specific similarities underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges.
  • Experiment with increasing and decreasing the value of n – how does this affect the results?
  • By default, the graph contains edges representing every similarity identified. Particularly for smaller values of n, some of these relationships will not be significant, and this may result in edges being drawn between almost all pairs of nodes in the graph, complicating the picture and obscuring genuine patterns. Experiment with setting a threshold (e.g. 0.001) for the “Skip edges with weight less than” setting – this simplifies the graph by removing those edges with relatively small amounts of reuse. Compare this with the results of increasing the value of n in the similarity tool, which also decreases the number of edges as more trivial similarities are excluded.
  • The Similarity tool also works with multiple texts; if multiple texts are loaded and a graph is created, different colors will be used to distinguish between chapters of different texts. Try this with the Xunzi and the Zhuangzi, two very dissimilar texts which nonetheless do have reuse relationships with one another (this may take a few seconds to run – the similarity tool takes longer for larger amounts of text).

Regular expressions

A regular expression (often shortened to “regex”) is a pattern which can be searched for in a body of text. In the simplest case, a regular expression is simply a string of characters to search for; however by supplementing this simple idea with specially designed syntax, it is possible to express much more complex ways of searching for data.

The regex tool makes it possible to search within one or more texts for one or more regular expressions, listing matched text as well as aggregating counts of results per-text, per-chapter, or per-paragraph.

Exercise:
  • The simplest type of regular expression is simply a character string search – i.e. a list of characters in order which will match (only) that precise sequence of characters – one type of full-text search. Try searching the text of the Analects for something you would expect to appear in it (e.g. “君子”).
  • Examine the contents of the “Matched text” and “Summary” tabs.
  • Add a second search phrase (e.g. “小人”) to your search, and re-run the regex.
  • Re-run the same search again using the same two regular expressions, but changing “Group rows by” from the default “None” to “Paragraph”. When you do this, the “Summary” tab will show one row for every passage in the Analects. Try clicking on a numbered paragraph (these numbers are chosen automatically starting from the beginning of the text) – this will highlight the passage corresponding to that row.

Search results like these can be relational when grouped by a unit such as a paragraph or chapter: if two terms appear together in the same paragraph (or chapter), this can indicate some relationship between the two; if they repeatedly occur together in many paragraphs, this may indicate a stronger relationship between the two in that text. It is thus possible to use a network graph to visualize this information; you can do this in Text Tools by running regular expressions and setting “Group rows by” to “Paragraph”.
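
The underlying computation is straightforward to sketch in Python: determine which terms match in each paragraph, then count each co-occurring pair – pair counts become edge weights in the graph. The two example passages below are from the Analects:

    import re
    from collections import Counter
    from itertools import combinations

    terms = ["君子", "小人", "禮", "樂"]
    paragraphs = [
        "子曰:君子周而不比,小人比而不周。",
        "子曰:人而不仁,如禮何?人而不仁,如樂何?",
    ]
    edges = Counter()
    for para in paragraphs:
        present = [t for t in terms if re.search(t, para)]
        for pair in combinations(sorted(present), 2):
            edges[pair] += 1  # edge weight = number of co-occurring paragraphs
    print(edges)  # Counter({('君子', '小人'): 1, ('樂', '禮'): 1})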

Exercise:
  • Search for the terms 父母, 君子, 小人, 禮, and 樂 in the Analects, and construct a network graph based on their co-occurrence in the same paragraphs of text.
  • Double-clicking on an edge in this graph will reopen the Regex tool, with the specific matches underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges, to see what data they actually represent.
  • Using the same method but specifying a list of character names (寶玉, 黛玉, 寶釵, etc. – you can get a list of more names from Wikipedia), map out how character names co-occur in paragraphs of the Hongloumeng. Note: you will need to make sure that you choose names frequently used in the actual text (e.g. “賈寶玉” is only infrequently used; “寶玉” is far more common – and will also match cases of “賈寶玉”). This is one example of social network analysis.
  • When you set “Group rows by” to “None”, you can temporarily add constraints to the “Matched text” view to show only those paragraphs which matched a particular search string. You can set or remove a constraint by clicking on a matched string in the “Matched text” view; you can also click the text label of an item in the “Summary” view to set that item as the constraint, and so see at a glance which paragraphs contained that particular string. Re-run your search with the same terms but in “None” mode, and use this to quickly see which passages the least-frequently occurring name from your list appeared in.

[A word of caution: when performing this type of search, it is important to examine the matched text to confirm whether “too much” may be matched, as well as whether other things may be missed. In the Hongloumeng example above, for instance, although the vast majority of string matches for “寶玉” in the text do indeed refer to 賈寶玉, another character called “甄寶玉” appears later in the novel – occurrences of his name will also match a simple search for the string “寶玉”. In this particular example, such cases can be excluded by constructing a suitable regular expression – such as “(?<!甄)寶玉”, which uses a negative lookbehind to match the string “寶玉” only when it does not come immediately after a “甄”.]

So far we have only used the simplest type of regular expressions. Regular expressions also allow for the specification of more complex patterns to be matched in the same way as the simple string searches we have just done – for example, the ability to specify a search for a pattern like “以[something]為[something]”, which would match things like “以和為量”, “以生為本”, or “以我為隱”. In order to do this, regular expressions are created by building on any fixed characters we want to match with the addition of “special” characters that describe the patterns we are looking for.

Some of the most useful types of special syntax available in regular expressions are summarized in the following table:

. Matches any one character exactly once
[abcdef] Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef] Matches any one character other than a,b,c,d,e,f
(xyz) Matches xyz, and saves the result as a numbered group.
? After a character/group, makes that character/group optional (i.e. match zero or 1 times)
? After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
* After a character/group, makes that character/group match zero or more times
+ After a character/group, makes that character/group match one or more times
{2,5} After a character/group, makes that character/group match 2,3,4, or 5 times
{2,} After a character/group, makes that character/group match 2 or more times
{2} After a character/group, makes that character/group match exactly 2 times
\3 Matches whatever was matched into group number 3 (first group from left is numbered 1)

The syntax may seem complex, but it is quite easy to get started with. For instance, the first special syntax listed in the table above – a dot (“.”) – matches any one character. So the example above of “以[something]為[something]” can be expressed as the regular expression “以.為.”, read as “match the character ‘以’, followed by any one character, followed by ‘為’, followed by any one character”.
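
This pattern can also be tried directly in Python, whose re module shares this basic syntax; the example string below simply joins the three matches quoted above:

    import re

    text = "以和為量。以生為本。以我為隱。"
    print(re.findall("以.為.", text))  # ['以和為量', '以生為本', '以我為隱']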

Exercise:
  • Try the regex “以.為.” from the example above in the Zhuangzi, using “Group by None”.
  • In the results of this regex search, you will notice that some matches may not correspond to exactly the type of expression we are really looking for. For example, the above regex will also match “以汝為?”, because punctuation characters are also counted as “characters” when matching regular expressions. One way to exclude these matches from the results is to use a negative character class (which matches everything except a specified list of characters) in the regex instead of the “.” operator (which simply matches any character). A corresponding regex for this example is “以[^。?]為[^。?]” – try this and confirm that it excludes these cases.
  • Because there are many possible punctuation characters, within Text Tools you can also use the shorthand “\W” (upper-case) to stand for any commonly used Chinese punctuation character, and “\w” (lower-case) for any character other than commonly used Chinese punctuation. You should get the same result if you try the previous regex written instead as “以\w為\w”. (Although this is a common convention for English regexes, “\w” and “\W” work slightly differently in different regex implementations and many do not support this for Chinese).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Any four characters where the middle is “之不” – i.e. “視之不見”, “聽之不聞”, etc.

Repetition

Repetition can be accomplished using various repetition operators and modifiers listed in the table above.

  • We can ask that any part of our regular expression be repeated some number of times using the “{a,b}” operator. This modifies the immediately preceding item in the regex (e.g. a specification of a character, or a group), requiring it to be repeated at least a times and at most b times (or any number of times, if b is left blank). If we omit the comma and just write “{a}”, this means that the preceding item must be repeated exactly a times.
  • For example, “仁.{0,10}義” will match the character “仁”, followed by anything from 0 to 10 other characters, followed by the character “義” – it will therefore match things like “仁義”, “仁為之而無以為;上義”, “仁,託宿於義”, etc.
  • The same method works with groups, and requires that the pattern specified by the group (not its contents) be repeated the specified number of times. So for instance “(人.){2,}” will match “人來人往”, “人前人後”, and also “人做人配人疼”.
  • The “+”, “*”, and “?” operators work in exactly the same way as this after a character or group: “+” is equivalent to “{1,}”, “*” to “{0,}”, and “?” to “{0,1}”. (They are, however, frequently used because they are shorter to write.)
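
Both kinds of repetition can be verified outside Text Tools using Python's re module; note that re.findall reports captured group contents rather than whole matches, hence the non-capturing group “(?:…)” in the second pattern:

    import re

    print(re.findall("仁.{0,10}義", "仁,託宿於義"))       # ['仁,託宿於義']
    print(re.findall("(?:人.){2,}", "人來人往,人前人後"))  # ['人來人往', '人前人後']
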
Exercise:
  • Try the two specific examples described above (i.e. “仁.{0,10}義” and “(人.){2,}”).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Each “phrase” (i.e. punctuated section) of text. In other words, the first match should be “道可道”, the second should be “非常道”, and so on.
    • Match each phrase which contains the term “之” in it.
    • Match each phrase which contains the term “之” in it, but neither as the first character nor as the last.
  • Write and test regular expressions to match the following in the Mozi (ctp:mozi):
    • Any occurrences of the character “君” followed anywhere later in the same sentence by “父” (e.g. “君父”, “…君臣父…”, “君臣上下長幼之節,父…”, etc.).

Groups

Aside from repetition, a lot of the power of regular expressions comes from the ability to divide parts of a match into what are called “groups”, and express further conditions using the matched contents of these groups. This makes it possible to express much more sophisticated patterns.

  • Suppose we want to look for expressions like “君不君”, “臣不臣”, “父不父”, etc. – cases where we have some character, followed by a “不”, then followed by that same character from before (i.e. we aren’t trying to match things like “人不知”).
  • We can do this by “capturing” the first character – whatever it may be – in a group, and then requiring later in our expression that we match the contents of that group again in another place.
  • Capturing something in a group is accomplished by putting parentheses around the part to capture – e.g. “(.)” matches any character and captures it in a group.
  • Groups are automatically numbered starting from 1, beginning with the leftmost opening bracket, and moving through our regex from left to right.
  • We can reference the contents of a matched group using the syntax “\1” to match group 1, “\2” to match group 2, etc.
  • So in our example, “(.).\1” matches any character, followed by any character, followed by the first character again (whatever it was). Try this on the text of the Analects, then try modifying the regex so that it only matches non-punctuation characters (i.e. does not match things like “本,本”).

Another example is a common type of patterned repetition such as “禮云禮云” and “已乎已乎”. In this case, we can use exactly the same approach. One way is to write “(..)\1” – match any two characters, then match those same two characters again; another (equivalent) way is to use two separate groups and write “(.)(.)\1\2” – match any character X, then any character Y, then match X again and then Y again.
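
These backreference patterns work unchanged in Python; since re.findall reports group contents rather than whole matches, re.search is used below to display a complete match:

    import re

    text = "君不君,臣不臣,禮云禮云"
    print(re.findall(r"(.)不\1", text))        # ['君', '臣'] – the captured groups
    print(re.search(r"(..)\1", text).group())  # '禮云禮云'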

Exercise:
  • Write and test a regular expression which matches things like “委委佗佗”, “戰戰兢兢”, etc. in the Book of Poetry (ctp:book-of-poetry).
  • Write and test a regular expression which matches complex repetition of the style “XYZ,ZYX” in the Zhuangzi, where each of X, Y, and Z can be 1-5 characters long. Your regex should match things like “知者不言,言者不知”, “道無以興乎世,世無以興乎道”, and “安其所不安,不安其所安”.

Regex replace

The replace function works in a similar way to the regex search function: this function searches within one specified text for a specified regular expression, and replaces all occurrences of it with a specified value. Although the replacement can be a simple string of characters, it can also be designed to vary depending upon the contents of the regular expression. Specifically, anything that has been matched as a group within the search regex can be referenced in the replacement by using the syntax “$1” to include the text matched in group 1, “$2” for group 2, etc. One common use case for regex replacements is to “tidy up” data obtained from some external source, or to prepare it for use in some particular procedure.

For example:

  • Replacing “\W” with “” (an empty string) will delete all punctuation and line breaks from a text.
  • Replacing “^(\w{1,20})$” with “*$1” will add title markers to any lines which contain between 1 and 20 characters, none of which are punctuation characters – this can be useful when importing non-ctext texts.
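
Comparable replacements can be sketched in Python; note that Python's re.sub writes group references in the replacement as \1 (or \g<1>) where Text Tools uses $1, and that Python's “\w” and “\W” do not treat Chinese punctuation specially, so an explicit character class stands in for it here:

    import re

    text = "道可道,非常道。\n名可名,非常名。"
    # Delete the listed punctuation characters and line breaks:
    print(re.sub("[,。\n]", "", text))  # 道可道非常道名可名非常名
    # Add a title marker to any line of 1-20 characters:
    print(re.sub(r"^(.{1,20})$", r"*\1", "學而第一", flags=re.MULTILINE))  # *學而第一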

Identifying differences between versions

The “Diff” tool provides a simple way of performing a character-by-character diff of two similar pieces of text. Unlike the Similarity tool, this tool works best on input texts which are almost (but not quite) identical to one another.

Try using the Diff tool to compare the contents of the 正統道藏 edition of the 太上靈寶天尊說禳災度厄經 (ctp:wb882781) with the 重刊道藏輯要 edition of the same text (ctp:wb524325).
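
The same kind of character-by-character comparison can be sketched using Python's standard difflib module (the variant strings below are invented for illustration):

    import difflib

    a = "太上靈寶天尊說禳災度厄經"
    b = "太上靈寶天尊說禳災度厄真經"
    # Report each run of equal, inserted, deleted, or replaced characters.
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        print(op, a[i1:i2] or "-", "->", b[j1:j2] or "-")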

Network graphs

When you create a graph using the regular expression or similarity tools, the data is exported into the Network tab. For navigation instructions, refer to the “Help” tab. Graphs in the network tab can be entered in a subset of the “GraphViz” format; the graphs created by the other tabs can all be downloaded in this same format. If you would like a more flexible way of creating publication quality graphs, you can download and install Gephi (https://gephi.org/), which is also able to open these files.
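
A graph in this format is just a small text file; the sketch below writes a minimal weighted graph in standard GraphViz DOT syntax (the chapter names and weights are invented for illustration, and the exact subset accepted by the Network tab should be checked against its Help tab), which Gephi can then open:

    # Write a minimal weighted graph in GraphViz DOT format.
    dot = """graph reuse {
      "學而" -- "為政" [weight=12];
      "學而" -- "八佾" [weight=3];
    }"""
    with open("reuse.gv", "w", encoding="utf-8") as f:
        f.write(dot)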

Using other texts

Chinese texts from other sources besides ctext.org can be used with Text Tools. For instructions on how to prepare these, refer to the section on Loading texts on the Help page.
