Crowdsourcing the Historical Record: Creating Linked Open Data for Chinese History at Scale

International Journal of Humanities and Arts Computing

Abstract

An important part of the historical record of premodern China is recorded in historical works such as the standard dynastic histories. These works are a key source of knowledge about many aspects of premodern Chinese civilization, including persons, events, bureaucratic structures, literature, geography and astronomical observations. While many such sources have been digitized, typically these digitized texts encode only literal textual content and do not attempt to model the semantic content of the text. Similarly, while some of the historical data contained in some of these sources has been entered into specialist scholarly databases, an even greater proportion of the information does not yet exist in any machine-readable form. Producing such a machine-readable dataset of these materials requires the effort of many individuals working together due to the large scale of the task. This article introduces a crowdsourced approach in which annotation and knowledge base construction are carried out in parallel, with a knowledge base continually expanded through multi-user contributions to textual annotation immediately and automatically feeding back to provide improved assistance with subsequent annotation. The resulting knowledge base is dynamically exposed through Linked Open Data interfaces, creating a continually expanding machine-readable dataset covering around 3,000 years of recorded history.

Posted in Chinese, Digital Humanities, Talks and conference papers | Comments Off

ctext Data Wiki tutorial

Getting started

The Data Wiki is a crowdsourced graph database that contains information about entities (people, places, offices, written works, etc.) that are mentioned in premodern Chinese texts. Each “page” of the Data Wiki lists information about one entity – for example, 王安石 Wang Anshi, 樞密使 Shumishi, or the 明史 History of the Ming.

Every entity in the Data Wiki has its own unique identifier, which always consists of a string beginning with “ctext:” followed by a sequence of digits. This is important because it allows the system to distinguish between things that can be referred to by the same name, and to treat references to the same thing as equivalent even when different names are used for it. For example, the name “大順” could refer to an era of the Tang dynasty, to the 天順 era of Vietnam’s Lý dynasty (for which it is another possible name), or to the short-lived dynasty of 李自成.
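As a rough illustration of the identifier format described above, the following Python sketch checks whether a string has the shape of a Data Wiki entity identifier. The pattern (“ctext:” followed by one or more digits) is inferred from the description here, and the function name is hypothetical rather than part of any ctext.org API.

```python
import re

# Assumption: an entity identifier is "ctext:" followed by one or more digits.
ENTITY_ID = re.compile(r"ctext:[0-9]+")


def looks_like_entity_id(value: str) -> bool:
    """Return True if value has the shape of a Data Wiki entity identifier."""
    return ENTITY_ID.fullmatch(value) is not None


# A name such as "大順" is not an identifier: it may map to several entities.
print(looks_like_entity_id("ctext:82307"))
print(looks_like_entity_id("大順"))
```

A check like this only validates the shape of the string; whether an identifier actually exists can only be determined by looking it up in the Data Wiki.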

Exercise:

The Data Wiki always displays the identifier for an entity at the top of the page, just below its title. You can search for entities by typing a name by which the entity is referred to into the “Data Search” box at the left-hand side of the screen.

  • What is the identifier for the office of 樞密使 Shumishi?
  • What is the identifier for the Ming dynasty person 楊俊 who died in 1457?

Properties and qualifiers

Apart from the identifier, all other data in the Data Wiki consists of “claims” about entities. Each claim connects:

  1. An entity (the subject of the claim)
  2. A property: an entity, which must first itself be defined (list of current properties)
  3. An object: either an entity, or a literal (usually a string, or a date)

Each row of an entity record (i.e. the table displayed when you look up an entity) represents one “claim” about that entity.

For example, the claim that the father of 諸葛亮 Zhuge Liang was 諸葛珪 Zhuge Gui is represented by a claim using the “father” property, connecting:

  1. Subject: 諸葛亮 Zhuge Liang
  2. Property: father
  3. Object: 諸葛珪 Zhuge Gui

This corresponds to a machine-readable triple connecting three entities: ctext:82307 ctext:539391 ctext:167600 . This is closely related (though not identical) to the RDF representation of the same claim.
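The subject, property, and object structure of a claim can be sketched in Python as follows. This is an illustrative data structure only, not the Data Wiki's internal representation; the class and function names are hypothetical, and the sketch assumes that the three identifiers in the triple above appear in subject, property, object order.

```python
from typing import NamedTuple


class Claim(NamedTuple):
    """One claim: a subject entity, a property entity, and an object."""
    subject: str
    prop: str
    obj: str


def to_triple_line(claim: Claim) -> str:
    """Serialize a claim in the 'subject property object .' form shown above."""
    return f"{claim.subject} {claim.prop} {claim.obj} ."


# Assumed mapping: 諸葛亮 (subject), father (property), 諸葛珪 (object).
father_claim = Claim("ctext:82307", "ctext:539391", "ctext:167600")
print(to_triple_line(father_claim))
```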

Each claim may also have one or more qualifiers (list of current qualifiers), each of which is again paired with a corresponding object. This allows additional contextual information to be added to a specific claim. For example, we might have a claim that 王珪 held the office of 參知政事:

  1. Subject: 王珪
  2. Property: held-office
  3. Object: 參知政事

If we also know from what date he held the office, we can qualify the claim with the qualifier “from-date” with the date from which he held the office as its object:

  1. Qualifier: from-date
  2. Object: 熙寧三年十二月丁卯

You can see how this claim – together with this qualifier to it – is displayed in the entity record for 王珪.
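One way to picture a qualified claim is as a claim that carries its own small set of qualifier and object pairs. The Python sketch below uses the names from the example above as plain strings; the class layout is an assumption made for illustration and does not reflect how the Data Wiki stores or exports its data.

```python
from dataclasses import dataclass, field


@dataclass
class QualifiedClaim:
    """A claim plus zero or more qualifiers, each paired with its own object."""
    subject: str
    prop: str
    obj: str
    qualifiers: dict[str, str] = field(default_factory=dict)


# The claim that 王珪 held the office of 參知政事, qualified by a start date.
claim = QualifiedClaim(
    subject="王珪",
    prop="held-office",
    obj="參知政事",
    qualifiers={"from-date": "熙寧三年十二月丁卯"},
)

# Qualifiers attach to this specific claim, not to the subject as a whole.
for name, value in claim.qualifiers.items():
    print(f"{claim.subject} {claim.prop} {claim.obj} [{name}: {value}]")
```

Keeping the qualifiers on the claim itself means that a second office held by the same person can carry a different “from-date” without any ambiguity.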

Exercise:
  • Write a query to list all people who have held an office with a title “…總督”
Exercise:
  • Write a query to list all written works that are indexed in the 清史稿
  • Modify the query to list instead:
    1. all works that are indexed in both 清史稿 and 四庫全書總目提要
    2. all works that are indexed in the 清史稿 but not in 四庫全書總目提要
    3. all works that are indexed in the 清史稿 and mentioned in the text of the 四庫全書總目提要 – does this give the same result as query 1?

Working with dates

The Data Wiki (and other components of ctext.org) understand Chinese dates and can convert them to and from Gregorian/Julian calendar dates.

Exercise:
  • Try searching the Data Wiki for an historical date, such as “大順元年” or “天成四年五月四日”. Try clicking through the two alternative choices given in the “Resolved date”, “Era/ruler”, and “Associated rulers” columns
  • Experiment with other date references

Although dates in this system are not strictly entities, they have analogous identifiers that begin with “date:…”. These distinguish explicitly between different eras that happen to share the same name, and record the content of dates in different historically used formats in ways that make them machine processable. For example, in the exercise above, “天成四年五月四日” is ambiguous – even if we exclude the possibility of one of the three candidate eras on the grounds that it had no fourth year. The two possibilities have different identifiers: date:794498/4/5/+4 (if we mean the Later Tang/Min 天成), and date:587624/4/5/+4 (if we mean the Lý dynasty 天成).
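To make the structure of these identifiers easier to see, the Python sketch below splits a date identifier into its slash-separated components. The labels “era”, “year”, “month”, and “day” are assumptions inferred from the 天成四年五月四日 example; the components are kept as raw strings because the exact meaning of markers such as “+” is not documented here. The function name is hypothetical.

```python
def split_date_id(date_id: str) -> dict[str, str]:
    """Split a 'date:...' identifier into labelled string components.

    Assumption: components appear in era/year/month/day order, and an
    identifier may contain fewer than four of them.
    """
    prefix, sep, rest = date_id.partition(":")
    if prefix != "date" or not sep or not rest:
        raise ValueError(f"not a date identifier: {date_id!r}")
    labels = ("era", "year", "month", "day")
    return dict(zip(labels, rest.split("/")))


later_tang = split_date_id("date:794498/4/5/+4")
ly_dynasty = split_date_id("date:587624/4/5/+4")

# The two readings differ only in the first (era) component.
differing = [k for k in later_tang if later_tang[k] != ly_dynasty[k]]
print(differing)
```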

It’s important to note that “the same day” can be referenced in different ways, and these have different identifiers. For example, we could specify these same dates equivalently (but with different semantics) using 干支 for the year, day, or both:

All four representations have different identifiers that directly mirror their semantics, but will be treated as equivalent dates when searching texts because they resolve to the same day in history.

Exercise:

By looking up the data for 安祿山:

  1. According to the record in the 新唐書, in what month did 安祿山 rebel against the Tang dynasty?
  2. Using this information, locate the passages in the 新唐書 and 舊唐書 that explicitly reference a) this month, and b) this year.

ESSCS workshop

Setup:

  • Recommended web browsers: Firefox or Chrome; Safari and Edge should also work for most tasks, but have not been fully tested.
  • Create a ctext account and log in
  • Check your e-mail (and spam folder) for an e-mail sent from the system, and click the link in the e-mail to validate your account.
  • Go to “Settings” at the bottom left, enter the API key specified in the live session in the box under “API key”, and click “Save”
  • Install the Text Tools plugin by opening this link, and then clicking “Install”
  • Install the Annotation plugin by opening this link, and then clicking “Install”

Tutorials

Some parts of the material that will be covered in the session are available in step-by-step tutorials, which also include other details and examples and might be useful if you want to come back to the material later:

  1. Practical introduction to ctext.org – interactive guide to core functionality of the Chinese Text Project.
  2. Text Tools for ctext.org – interactive guide to using the Text Tools plugin for the Chinese Text Project for text mining and data visualization.
  3. ctext.org Data Wiki tutorial – interactive guide to working with the Data wiki
  4. the posts on text reuse and regular expressions on Digital Sinology.

For those interested in using ctext with Python (or another programming language), see also:

These parts of the ctext.org instructions should also be useful:

Lastly, some of these papers may be of interest:

Local (non-ctext) annotation example
Download and save the file: qidan-guozhi1.xml (契丹國志卷一). You can load this into the Annotation client without using the ctext.org API.

Using materials not in classical Chinese

If we have time, we’ll quickly look at how this works in Text Tools. For simplicity, here are links to two very simple sets of example data:

  1. English example: alice_in_wonderland.txt and alice_adventures_underground.txt
  2. Modern Chinese: articles.zip: a zip file containing a few modern newspaper articles

Constructing a crowdsourced linked open knowledge base of Chinese history

Paper presented at Pacific Neighborhood Consortium Conference 2021

Abstract

This paper introduces a crowdsourced approach to knowledge base construction for historical data based upon annotation of historical source materials. Building on an existing digital library of premodern Chinese texts and adapting techniques from other annotation and knowledge base projects, this lays the groundwork for a scalable, sustainable, linked open repository of data covering around 3000 years of recorded Chinese history.

[Full text via IEEE (paywall)]


AAS 2021

Zoom link: https://durhamuniversity.zoom.us/j/91336059453?pwd=azdKZ0srR0c5Z0VsNEp6bzJOV3ZuQT09

Setup:

  • Recommended web browsers: Firefox or Chrome; Safari and Edge should also work for most tasks, but have not been fully tested.
  • Create a ctext account and log in
  • Check your e-mail (and spam folder) for an e-mail sent from the system, and click the link in the e-mail to validate your account.
  • Go to “Settings” at the bottom left, enter the API key specified in the live session in the box under “API key”, and click “Save”
  • Install the Text Tools plugin by opening this link, and then clicking “Install”
  • Install the Annotation plugin by opening this link, and then clicking “Install”

Some parts of the material that will be covered in the session are available in step-by-step tutorials, which also include other details and examples and might be useful if you want to come back to the material later:

These parts of the ctext.org instructions should also be useful:

Lastly, some of these papers may be of interest:


Crowdsourcing Chinese history: distributed transcription, annotation, and datafication

Poster presented at Linked Pasts 2020

For more details, please see the extended abstract “Crowdsourcing the historical record: knowledge base construction for Chinese history at scale” presented at Taiwan’s DADH 2020 conference.

Abstract

The tasks of semantically annotating historical primary source materials and systematically recording knowledge about historical entities have closely connected conceptual relationships. Annotations can be leveraged to extract knowledge about entities, and knowledge about entities can be leveraged to aid in the efficient annotation of texts. Both tasks also rely in practice upon accurate transcriptions of primary source materials, which – like annotations and structured knowledge – are costly to produce in the first instance. This poster describes a crowdsourced approach in which all three tasks are carried out in parallel, involving not only the distributed transcription of texts, but also the creation of a knowledge base continually expanded through user contributions made partly through annotation, immediately and automatically contributing to improved automated assistance with ongoing and future crowdsourced annotation.


Digitizing Premodern Text with the Chinese Text Project

Paper published in Journal of Chinese History

Abstract

The widespread availability of digitized premodern textual sources – together with increasingly sophisticated means for their manipulation – has brought enormous practical benefits to scholars whose work relies upon reference to their contents. While great progress has been made with the construction of ever more comprehensive database systems and archives, far more remains not only possible but also realistically achievable in the near future. This paper discusses some of the key challenges faced, and progress made towards solving them, in the context of a widely used open digital platform attempting to expand the range of digitized sources available while simultaneously increasing the scope of meaningful tasks that can be performed with them computationally. This paper aims to suggest how seemingly simple human-mediated additions to the digitized historical record – when combined with the power of digital systems to repeatedly perform mechanical tasks at enormous scales – quickly lead to transformative changes in the feasible scope of computational analysis of premodern writing.

Full text through publisher site (PDF, paywall) / Free online version (full text but no PDF)

Part of a JCH Special Issue on Digital Humanities.


MARAAS workshop

Materials from a workshop held as part of the MARAAS Conference: Asian Studies in the Digital Age at Dickinson College, Carlisle, PA. [Download slides]

Setup

  • Create a free account on ctext.org and log in.
  • Make sure to validate your e-mail address by opening the link the system sent you (if not, the link above will display a warning/reminder in red to do so).
  • Enter the API key in the box labeled “API key”, and click “Save”.

We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials with a few changes and a few new features not yet included in the tutorials.

Link to Text Tools: http://ctext.org/plugins/texttools/#help

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
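The difference between the two settings can be sketched in Python as follows. This is only an approximation for illustration: how Text Tools actually tokenizes is not described here, and the function name and the whitespace-based fallback are assumptions.

```python
def tokenize(text: str, by_character: bool) -> list[str]:
    """Split text into tokens.

    by_character=True treats every non-whitespace character as a token,
    which suits classical Chinese, where words are not space-delimited.
    by_character=False splits on whitespace, which suits English.
    """
    if by_character:
        return [ch for ch in text if not ch.isspace()]
    return text.split()


print(tokenize("學而時習之", by_character=True))
print(tokenize("down the rabbit hole", by_character=False))
```

Using the wrong setting changes what is counted: with character tokenization switched on, an English text is compared letter by letter rather than word by word.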

Overviews of functionality

The following give some basic illustrations of what can be done in Text Tools through concrete examples:

Other suggested examples

As well as the examples shown in the tutorials:

  • To see how the tool works with tokenized materials, download the following English text files (e.g. right-click each link and choose “Save as”):
  • Try some regexes on the English examples. A useful expression is likely to be “\w+” – any sequence of word characters (letters, digits, and underscores; intuitively, a word). Try as an example “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
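For readers who want to try the biography regex outside Text Tools, the following Python sketch applies it to two sample sentences in the style of 宋史 biography openings. It approximates what “Extract groups” produces and is not the tool's own implementation. One adjustment is mine: the comma is written as a character class accepting both ASCII and full-width commas, so that the pattern works regardless of which punctuation a transcription uses.

```python
import re

# Name, courtesy name (字), and place of origin, following the example regex.
# Assumption: [,,] is used so that either comma style in the text will match.
BIOGRAPHY = re.compile(r"(\w+)[,,]字(\w+)[,,](\w+)人。")

# Illustrative sample text; in practice this would be a chapter of the 宋史.
sample = "王安石,字介甫,撫州臨川人。\n蘇軾,字子瞻,眉州眉山人。"

# findall returns one tuple per match, containing the three captured groups.
rows = BIOGRAPHY.findall(sample)
for name, courtesy_name, origin in rows:
    print(name, courtesy_name, origin)
```

In Python 3, “\w” matches Chinese characters as well as Latin letters, while punctuation such as “,” and “。” is excluded, which is what allows each group to stop at the right place.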

Related research


Durham Institute for Data Science (IDAS) launch

It was a pleasure to take part in the Durham Institute for Data Science (IDAS) launch event.

The slides from my talk, Interactive text mining and visualization in the humanities, are available online.


Old texts in a new world: Meaning production in the digital medium

Paper presented at Materiality of Knowledge in Chinese Thought: Past and Present, Yuelu Academy

Abstract

Throughout history, technical innovations in the production and transmission of written materials have often had far-reaching long-term consequences for knowledge production – from the standardization of writing forms, to the development of dictionaries and encyclopedias, to the availability and spread of printing and copying technologies. In this paper, I focus on the ongoing impact of the most recent such development: digitization and increasing use of digital modes of interaction with premodern textual materials.

Since premodern Chinese documents first became available to scholars in digital form, the existence of digital texts has caused gradual but significant changes in mainstream scholarly workflows and expectations. Full-text repositories and digital libraries now make available in seconds to anyone on the planet premodern materials on a scale once impossible for anyone other than a determined emperor to obtain, while making similarly fantastic reductions in time and effort required to retrieve certain types of information. At the same time, even more dramatic changes have begun to take place as a consequence of digitization together with the ever-increasing sophistication and power of digital systems. Faced with larger volumes of material than any individual could ever expect to read – let alone claim detailed knowledge of – text mining and distant reading approaches offer the promise of gleaning useful information from exhaustive statistical analyses at scales not achievable through traditional means. Data-driven approaches – already well developed in other disciplines – similarly enable digital approaches to historical studies in which evidence can be systematically assembled at large enough scales to solidly ground statistical claims about broad historical and societal changes over time. This paper explores the development of these approaches, and the consequences for knowledge production in the digital age.
