Digital Research Tools for Pre-modern Chinese Texts

Interactive workshop 9:00am-12:00pm, November 18, 2017, held in B129, Northwest Building, 52 Oxford St., Cambridge, MA 02138
RSVP at https://goo.gl/ac1K96

Digital methods offer increasingly powerful tools to aid in the study and analysis of historical written works, both through exploratory techniques in which previously unnoticed trends and relationships are highlighted, and through computer-assisted assembly of data to confirm or refute particular hypotheses. Applying such techniques in practice often requires first overcoming technical challenges – in particular, obtaining access to machine-readable editions of the desired texts, as well as to tools capable of performing such analyses.

This hands-on practical workshop introduces approaches intended to reduce the technical barriers to experimenting with these techniques and evaluating their utility for particular scholarly uses. The first part of this workshop introduces the Chinese Text Project, which has grown to become the largest full-text digital library of pre-modern Chinese. While on the one hand the website offers a simple means to access commonly used functions such as full-text search for a wide range of pre-modern Chinese sources, at the same time it also provides more sophisticated mechanisms allowing for more open-ended use of its contents, as well as the ability to contribute directly to the digitization of entirely new materials.

The second part of the workshop introduces tools for performing digital textual analysis of Chinese-language materials, which may be obtained from the Chinese Text Project or elsewhere. These include identification of text reuse within and between written materials, sophisticated pattern search using regular expressions, and visualization of the results of these and other types of analysis.


Unsupervised identification of text reuse in early Chinese literature

This paper will appear in Digital Scholarship in the Humanities (currently available in “Advance articles”).

Text reuse in early Chinese transmitted texts is extensive and widespread, often reflecting complex textual histories involving repeated transcription, compilation, and editing spanning many centuries and involving the work of multiple authors and editors. In this study, a fully automated method of identifying and representing complex text reuse patterns is presented, and the results evaluated by comparison to a manually compiled reference work. The resultant data is integrated into a widely used and publicly available online database system with browse, search, and visualization functionality. These same results are then aggregated to create a model of text reuse relationships at a corpus level, revealing patterns of systematic reuse among groups of texts. Lastly, the large number of reuse instances identified make possible the analysis of frequently observed string substitutions, which are observed to be strongly indicative of partial synonymy between strings.

Download the full paper – this link should give you access to the PDF even if you are not accessing it from a subscribing institution.


Linking, sharing, merging: sustainable digital infrastructure for complex biographical data

Paper to be presented at Biographical Data in a Digital World, 6 November 2017, Linz.

In modeling complex humanities data, projects working within a particular domain often have overlapping but distinct priorities and goals. One common result of this is that separate systems contain overlapping data: some of the objects modeled are common to more than one system, though how they are represented may be very different in each.

While within a particular domain it can be desirable for projects to standardize their data structures and formats in order to allow for more efficient linking and exchange of data between projects, for complex datasets this can be an ambitious task in itself. An alternative approach is to identify a core set of data which it would be most beneficial to be able to query in aggregate across systems, and provide mechanisms for sharing and maintaining this data as a means through which to link between projects.

For biographical data, the clearest example of this is information about the same individual appearing in multiple systems. Focusing on this particular case, this talk presents one approach to creating – and sustaining with minimal maintenance – a means of establishing machine-actionable links between datasets maintained and developed by different groups, while also promoting more ambitious data sharing.

This model consists of three components: 1) schema maintainers, who define and publish a format for sharing data; 2) data providers, who make data available according to a published schema; and 3) client systems, which aggregate the data from one or more data providers adhering to a common schema. This can be used to implement a sustainable union catalog of the data, in which the catalog provides a means to directly locate information in any of the connected systems, but is not itself responsible for maintenance of data. The model is designed to be general-purpose and to extend naturally to similar use cases.
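
As a concrete illustration of this model, the sketch below mocks up all three components in a few lines of Python – the identifiers, field names, URLs, and record format are purely hypothetical illustrations, not a published schema:

    # Minimal sketch of the three-component model (hypothetical schema).
    # 1) The schema maintainer publishes a record format: one object per
    #    person, linking a shared identifier to records in individual systems.
    # 2) Each data provider makes available records following that schema.
    provider_a = [{"person_id": "p:0001", "names": ["王安石"],
                   "records": [{"system": "database-a", "url": "https://example.org/a/123"}]}]
    provider_b = [{"person_id": "p:0001", "names": ["王安石", "王介甫"],
                   "records": [{"system": "database-b", "url": "https://example.org/b/456"}]}]

    # 3) A client system aggregates provider data into a union catalog keyed
    #    by the shared identifier; the catalog locates records in the
    #    connected systems but does not itself own or maintain them.
    def build_union_catalog(*providers):
        catalog = {}
        for records in providers:
            for record in records:
                entry = catalog.setdefault(record["person_id"], {"names": [], "records": []})
                entry["names"] += [n for n in record["names"] if n not in entry["names"]]
                entry["records"] += record["records"]
        return catalog

    print(build_union_catalog(provider_a, provider_b))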


Pusan National University

I’m very excited to be visiting the Department of Korean Literature in Classical Chinese at Pusan National University next week to give two talks – abstracts follow:

Old Meets New: Digital Opportunities in the Humanities
28th September 2017, 10am-12pm

The application of digital methods has brought enormous benefits to many fields of study, not only by offering more efficient ways of conducting research and teaching along traditional lines, but also by opening up entirely new directions and research questions which would have been impractical or even impossible to pursue prior to the digital age. This digital revolution offers new and exciting opportunities for many humanities subjects – including Chinese studies. Through use of computer software, digital techniques make possible large-scale studies of volumes of material which would once have been entirely impractical to study in depth due to the time and manual effort required to assemble and process the source materials. Even more excitingly, they offer the opportunity to apply sophisticated statistical techniques to yield new insight into, and understanding of, important humanities questions. In this talk I introduce examples of how and why computational methods are making possible new types of studies in the humanities in general, and in the study of Chinese literature and history in particular.

Computational Approaches to Chinese Literature
28th September 2017, 4-6pm

Digital methods and the emerging field of digital humanities are revolutionizing the study of literature and history. In the first part of this talk, I present the results of a computational study of parallel passages in the pre-Qin and Han corpus and use it to demonstrate how digital methods can provide new insights in the field of pre-modern Chinese literature. This study begins by implementing an automated procedure for identifying pairs of parallel passages, which is demonstrated to be more effective than prior work by human experts. The procedure is used to identify hundreds of thousands of parallels within the classical Chinese corpus, and the resulting data aggregated in order to study broader trends. The results of this quantitative study not only enable far more precise evaluation of claims made by traditional scholarship, but also the investigation of patterns of text reuse at a corpus level.

The second part of the talk introduces the Chinese Text Project digital library and associated tools for textual analysis of Chinese literature. Taken together, these provide a uniquely flexible platform for digital textual analysis of pre-modern Chinese writing, which allows for rapid experimentation with a range of digital techniques without requiring specialized technical or programming skills. Methods introduced include automated identification of text reuse, pattern matching using regular expressions, and network visualization.


JADH Poster: DH research and teaching with digital library APIs

At this year’s Japanese Association for Digital Humanities conference, as well as giving a keynote on digital infrastructure, I also presented this poster on the specific example of full-text digital library APIs being used in ctext.org and for teaching at Harvard EALC.

Abstract

As digital libraries continue to grow in size and scope, their contents present ever increasing opportunities for use in data mining as well as digital humanities research and teaching. At the same time, the contents of the largest such libraries tend towards being dynamic rather than static collections of information, changing over time as new materials are added and existing materials augmented in various ways. Application Programming Interfaces (APIs) provide efficient mechanisms by which to access materials from digital libraries for data mining and digital humanities use, as well as by which to enable the distributed development of related tools. Here I present a working example of an API developed for the Chinese Text Project digital library being used to facilitate digital humanities research and teaching, while also enabling distributed development of related tools without requiring centralized administration or coordination.

Firstly, for data-mining, digital humanities teaching and research use, the API facilitates direct access to textual data and metadata in machine-readable format. In the implementation described, the API itself consists of a set of documented HTTP endpoints returning structured data in JSON format. Textual objects are identified and requested by means of stable identifiers, which can be obtained programmatically through the API itself, as well as manually through the digital library’s existing public user interface. To further facilitate use of the API by end users, native modules for several programming environments (currently including Python and JavaScript) are also provided, wrapping API calls in methods adapted to the specific environment. Though not required in order to make use of the API, these native modules greatly simplify the most common use cases, further abstract details of implementation, and make possible the creation of programs performing sophisticated operations on arbitrary textual objects using a few lines of easily understandable code. This has obvious applications in digital humanities teaching, where simple and efficient access to data in consistent formats is of considerable importance when covering complex subjects within a limited amount of classroom or lab time; it also facilitates research use, in which the ability to rapidly experiment with different materials, as well as to prototype and reuse code with minimal effort, is of practical utility.
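
As a minimal sketch (assuming a “gettext” endpoint taking a “urn” parameter and returning a JSON object containing a “fulltext” field, per the public API documentation), such a request might look like this using only the Python standard library:

    import json
    from urllib.request import urlopen

    def fetch_fulltext(urn):
        """Request a textual object by its stable URN identifier and return
        any paragraphs of full text included in the JSON response."""
        with urlopen("https://api.ctext.org/gettext?urn=" + urn) as response:
            data = json.load(response)
        return data.get("fulltext", [])

    # Print the first two paragraphs of the requested chapter.
    print(fetch_fulltext("ctp:analects/xue-er")[:2])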

Secondly, along with the API itself, the provision of a plugin mechanism allowing the creation of user-definable extensions to the library’s online user interface makes possible augmentation of core library functionality through the use of external tools in ways that are transparent and intuitive to end users while also not requiring centralized coordination or approval to create or modify. Plugins consist of user-defined, sharable XML resource descriptions which can be installed into individual user accounts; the user interface uses information contained in these descriptions – such as link schemas – to send appropriate data such as textual object references to specified external resources, which can then request full-text data, metadata, and other relevant content via API and perform task-specific processing on the requested data. Any user can create a new plugin, share it with others, and take responsibility for future updates to their plugin code, without requiring central approval or coordination.

This technical framework enables a distributed web-based development model in which external projects can be loosely integrated with the digital library and its user interface, from an end user perspective being well integrated with the library, while from a technical standpoint being developed and maintained entirely independently. Currently available applications using this approach include simple plugins for basic functionality such as full-text export, the “Text Tools” plugin for textual analysis, and the “MARKUS” named entity markup interface for historical Chinese texts developed by Brent Ho and Hilde De Weerdt, as well as a large number of external online dictionaries. The “Text Tools” plugin provides a range of common text processing services and visualization methods, such as n-gram statistics, similarity comparisons of textual materials based on n-gram shingling, and regular expression search and replace, along with network graph, word cloud, and chart visualizations; “MARKUS” uses external databases of Chinese named entities together with a custom interface to mark-up texts for further analysis. Because of the standardization of format imposed by the API layer, such plugins have access not only to structured metadata about texts and editions, but also to structural information about the text itself, such as data on divisions of texts into individual chapters and paragraphs. For example, in the case of the “Text Tools” plugin this information can be used by the user to aggregate regular expression results and perform similarity comparisons by text, by chapter or by paragraph, in the latter two cases also making possible visualization of results using the integrated network graphing tool. As these tasks are facilitated by API, tools such as these can be developed and maintained without requiring knowledge of or access to the digital library’s code base or internal data structures; from an end user perspective, these plugins do not require technical knowledge to use, and can be accessed as direct extensions to the primary user interface. This distributed model of development has the potential to greatly expand the range of available features and use cases of this and other digital libraries, by providing a practical separation of concerns of data and metadata creation and curation on the one hand, and text mining, markup, visualization, and other tasks on the other, while simultaneously allowing this technical division to remain largely transparent to a user of these separately maintained and developed tools and platforms.


Collaboration at scale: emerging infrastructures for digital scholarship

Keynote lecture, Japanese Association for Digital Humanities (JADH 2017), Kyoto

Abstract

Modern technological society is possible only as a result of collaborations constantly taking place between countless individuals and groups working on tasks which at first glance may seem independent from one another yet are ultimately connected through complex interdependencies. Just as technological progress is not merely a story of ever more sophisticated technologies, but also of the evolution of increasingly efficient structures facilitating their development, so too scholarship moves forward not just by the creation of ever more nuanced ideas and theories, but also by increasingly powerful means of identifying, exchanging, and building upon these ideas.

The digital medium presents revolutionary opportunities for facilitating such tasks in humanities scholarship. Most obviously, it offers the ability to perform certain types of analyses on scales larger than would ever have been practical without use of computational methods – for example the examination of trends in word usage across millions of books, or visualizations of the social interactions of tens of thousands of historical individuals. But it also presents opportunities for vastly more scalable methods of collaboration between individuals and groups working on distinct yet related projects. Simple examples are readily available: computer scientists develop and publish code through open source platforms, companies further adapt it for use in commercial systems, and humanities scholars apply it to their own research; libraries digitize and share historical works from their collections, which are transcribed by volunteers, searched and read by researchers, and cited in scholarly works.

Much of the infrastructure already in use in digital scholarship is infrastructure developed for more general-purpose use – a natural and desirable development given the obvious economies of scale which result from this. However, as the application of digital methods in humanities scholarship becomes increasingly mainstream, as digitized objects of study become more numerous, and as related digital techniques become more specialized, the value of infrastructure designed specifically to support scholarship in particular fields of study becomes increasingly apparent. This paper will examine the types of humanities infrastructure projects which are emerging, and the potential they have to facilitate scalable collaboration within and beyond distributed scholarly communities.


Digital humanities and the digital library

Subtitled “OCR, crowdsourcing, and text mining of Chinese historical texts”

Paper to be presented at the CADAL Project Work Conference on Digital Resources Sharing and Application, Zhejiang University, 16 June 2017.


数字人文与数字图书馆:中国历代文献的文字识别、群众外包及文本挖掘

本次演讲介绍中国哲学书电子化计划中的主要技术。中国哲学书电子化计划是全球最大规模的前现代中文传世文献电子图书馆之一,目前,每日有25,000多用户使用其公开操作界面。主要原创技术可归类为三种:(一)前现代中文资料的文字识别技术(OCR)、(二)借用大量用户劳力的群众外包界面、(三)既实现与其它线上工具之间的整合、又提供文本挖掘途径的开放式应用程式界面(API)。

第一个原创技术是专门为中国前现代文献设计的文字识别技术。此技术利用前现代文献常见的写作、印刷特征以及已数字化的大量文献来实现具有高精确性以及扩充性的文字识别系统。该系统已处理2,500多万页资料,其结果已在网络上公开。

第二,通过独特的群众外包界面,世界各地的用户可纠正文字识别错误,补充后设资料,从而能够及时参与数字化过程并积极协助内容的扩展。全球用户每日提供上百次的校勘,系统将此及时储存到具有版本控制功能的数据库。

第三,系统的应用程式界面可用于文本挖掘,亦可用于扩充一般使用界面的功能,从而有效地借用日益增长的资料库文本内容来达到数字人文研究和教学的目的。通过此应用程式界面,为Python等程式语言所开发的专门组件可用于数字人文教学;JavaScript组件便于他人开发易用的线上工具,使他人所开发的应用工具能够直接读取和操作电子图书馆中的各种内容。

In this talk I present an overview of key technologies used in the Chinese Text Project, one of the largest digital libraries of pre-modern Chinese transmitted texts, the public user interface of which is currently used by over 25,000 people every day. Key technologies used fall into three main categories: Optical Character Recognition (OCR) for pre-modern Chinese texts, a practical and successful crowdsourcing interface taking advantage of a large base of users, and an open Application Programming Interface allowing both integration with other online tools and projects as well as open-ended use for text mining purposes.

Firstly, specialized OCR techniques have been developed for pre-modern Chinese texts. These techniques leverage aspects of common writing and printing styles, together with a large existing body of transcribed textual material, to implement an OCR pipeline with high accuracy and scalability. These techniques have so far been applied to over 25 million pages of pre-modern Chinese texts, and the results made freely available online.

Secondly, a unique crowdsourcing interface for editing texts created primarily via OCR enables users to correct mistakes and add additional information and metadata, allowing users around the world to meaningfully and immediately contribute to the project and to actively participate in the curation of its contents. Hundreds of corrections are received from users every day and immediately applied to the version-controlled texts.

Thirdly, the creation of a specialized API for text mining use and extension of the primary user interface enables efficient access to the ever-growing data set for use in digital humanities research and teaching. Creation of specialized modules for programming languages such as Python allows for intuitive use in digital humanities teaching contexts, while simple access via JavaScript enables the creation of easy-to-use online tools which can directly access and operate on textual materials stored in the library.


Crowdsourcing a digital library of pre-modern Chinese

Seminar in the Digital Classicist London 2017 series at the Institute of Classical Studies, University of London, 9 June 2017.

Traditional digital libraries, including those in the field of pre-modern Chinese, have typically followed top-down, centralized, and static models of content creation and curation. This is a natural and well-grounded strategy for database design and implementation, with strong roots in traditional academic publishing models, and offering clear technical advantages over alternative approaches. This strategy, however, is unable to adequately meet the challenges of increasingly large-scale digitization and the resulting rapid growth in available corpus size.

In this talk I present a working example of a dynamic alternative to the conventional static model. This alternative leverages a large, distributed community of users, many of whom may not be affiliated with mainstream academia, to curate material in a way that is distributed, scalable, and does not rely upon centralized editing. In the particular case presented, initial transcriptions of scanned pre-modern works are created automatically using specially developed OCR techniques and immediately published in an online open access digital library platform called the Chinese Text Project. The online platform uses this data to implement full-text search, image search, full-text export and other features, while simultaneously facilitating correction of initial OCR results by a geographically distributed group of pseudonymous volunteer users. The online platform described is currently used by around 25,000 individual users each day. User-submitted corrections are immediately applied to the publicly available version-controlled transcriptions without prior review, but are easily validated visually by other users using simple semi-automated mechanisms. This approach allows immediate access to a “long tail” of less popular and less mainstream material which would otherwise likely be overlooked for inclusion in this type of full-text database system. To date the procedure described has been applied to over 25 million pages of historical texts, including 5 million pages from the Harvard-Yenching Library collection, and the complete results published online.

In addition to the online platform, the development of an open plugin system and API – allowing customization of the user interface with user-defined extensions, and immediate machine-readable access to full-text data and metadata – has made possible many further use cases. These include efficient, distributed collaboration and integration with other online web platforms, including projects based at Leiden University, Academia Sinica, and elsewhere, as well as use in data mining, digital humanities research and teaching, and as a self-service tool for projects requiring the creation of proofread transcriptions of particular early texts. A Python library has also been created to further encourage use of the API; in the final part of the talk I explain how the API, together with this Python library, is currently being used to facilitate – and greatly simplify – digital humanities teaching at Harvard’s Department of East Asian Languages and Civilizations.


Unsupervised Extraction of Training Data for Pre-Modern Chinese OCR

Published in the Proceedings of the 30th International Florida Artificial Intelligence Research Society Conference (FLAIRS-30), 2017.

Abstract

Many mainstream OCR techniques involve training a character recognition model using labeled exemplary images of each individual character to be recognized. For modern printed writing, such data can easily be created by automated methods, such as rasterizing appropriate font data to produce clean example images. For historical OCR in printing and writing styles distinct from those embodied in modern fonts, appropriate character images must instead be extracted from actual historical documents to achieve good recognition accuracy. For languages with small character sets it may be feasible to perform this process manually, but for languages with many thousands of characters, such as Chinese, manually collecting this data is often not practical.

This paper presents an unsupervised method to extract this data from two unaligned, unstructured, and noisy inputs: firstly, a corpus of transcribed documents; secondly, a corpus of scanned documents in the desired printing or writing style, some fraction of which are editions of texts included in the transcription corpus. The unsupervised procedure described is demonstrated to be capable of using this data, together with an OCR engine trained only on modern printed Chinese, to retrain the same engine to recognize pre-modern Chinese texts with a 43% reduction in overall error rate.

[Full paper]


Text Tools for ctext.org

This tutorial introduces some of the main functionality of the “Text Tools” plugin for the Chinese Text Project database and digital library along with suggested example tasks and use cases.

[Online version of this tutorial: https://dsturgeon.net/texttools]

Initial setup

  • If you haven’t used the Chinese Text Project before, please refer to the tutorial “Practical introduction to ctext.org” for details of how to create a ctext.org account and install a plugin.
  • Make sure you are logged in to your ctext.org account.
  • If you have an API key, save it into your ctext.org account using the settings page. Alternatively, if your institution subscribes to ctext.org and you are not using a computer on your university’s local network, follow your university’s instructions to connect to its VPN.
  • Install the “Text Tools” plugin (installation link) – you only need to do this once.
  • Once these steps have been completed, when you open a text or chapter of text on ctext.org, you should see a link to the Text Tools plugin.

Getting started

The Text Tools program has a number of different pages (titled “N-gram”, “Regex”, etc.) which can be switched between using the links at the top of the page. Each page corresponds to one of the tools described below, except for the Help page, which explains the basic usage and options for each of the tools. These include tools for textual analysis as well as simple data visualization.

The textual analysis tools are designed to operate on textual data which can either be read in directly from ctext.org via API, or copied into the tool from elsewhere. If you open the tool by using the ctext.org plugin, that text will be automatically loaded and displayed. To load additional texts from ctext, copy the URN for the text (or chapter) into the box labeled “Fetch text by URN” in the Text Tools window, and click “Fetch”. When the text has loaded, its contents will be displayed along with its title. To add more texts, click “Save/add another text”, then repeat the procedure. The list of currently selected texts is displayed at the top of the window.

N-grams

“N-grams” are sequences of n consecutive textual items, where n is some fixed integer (e.g. n=1, n=3, etc.). The “textual items” are usually either terms (words) or characters; for Chinese in particular, characters are frequently used rather than words because of the difficulty of automatically segmenting Chinese text into a sequence of separate words. For instance, the sentence “學而時習之不亦說乎” contains the following character 3-grams (i.e. sequences of exactly three consecutive characters): “學而時”, “而時習”, “時習之”, “習之不”, “之不亦”, “不亦說”, “亦說乎”.
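
The computation itself takes only a few lines of code; a minimal sketch in Python reproducing the example above:

    def char_ngrams(text, n):
        """Return all character n-grams of text, in order of occurrence."""
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    print(char_ngrams("學而時習之不亦說乎", 3))
    # ['學而時', '而時習', '時習之', '習之不', '之不亦', '不亦說', '亦說乎']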

The Text Tools “N-gram” function can be used to give a simple overview of various types of word usage in Chinese texts by means of character n-grams. The simplest cases of n-grams are 1-grams, which are simply character occurrence counts or frequencies.

Exercise:
  • Try computing 1-grams for two or three texts from ctext – you will need to set “Value of n” to 1 to do this. To better visualize the trends, use the “Chart” link to plot a bar chart of the raw data. Try this with and without normalization.
  • Repeat with 2- and 3-grams.
  • If you chose texts which ought to be broadly comparable in length, try repeating with two texts of vastly different lengths and/or styles (e.g. 道德經 and 紅樓夢) with and without normalization to demonstrate how this alters the results.

Word clouds are another type of visualization that can be made with this type of data, in which labels are drawn at sizes proportional to their frequency of occurrence (or, more usually, to the log of their frequency). Typically word clouds are created from a single text or merged corpus, using either characters or words; however, the same principles extend naturally to n-grams (and regular expressions) generally, as well as to multiple texts. In Text Tools, visualizing data for multiple texts causes the data for each distinct text to be displayed in a different color. Similar comments apply regarding normalization: if counts for different texts are not normalized according to length, longer texts will naturally tend to have larger labels.
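
The effect of the log scale is easy to see numerically. In the following sketch (with invented counts and arbitrary scaling constants), a label occurring 100 times as often as another is drawn only 3 times larger under log scaling, rather than 100 times larger:

    import math

    # Hypothetical frequency counts for three labels (illustrative only).
    counts = {"之": 1000, "君子": 100, "小人": 10}
    for label, count in counts.items():
        linear_size = count                 # size proportional to raw frequency
        log_size = 10 * math.log10(count)   # size proportional to log frequency
        print(label, count, linear_size, log_size)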

Exercise:
  • Create word clouds for a single text, and for two or more texts. Experiment with the “Use log scale” setting in the Word cloud tab – it should quickly become clear why a log scale is usually used for word clouds.

Textual similarity

The Similarity tool uses n-gram shingling to identify and visualize text reuse relationships. To use it, first load one or more texts, select any desired options, and click “Run”.

This tool identifies n-grams shared between parts of the specified texts: rather than reporting all n-grams (as the N-gram tool does), it reports only those n-grams which are repeated in more than one place, and calculates the total number of shared n-grams between each pair of chapters. Thus, unlike with the N-gram tool (when its “minimum count” option is set to 1), larger values of n result in fewer results being reported: shorter n-grams are more likely to occur in multiple places, while longer ones are less common, as well as more strongly indicative of a text reuse relationship existing between the items being compared.

There are two tabs within the output for the similarity tool: the “Matched text” tab shows the n-grams which matched, with brighter shades of red corresponding to greater numbers of overlapping n-grams; the “Chapter summary” tab aggregates the counts of matched n-grams between all pairs of chapters.
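
The core idea of n-gram shingling can be sketched in a few lines of Python: compute the set of n-grams occurring in each passage, then intersect the sets. The example passages below are from the opening of the Analects, with punctuation already removed (which a full implementation would handle explicitly):

    def char_ngrams(text, n):
        """Return the set of distinct character n-grams of text."""
        return {text[i:i + n] for i in range(len(text) - n + 1)}

    def shared_ngrams(a, b, n=5):
        """Return the n-grams occurring in both passages."""
        return char_ngrams(a, n) & char_ngrams(b, n)

    a = "學而時習之不亦說乎有朋自遠方來不亦樂乎"
    b = "有朋自遠方來不亦樂乎人不知而不慍不亦君子乎"
    matches = shared_ngrams(a, b)
    print(len(matches), sorted(matches))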

Exercise:
  • Run the similarity tool on the Analects with n=5.
  • Experiment with the “Constraint” function by clicking on chapter titles to limit the display to passages having parallels with the specified chapter or pair of chapters.
  • Select a few of the matched n-grams by clicking on them; this will result in a different type of constraint showing where exactly that n-gram was matched.
  • Text reuse can be visualized as a weighted network graph. You can do this for your n-gram similarity results by clicking the “Create graph” link in the “Chapter summary” tab, then clicking “Draw”.
  • Which chapters of the Analects have the strongest text reuse relationship according to this metric? You can probably see this straight away from the graph, however you can also check this numerically by returning to the Chapter summary tab, and sorting the table by similarity – clicking on the titles of columns in any Text Tools table sorts it by that column (click a second time to toggle sort order).
  • Returning to the graph (you can click the “Network” link at the top of the page to switch pages), the edges of the graph have a concrete meaning defined in terms of identified similarities. Double-clicking on an edge will reopen the Similarity tool, with the specific similarities underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges.
  • Experiment with increasing and decreasing the value of n – how does this affect the results?
  • By default, the graph contains edges representing every similarity identified. Particularly for smaller values of n, some of these relationships will not be significant, and this may result in edges being drawn between almost all pairs of nodes in the graph, complicating the picture and obscuring genuine patterns. Experiment with setting a threshold (e.g. 0.001) for the “Skip edges with weight less than” setting – this simplifies the graph by removing those edges with relatively small amounts of reuse. Compare this with the results of increasing the value of n in the similarity tool, which also decreases the number of edges as more trivial similarities are excluded.
  • The Similarity tool also works with multiple texts; if multiple texts are loaded and a graph is created, different colors will be used to distinguish between chapters of different texts. Try this with the Xunzi and the Zhuangzi, two very dissimilar texts which nonetheless do have reuse relationships with one another (this may take a few seconds to run – the similarity tool takes longer for larger amounts of text).

Regular expressions

A regular expression (often shortened to “regex”) is a pattern which can be searched for in a body of text. In the simplest case, a regular expression is simply a string of characters to search for; however by supplementing this simple idea with specially designed syntax, it is possible to express much more complex ways of searching for data.

The regex tool makes it possible to search within one or more texts for one or more regular expressions, listing matched text as well as aggregating counts of results per-text, per-chapter, or per-paragraph.

Exercise:
  • The simplest type of regular expression is simply a character string search – i.e. a list of characters in order which will match (only) that precise sequence of characters – one type of full-text search. Try searching the text of the Analects for something you would expect to appear in it (e.g. “君子”).
  • Examine the contents of the “Matched text” and “Summary” tabs.
  • Add a second search phrase (e.g. “小人”) to your search, and re-run the regex.
  • Re-run the same search again using the same two regular expressions, but changing “Group rows by” from the default “None” to “Paragraph”. When you do this, the “Summary” tab will show one row for every passage in the Analects. Try clicking on a numbered paragraph (these numbers are chosen automatically starting from the beginning of the text) – this will highlight the passage corresponding to that row.

Search results like these can be relational when grouped by a unit such as a paragraph or chapter: if two terms appear together in the same paragraph (or chapter), this can indicate some relationship between the two; if they repeatedly occur together in many paragraphs, this may indicate a stronger relationship between the two in that text. It is thus possible to use a network graph to visualize this information; you can do this in Text Tools by running regular expressions and setting “Group rows by” to “Paragraph”.
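
The underlying computation is straightforward to sketch in Python: determine which terms match in each paragraph, then count each co-occurring pair – pair counts become edge weights in the graph. The two example passages below are from the Analects:

    import re
    from collections import Counter
    from itertools import combinations

    terms = ["君子", "小人", "禮", "樂"]
    paragraphs = [
        "子曰:君子周而不比,小人比而不周。",
        "子曰:人而不仁,如禮何?人而不仁,如樂何?",
    ]
    edges = Counter()
    for para in paragraphs:
        present = [t for t in terms if re.search(t, para)]
        for pair in combinations(sorted(present), 2):
            edges[pair] += 1  # edge weight = number of co-occurring paragraphs
    print(edges)  # Counter({('君子', '小人'): 1, ('樂', '禮'): 1})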

Exercise:
  • Search for the terms 父母, 君子, 小人, 禮, and 樂 in the Analects, and construct a network graph based on their co-occurrence in the same paragraphs of text.
  • Double-clicking on an edge in this graph will reopen the Regex tool, with the specific matches underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges, to see what data they actually represent.
  • Using the same method but specifying a list of character names (寶玉, 黛玉, 寶釵, etc. – you can get a list of more names from Wikipedia), map out how character names co-occur in paragraphs of the Hongloumeng. Note: you will need to make sure that you choose names frequently used in the actual text (e.g. “賈寶玉” is only infrequently used; “寶玉” is far more common – and will also match cases of “賈寶玉”). This is one example of social network analysis.
  • When you set “Group rows by” to “None”, you can temporarily add constraints to the “Matched text” view to show only those paragraphs which matched a particular search string. You can set or remove a constraint by clicking on a matched string in the “Matched text” view; you can also click the text label of an item in the “Summary” view to set that item as the constraint, and so see at a glance which paragraphs contained that particular string. Re-run your search with the same terms but in “None” mode, and use this to quickly see which passages the least-frequently occurring name from your list appeared in.

[A word of caution: when performing this type of search, it is important to examine the matched text to confirm whether “too much” may be matched, as well as whether other things may be missed. In the Hongloumeng example above, for instance, although the vast majority of string matches for “寶玉” in the text do indeed refer to 賈寶玉, another character called “甄寶玉” appears later in the novel – occurrences of his name will also match a simple search for the string “寶玉”. In this particular example, such cases can be excluded by constructing a suitable regular expression – such as “(?<!甄)寶玉”, which uses a negative lookbehind to match the string “寶玉” only when it does not come immediately after a “甄”.]

So far we have only used the simplest type of regular expressions. Regular expressions also allow for the specification of more complex patterns to be matched in the same way as the simple string searches we have just done – for example, the ability to specify a search for a pattern like “以[something]為[something]”, which would match things like “以和為量”, “以生為本”, or “以我為隱”. In order to do this, regular expressions are created by building on any fixed characters we want to match with the addition of “special” characters that describe the patterns we are looking for.

Some of the most useful types of special syntax available in regular expressions are summarized in the following table:

. Matches any one character exactly once
[abcdef] Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef] Matches any one character other than a,b,c,d,e,f
(xyz) Matches xyz, and saves the result as a numbered group.
? After a character/group, makes that character/group optional (i.e. match zero or 1 times)
? After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
* After a character/group, makes that character/group match zero or more times
+ After a character/group, makes that character/group match one or more times
{2,5} After a character/group, makes that character/group match 2,3,4, or 5 times
{2,} After a character/group, makes that character/group match 2 or more times
{2} After a character/group, makes that character/group match exactly 2 times
\3 Matches whatever was matched into group number 3 (first group from left is numbered 1)

The syntax may seem complex, but it is quite easy to get started with. For instance, the first special syntax listed in the table above – a dot (“.”) – matches any one character. So the example above of “以[something]為[something]” can be expressed as the regular expression “以.為.”, read as “match the character ‘以’, followed by any one character, followed by ‘為’, followed by any one character”.
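
This pattern can also be tried directly in Python, whose re module shares this basic syntax; the example string below simply joins the three matches quoted above:

    import re

    text = "以和為量。以生為本。以我為隱。"
    print(re.findall("以.為.", text))  # ['以和為量', '以生為本', '以我為隱']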

Exercise:
  • Try the regex “以.為.” from the example above in the Zhuangzi, using “Group by None”.
  • In the results of this regex search, you will notice that some matches may not correspond to exactly the type of expression we are really looking for. For example, the above regex will also match “以汝為?”, because punctuation characters are also counted as “characters” when matching regular expressions. One way to exclude these matches from the results is to use a negative character class (which matches everything except a specified list of characters) in the regex instead of the “.” operator (which simply matches any character). A corresponding regex for this example is “以[^。?]為[^。?]” – try this and confirm that it excludes these cases.
  • Because there are many possible punctuation characters, within Text Tools you can also use the shorthand “\W” (upper-case) to stand for any commonly used Chinese punctuation character, and “\w” (lower-case) for any character other than commonly used Chinese punctuation. You should get the same result if you try the previous regex written instead as “以\w為\w”. (Although this is a common convention for English regexes, “\w” and “\W” work slightly differently in different regex implementations and many do not support this for Chinese).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Any four characters where the middle is “之不” – i.e. “視之不見”, “聽之不聞”, etc.

Repetition

Repetition can be accomplished using various repetition operators and modifiers listed in the table above.

  • We can ask that any part of our regular expression be repeated some number of times using the “{a,b}” operator. This modifies the immediately preceding item in the regex (e.g. a specification of a character, or a group), requiring it to be repeated at least a times and at most b times (or any number of times, if b is left blank). If we omit the comma and just write “{a}”, this means that the preceding item must be repeated exactly a times.
  • For example, “仁.{0,10}義” will match the character “仁”, followed by anything from 0 to 10 other characters, followed by the character “義” – it will therefore match things like “仁義”, “仁為之而無以為;上義”, “仁,託宿於義”, etc.
  • The same method works with groups, and requires that the pattern specified by the group (not its contents) be repeated the specified number of times. So for instance “(人.){2,}” will match “人來人往”, “人前人後”, and also “人做人配人疼”.
  • The “+”, “*”, and “?” operators work in exactly the same way as this after a character or group: “+” is equivalent to “{1,}”, “*” to “{0,}”, and “?” to “{0,1}”. (They are, however, frequently used because they are shorter to write.)
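
Both kinds of repetition can be verified outside Text Tools using Python's re module; note that re.findall reports captured group contents rather than whole matches, hence the non-capturing group “(?:…)” in the second pattern:

    import re

    print(re.findall("仁.{0,10}義", "仁,託宿於義"))       # ['仁,託宿於義']
    print(re.findall("(?:人.){2,}", "人來人往,人前人後"))  # ['人來人往', '人前人後']
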
Exercise:
  • Try the two specific examples described above (i.e. “仁.{0,10}義” and “(人.){2,}”).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Each “phrase” (i.e. punctuated section) of text. In other words, the first match should be “道可道”, the second should be “非常道”, and so on.
    • Match each phrase which contains the term “之” in it.
    • Match each phrase which contains the term “之” in it, but neither as the first character nor as the last.
  • Write and test regular expressions to match the following in the Mozi (ctp:mozi):
    • Any occurrences of the character “君” followed anywhere later in the same sentence by “父” (e.g. “君父”, “…君臣父…”, “君臣上下長幼之節,父…”, etc.).

Groups

Aside from repetition, a lot of the power of regular expressions comes from the ability to divide parts of a match into what are called “groups”, and express further conditions using the matched contents of these groups. This makes it possible to express much more sophisticated patterns.

  • Suppose we want to look for expressions like “君不君”, “臣不臣”, “父不父”, etc. – cases where we have some character, followed by a “不”, then followed by that same character from before (i.e. we aren’t trying to match things like “人不知”).
  • We can do this by “capturing” the first character – whatever it may be – in a group, and then requiring later in our expression that we match the contents of that group again in another place.
  • Capturing something in a group is accomplished by putting parentheses around the part to capture – e.g. “(.)” matches any character and captures it in a group.
  • Groups are automatically numbered starting from 1, beginning with the leftmost opening bracket, and moving through our regex from left to right.
  • We can reference the contents of a matched group using the syntax “\1” to match group 1, “\2” to match group 2, etc.
  • So in our example, “(.).\1” matches any character, followed by any character, followed by the first character again (whatever it was). Try this on the text of the Analects, then try modifying the regex so that it only matches non-punctuation characters (i.e. does not match things like “本,本”).

Another example is a common type of patterned repetition such as “禮云禮云” and “已乎已乎”. In this case, we can use exactly the same approach. One way is to write “(..)\1” – match any two characters, then match those same two characters again; another (equivalent) way is to use two separate groups and write “(.)(.)\1\2” – match any character X, then any character Y, then match X again and then Y again.
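
These backreference patterns work unchanged in Python; since re.findall reports group contents rather than whole matches, re.search is used below to display a complete match:

    import re

    text = "君不君,臣不臣,禮云禮云"
    print(re.findall(r"(.)不\1", text))        # ['君', '臣'] – the captured groups
    print(re.search(r"(..)\1", text).group())  # '禮云禮云'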

Exercise:
  • Write and test a regular expression which matches things like “委委佗佗”, “戰戰兢兢”, etc. in the Book of Poetry (ctp:book-of-poetry).
  • Write and test a regular expression which matches complex repetition of the style “XYZ,ZYX” in the Zhuangzi, where each of X, Y, and Z can be 1-5 characters long. Your regex should match things like “知者不言,言者不知”, “道無以興乎世,世無以興乎道”, and “安其所不安,不安其所安”.

Regex replace

The replace function works in a similar way to the regex search function: this function searches within one specified text for a specified regular expression, and replaces all occurrences of it with a specified value. Although the replacement can be a simple string of characters, it can also be designed to vary depending upon the contents of the regular expression. Specifically, anything that has been matched as a group within the search regex can be referenced in the replacement by using the syntax “$1” to include the text matched in group 1, “$2” for group 2, etc. One common use case for regex replacements is to “tidy up” data obtained from some external source, or to prepare it for use in some particular procedure.

For example:

  • Replacing “\W” with “” (an empty string) will delete all punctuation and line breaks from a text.
  • Replacing “^(\w{1,20})$” with “*$1” will add title markers to any lines which contain between 1 and 20 characters, none of which are punctuation characters – this can be useful when importing non-ctext texts.
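
Comparable replacements can be sketched in Python; note that Python's re.sub writes group references in the replacement as \1 (or \g<1>) where Text Tools uses $1, and that Python's “\w” and “\W” do not treat Chinese punctuation specially, so an explicit character class stands in for it here:

    import re

    text = "道可道,非常道。\n名可名,非常名。"
    # Delete the listed punctuation characters and line breaks:
    print(re.sub("[,。\n]", "", text))  # 道可道非常道名可名非常名
    # Add a title marker to any line of 1-20 characters:
    print(re.sub(r"^(.{1,20})$", r"*\1", "學而第一", flags=re.MULTILINE))  # *學而第一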

Identifying differences between versions

The “Diff” tool provides a simple way of performing a character-by-character diff of two similar pieces of text. Unlike the Similarity tool, this tool works best on input texts which are almost (but not quite) identical to one another.

Try using the Diff tool to compare the contents of the 正統道藏 edition of the 太上靈寶天尊說禳災度厄經 (ctp:wb882781) with the 重刊道藏輯要 edition of the same text (ctp:wb524325).
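
The same kind of character-by-character comparison can be sketched using Python's standard difflib module (the variant strings below are invented for illustration):

    import difflib

    a = "太上靈寶天尊說禳災度厄經"
    b = "太上靈寶天尊說禳災度厄真經"
    # Report each run of equal, inserted, deleted, or replaced characters.
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
        print(op, a[i1:i2] or "-", "->", b[j1:j2] or "-")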

Network graphs

When you create a graph using the regular expression or similarity tools, the data is exported into the Network tab. For navigation instructions, refer to the “Help” tab. Graphs in the network tab can be entered in a subset of the “GraphViz” format; the graphs created by the other tabs can all be downloaded in this same format. If you would like a more flexible way of creating publication quality graphs, you can download and install Gephi (https://gephi.org/), which is also able to open these files.
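
A graph in this format is just a small text file; the sketch below writes a minimal weighted graph in standard GraphViz DOT syntax (the chapter names and weights are invented for illustration, and the exact subset accepted by the Network tab should be checked against its Help tab), which Gephi can then open:

    # Write a minimal weighted graph in GraphViz DOT format.
    dot = """graph reuse {
      "學而" -- "為政" [weight=12];
      "學而" -- "八佾" [weight=3];
    }"""
    with open("reuse.gv", "w", encoding="utf-8") as f:
        f.write(dot)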

Using other texts

Chinese texts from other sources besides ctext.org can be used with Text Tools. For instructions on how to prepare these, refer to the section on Loading texts on the Help page.
