Materials from a workshop held as part of the MARAAS Conference: Asian Studies in the Digital Age at Dickinson College, Carlisle, PA. [Download slides]
Setup
- Create a free account on ctext.org and log in.
- Make sure to validate your e-mail address by opening the link the system sent you. Until you do, the link above will display a red warning reminding you to do so.
- Enter the API key in the box labeled “API key”, and click “Save”.
We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials with a few changes and a few new features not yet included in the tutorials.
Link to Text Tools: http://ctext.org/plugins/texttools/#help
Important note/reminder: For tools that have this option, we will set “Tokenize by character” to “On” for the Chinese materials and “Off” for the English ones.
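To make the difference between the two settings concrete, here is a minimal Python sketch. It is only an illustration of the idea, not how Text Tools itself is implemented: the function names are hypothetical, and it assumes that punctuation is dropped and that English words are simply runs of word characters.

```python
import re


def tokenize_by_character(text):
    """Treat every word character as its own token (suits unsegmented Chinese)."""
    # In Python 3, \w matches Unicode word characters, including Chinese
    # characters, but not punctuation such as "，" or "？".
    return re.findall(r"\w", text)


def tokenize_by_word(text):
    """Treat each run of word characters as one token (suits English)."""
    return re.findall(r"\w+", text)


chinese = "學而時習之，不亦說乎？"
english = "The quick brown fox."

print(tokenize_by_character(chinese))  # one token per character
print(tokenize_by_word(english))       # one token per word
print(tokenize_by_word(chinese))       # whole clauses become single "words"
```

The last line shows why the setting matters: without character tokenization, a Chinese clause with no spaces is treated as one long token, so word-level counts and comparisons stop being meaningful.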
Overviews of functionality
The following give some basic illustrations of what can be done in Text Tools through concrete examples:
- Exploring text reuse with Text Tools for ctext.org
- Regular expressions with Text Tools for ctext.org
Other suggested examples
In addition to the examples shown in the tutorials:
- To see how the tool works with tokenized materials, download the following English text files (e.g. right-click each link and choose “Save as”):
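- Try some regexes on the English examples. A useful expression is likely to be “\w+”, which matches one or more consecutive word characters (letters, digits, or underscores; intuitively, a word). Unlike a pattern for “any non-punctuation character”, it does not match spaces. Try as an example “the \w+”.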
- Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
- Try tokenizing one or more modern Chinese documents [example].
- Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
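The two regex exercises above can also be tried outside the tool. The following is a minimal sketch using Python's built-in re module, whose regex dialect may differ in details from the one Text Tools uses. The Chinese sample line is an illustrative sentence written to fit the example regex, with ASCII commas as in the regex given above; the punctuation in the actual ctext.org text may differ.

```python
import re

# Fragment of the opening sentence of "The Wonderful Wizard of Oz".
english = "Dorothy lived in the midst of the great Kansas prairies"

# "the \w+" finds "the" followed by a space and one further word.
the_phrases = re.findall(r"the \w+", english)
print(the_phrases)

# Illustrative line shaped like a biography opening
# (assumption: ASCII commas, matching the example regex).
biography = "王安石,字介甫,撫州臨川人。"
pattern = re.compile(r"(\w+),字(\w+),(\w+)人。")

# Each match yields one row: (name, courtesy name, place of origin).
# This mirrors what "Extract groups" produces, one column per group.
rows = [match.groups() for match in pattern.finditer(biography)]
for name, courtesy_name, place in rows:
    print(name, courtesy_name, place)
```

Because “\w+” is greedy, the third group first swallows the final 人 and then gives it back so that the literal “人。” can match, which is why the place name comes out without the trailing 人.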
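The vector-generation step in the Wizard of Oz exercise above amounts to counting, for each document, how often each regex in the list matches. Below is a rough Python sketch of that idea under stated assumptions: the three patterns are hypothetical stand-ins for the contents of “English_wordlist.txt”, the two short documents are made-up sentences rather than the actual stories, and the function name is invented for illustration. PCA itself is not shown; it would take vectors like these as input.

```python
import re


def regex_count_vector(text, patterns, lowercase=True):
    """Return one count per pattern: how many times it matches in the text."""
    if lowercase:
        # Lowercasing first means "The" and "the" are counted together.
        text = text.lower()
    return [len(re.findall(pattern, text)) for pattern in patterns]


# Hypothetical stand-ins for a word list; \b keeps "the" from matching
# inside longer words such as "then".
patterns = [r"\bthe\b", r"\band\b", r"\bof\b"]

# Made-up sample sentences, not text from the stories.
documents = {
    "doc_a": "The lion and the girl walked to the city of emeralds.",
    "doc_b": "A tale of magic and of wonder.",
}

vectors = {
    name: regex_count_vector(text, patterns)
    for name, text in documents.items()
}
for name, vector in vectors.items():
    print(name, vector)
```

With lowercasing turned off, the capitalized “The” at the start of the first sentence would not be counted, which is one reason the exercise suggests trying the preprocessing options.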
Related research
- Digital Approaches to Text Reuse in the Early Chinese Corpus, Journal of Chinese Literature and Culture 5(2) (2018)
- Unsupervised Identification of Text Reuse in Early Chinese Literature, Digital Scholarship in the Humanities (2018)
- Chinese Text Project: a dynamic digital library of premodern Chinese, Digital Scholarship in the Humanities (2019)