Materials from a workshop held as part of Working with different kinds of ‘text’ in the Digital Humanities at the Singapore University of Technology and Design.
Setup
- Create a free account on ctext.org and log in.
- Validate your e-mail address by opening the link the system sent you. Until you do, the link above will display a red warning/reminder to do so.
- Enter the API key in the box labeled “API key”, and click “Save”.
- Download the following English text files (e.g. right-click each link and choose “Save as”):
We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials, with a few changes: we will use English-language texts as well as Chinese ones, and try a few new features of the beta version that are not yet covered in the tutorials.
Link to Text Tools (beta version): http://ctext.org/plugins/texttoolsbeta/#help
Important note/reminder: For tools that offer the “Tokenize by character” option, we will set it to “On” for the Chinese materials and “Off” for the English ones.
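To see why the setting matters, here is a minimal Python sketch of the two tokenization modes. It only illustrates the idea and is not how Text Tools itself is implemented; the function name `tokenize` and the sample strings are chosen for this example.

```python
import re

def tokenize(text, by_character):
    """Return a list of tokens, dropping whitespace and punctuation.

    Illustration only: Text Tools may differ in its details.
    """
    if by_character:
        # Each word character becomes its own token (suits Chinese text).
        return re.findall(r"\w", text)
    # Each run of word characters becomes one token (suits English text).
    return re.findall(r"\w+", text)

chinese = "學而時習之,不亦說乎"
english = "The Wizard of Oz"

print(tokenize(chinese, by_character=True))
print(tokenize(english, by_character=False))
# With the option "Off", unspaced Chinese text collapses into long tokens:
print(tokenize(chinese, by_character=False))
```

The last call returns whole clauses as single tokens, because Chinese is written without spaces between words. That is the reason for switching the option “On” for the Chinese materials.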
Other suggested examples
In addition to the examples shown in the tutorials:
- Try some regexes on the English examples. A useful expression is “\w+”, which matches any run of one or more word characters (letters, digits and underscore, with no spaces or punctuation), so intuitively a word. As an example, try “the \w+”.
- Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
- Try tokenizing one or more modern Chinese documents [example].
- Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
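If you would like to experiment with the regexes from these exercises outside the browser, the following Python sketch mimics three of the steps: matching “the \w+”, counting matches per regex to form a vector, and extracting groups. It is not part of Text Tools. The sample sentences and names are invented for illustration (they are not taken from the Wizard of Oz texts or the 宋史), the three word patterns are stand-ins for the contents of “English_wordlist.txt”, and the biography pattern is adapted to accept either an ASCII or a full-width comma.

```python
import re

# 1. "the \w+": "the" followed by one word.
english = "the cat saw the dog and the bird"
print(re.findall(r"the \w+", english))
# ['the cat', 'the dog', 'the bird']

# 2. One count per regex gives a vector describing a text.
def regex_vector(text, patterns):
    """Count the matches of each pattern in text."""
    return [len(re.findall(p, text)) for p in patterns]

# Hypothetical stand-ins for the wordlist; \b keeps "the" from
# matching inside longer words such as "other".
patterns = [r"\bthe\b", r"\band\b", r"\bcat\b"]
print(regex_vector("the cat and the dog", patterns))
# [2, 1, 1]

# 3. Group extraction: name, courtesy name (字), place of origin.
biography = re.compile(r"(\w+)[,,]字(\w+)[,,](\w+)人。")
sample = "王某,字子明,開封人。李某,字伯言,太原人。"  # invented sample
for name, courtesy_name, place in biography.findall(sample):
    print(name, courtesy_name, place)
```

Each tuple returned by `findall` corresponds to one row of the table that the regex tool produces when “Extract groups” is checked. Vectors like the one in step 2, computed for each text, are the kind of input that PCA then works on.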