SUTD Workshop

Materials from a workshop held as part of Working with different kinds of ‘text’ in the Digital Humanities at the Singapore University of Technology and Design.


We will follow parts of the “Practical introduction to” and “Text Tools for” tutorials, with a few changes so that we can use English-language texts as well as Chinese ones, and we will also try a few new features of the beta version that are not yet covered in the tutorials.

Link to Text Tools (beta version):

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
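
To make the difference concrete, here is a minimal Python sketch of the two tokenization modes. It is only an illustration: the function name `tokenize` and its `by_character` parameter are hypothetical, and the actual behaviour of Text Tools may differ in its details (for example in how punctuation is handled).

```python
def tokenize(text, by_character):
    """Split text into tokens, either one per character or on whitespace."""
    if by_character:
        # Every non-whitespace character becomes its own token
        # (suitable for Chinese, which has no spaces between words).
        return [ch for ch in text if not ch.isspace()]
    # Otherwise split on runs of whitespace (suitable for English).
    return text.split()


chinese_tokens = tokenize("學而時習之", by_character=True)
english_tokens = tokenize("the wonderful wizard of oz", by_character=False)

print(chinese_tokens)
print(english_tokens)
```

With character tokenization switched off, an unpunctuated Chinese passage would come out as one long token, which is why the setting matters.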

Other suggested examples

As well as the examples shown in the tutorials:

  • Try some regexes on the English examples. A useful expression is likely to be “\w+”, which matches any run of word characters (letters, digits and the underscore, so neither punctuation nor whitespace); intuitively, a word. As an example, try “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。