MARAAS workshop

Materials from a workshop held as part of the MARAAS Conference: Asian Studies in the Digital Age at Dickinson College, Carlisle, PA. [Download slides]

Setup

  • Create a free account on ctext.org and log in.
  • Validate your e-mail address by opening the link the system sent you (until you do, the link above will display a red warning reminding you to do so).
  • Enter the API key in the box labeled “API key”, and click “Save”.

We will follow parts of the “Practical introduction to ctext.org” and “Text Tools for ctext.org” tutorials with a few changes and a few new features not yet included in the tutorials.

Link to Text Tools: http://ctext.org/plugins/texttools/#help

Important note/reminder: For tools which have this option, we will use “Tokenize by character” set to “On” for the Chinese materials, and “Off” for the English ones.
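To illustrate why this setting matters, here is a minimal Python sketch of the two tokenization modes. It is not the Text Tools implementation: the `tokenize` function and its `by_character` parameter are hypothetical names used only for this example, and the sample strings are made up.

```python
import re

def tokenize(text, by_character):
    """Split text into tokens (illustrative only, not the Text Tools code).

    by_character=True  -> each word character becomes its own token,
                          which suits Chinese text written without spaces.
    by_character=False -> each run of word characters becomes one token,
                          which suits English text separated by spaces.
    """
    if by_character:
        return re.findall(r"\w", text)
    return re.findall(r"\w+", text)

chinese = "學而時習之"
english = "The road was smooth."

print(tokenize(chinese, by_character=True))
print(tokenize(english, by_character=False))
# With the setting "Off", an unspaced Chinese string stays one long token:
print(tokenize(chinese, by_character=False))
```

With the setting off, the whole Chinese string counts as a single token, so per-token counts and comparisons would not be meaningful for it.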

Overviews of functionality

The following give some basic illustrations of what can be done in Text Tools through concrete examples:

Other suggested examples

In addition to the examples shown in the tutorials:

  • To see how the tool works with tokenized materials, download the following English text files (e.g. right-click each link and choose “Save as”):
  • Try some regexes on the English examples. A useful expression is “\w+”, which matches any sequence of word characters (letters and digits, with no spaces or punctuation; intuitively, a word). As an example, try “the \w+”.
  • Using the “English_wordlist.txt” file as a list of regexes (just paste the contents of the file into the “Regex” box), generate vectors for the two Wizard of Oz stories. Run PCA on the results – you should see interesting differences between the two. Also try preprocessing the data by tokenizing and lowercasing the texts.
  • Try tokenizing one or more modern Chinese documents [example].
  • Using the regex tool with “Group rows by” set to “None” and “Extract groups” checked, try extracting data about biographies in the 宋史. You may want to start by using a small part of the text, e.g. ctp:ws55241. Example regex: (\w+),字(\w+),(\w+)人。
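The regex exercises above can also be tried outside Text Tools. The following is a rough Python sketch, assuming Python's `re` module behaves closely enough to the tool's regex engine for these patterns. The sample strings, the three-entry pattern list, and the `regex_vector` helper are all made up for illustration; they are not the workshop files or the contents of “English_wordlist.txt”. The Chinese sample uses the same comma character as the regex as written here, which may differ from the punctuation in the actual text.

```python
import re

# Made-up English sample (not one of the workshop text files).
english = "the lion and the scarecrow followed the road"

# "the \w+": the word "the" followed by one more word.
print(re.findall(r"the \w+", english))

def regex_vector(text, patterns):
    """Count the matches of each regex in text: one vector entry per regex."""
    return [len(re.findall(p, text)) for p in patterns]

# Hypothetical stand-in for a word list used as a list of regexes.
patterns = [r"\bthe\b", r"\band\b", r"\broad\b"]
vector = regex_vector(english, patterns)
print(vector)

# Group extraction: each parenthesized group becomes one column of a row.
bio_regex = re.compile(r"(\w+),字(\w+),(\w+)人。")

# Sample entries written in the style of a biography opening (illustrative).
sample = "王安石,字介甫,撫州臨川人。蘇軾,字子瞻,眉州眉山人。"

rows = bio_regex.findall(sample)
for name, courtesy_name, place in rows:
    print(name, courtesy_name, place)
```

Each tuple in `rows` corresponds to one match, with the name, the courtesy name (字), and the place of origin as separate fields. Vectors like `vector`, computed for several texts, are the kind of input on which PCA can then be run.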

Related research
