Text Tools for ctext.org

This tutorial introduces some of the main functionality of the “Text Tools” plugin for the Chinese Text Project database and digital library along with suggested example tasks and use cases.

[Online version of this tutorial: https://dsturgeon.net/texttools (English); https://dsturgeon.net/texttools-ja (Japanese)]

Initial setup

  • If you haven’t used the Chinese Text Project before, please refer to the tutorial “Practical introduction to ctext.org” for details of how to create a ctext.org account and install a plugin.
  • Make sure you are logged in to your ctext.org account.
  • If you have an API key, save it into your ctext.org account using the settings page. Alternatively, if your institution subscribes to ctext.org and you are not using a computer on your university’s local network, follow your university’s instructions to connect to its VPN.
  • Install the “Text Tools” plugin (installation link) – you only need to do this once.
  • Once these steps have been completed, when you open a text or chapter of text on ctext.org, you should see a link to the Text Tools plugin.

Getting started

The Text Tools program has a number of different pages (titled “N-gram”, “Regex”, etc.) which can be switched between using the links at the top of the page. Each page corresponds to one of the tools described below, except for the Help page, which explains the basic usage and options for each of the tools. These include tools for textual analysis as well as simple data visualization.

The textual analysis tools are designed to operate on textual data which can either be read in directly from ctext.org via API, or copied into the tool from elsewhere. If you open the tool by using the ctext.org plugin, that text will be automatically loaded and displayed. To load additional texts from ctext, copy the URN for the text (or chapter) into the box labeled “Fetch text by URN” in the Text Tools window, and click “Fetch”. When the text has loaded, its contents will be displayed along with its title. To add more texts, click “Save/add another text”, then repeat the procedure. The list of currently selected texts is displayed at the top of the window.

N-grams

“N-grams” are sequences of n consecutive textual items, where n is some fixed integer (e.g. n=1, n=3, etc.). The “textual items” are usually either terms (words) or characters; for Chinese in particular, characters are frequently used rather than words because of the difficulty of accurately segmenting Chinese text into a sequence of separate words automatically. For instance, the sentence “學而時習之不亦說乎” contains the following character 3-grams (i.e. unique sequences of exactly three characters): “學而時”, “而時習”, “時習之”, “習之不”, “之不亦”, “不亦說”, “亦說乎”.

The Text Tools “N-gram” function can be used to give a simple overview of various types of word usage in Chinese texts by means of character n-grams. The simplest cases of n-grams are 1-grams, which are simply character occurrence counts or frequencies.
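The extraction step itself is easy to sketch in Python with a sliding window; the following is a minimal illustration (the function name is my own, and the sentence is the Analects line quoted above), not a description of Text Tools’ internals:

```python
from collections import Counter

def char_ngrams(text, n):
    """Count every sequence of n consecutive characters in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# The example sentence from the Analects used above
trigrams = char_ngrams("學而時習之不亦說乎", 3)
print(list(trigrams))
```

Setting n=1 reduces this to simple character counts, which is exactly the 1-gram case described above.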

Exercise:
  • Try computing 1-grams for two or three texts from ctext – you will need to set “Value of n” to 1 to do this. To better visualize the trends, use the “Chart” link to plot a bar chart of the raw data. Try this with and without normalization.
  • Repeat with 2- and 3-grams.
  • If you chose texts of broadly comparable length above, try repeating the exercise with two texts of vastly different lengths and/or styles (e.g. 道德經 and 紅樓夢), with and without normalization, to see how this alters the results.

Word clouds are another type of visualization that can be made with this type of data, in which labels are drawn in different sizes proportional to their frequency of occurrence (or, more usually, to the log of their frequency). Typically word clouds are created from a single text or merged corpus, using either characters or words; however, the same principles extend naturally to n-grams (and regular expressions) generally, as well as to multiple texts. In Text Tools, visualizing data for multiple texts causes the data for each distinct text to be displayed in a different color. Similar comments apply regarding normalization: if counts for different texts are not normalized according to length, longer texts will naturally tend to have larger labels.

Exercise:
  • Create word clouds for a single text, and for two or more texts. Experiment with the “Use log scale” setting in the Word cloud tab – it should quickly become clear why a log scale is usually used for word clouds.

Textual similarity

The Similarity tool uses n-gram shingling to identify and visualize text reuse relationships. To use it, first load one or more texts, select any desired options, and click “Run”.

What this tool identifies are shared n-grams between parts of the specified texts: rather than reporting all n-grams (as the N-gram tool does), it reports only those n-grams that are repeated in more than one place, and calculates the total number of shared n-grams between pairs of chapters. Thus unlike in the N-gram tool (when “Minimum count” is set to 1), larger values of n result in fewer results being reported: shorter n-grams are more likely to occur in multiple places, while longer ones are less common, as well as more strongly indicative of a text reuse relationship between the items being compared.

There are two tabs within the output for the similarity tool: the “Matched text” tab shows the n-grams which matched, with brighter shades of red corresponding to greater numbers of overlapping n-grams; the “Chapter summary” tab aggregates the counts of matched n-grams between all pairs of chapters.
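The core of the shingling idea can be sketched in Python: collect the set of n-grams occurring in each chapter, then count how many are shared by each pair. The chapter texts and names below are invented toy strings of my own; Text Tools’ actual scoring and options are more involved:

```python
from itertools import combinations

def ngram_set(text, n):
    """The set of distinct character n-grams occurring in text."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def shared_ngram_counts(chapters, n):
    """Count distinct n-grams shared between each pair of chapters.

    chapters: dict mapping chapter title -> chapter text.
    Returns a dict mapping (title_a, title_b) -> number of shared n-grams.
    """
    sets = {title: ngram_set(text, n) for title, text in chapters.items()}
    return {(a, b): len(sets[a] & sets[b]) for a, b in combinations(sets, 2)}

chapters = {
    "甲": "學而時習之不亦說乎",
    "乙": "學而時習之則不難矣",   # invented line, not a real quotation
    "丙": "有朋自遠方來不亦樂乎",
}
result = shared_ngram_counts(chapters, 3)
print(result)
```

Here chapters 甲 and 乙 share the opening three 3-grams, while 丙 shares no 3-gram with either; aggregated over whole chapters, counts like these are what the “Chapter summary” tab reports.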

Exercise:
  • Run the similarity tool on the Analects with n=5.
  • Experiment with the “Constraint” function by clicking on chapter titles to limit the display to passages having parallels with the specified chapter or pair of chapters.
  • Select a few of the matched n-grams by clicking on them; this will result in a different type of constraint showing where exactly that n-gram was matched.
  • Text reuse can be visualized as a weighted network graph. You can do this for your n-gram similarity results by clicking the “Create graph” link in the “Chapter summary” tab, then clicking “Draw”.
  • Which chapters of the Analects have the strongest text reuse relationship according to this metric? You can probably see this straight away from the graph, however you can also check this numerically by returning to the Chapter summary tab, and sorting the table by similarity – clicking on the titles of columns in any Text Tools table sorts it by that column (click a second time to toggle sort order).
  • Returning to the graph (you can click the “Network” link at the top of the page to switch pages), the edges of the graph have a concrete meaning defined in terms of identified similarities. Double-clicking on an edge will reopen the Similarity tool, with the specific similarities underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges.
  • Experiment with increasing and decreasing the value of n – how does this affect the results?
  • By default, the graph contains edges representing every similarity identified. Particularly for smaller values of n, some of these relationships will not be significant, and this may result in edges being drawn between almost all pairs of nodes in the graph, complicating the picture and obscuring genuine patterns. Experiment with simplifying the graph by setting a threshold (e.g. 0.001) for the “Skip edges with weight less than” setting – this will simplify the graph by removing those edges with relatively small amounts of reuse. Compare this with the results of increasing the value of n in the similarity tool, which will also decrease the number of edges as more trivial similarities are excluded.
  • The Similarity tool also works with multiple texts; if multiple texts are loaded and a graph is created, different colors will be used to distinguish between chapters of different texts. Try this with the Xunzi and the Zhuangzi, two very dissimilar texts which nonetheless do have reuse relationships with one another (this may take a few seconds to run – the similarity tool will take longer for larger amounts of text).

Regular expressions

A regular expression (often shortened to “regex”) is a pattern which can be searched for in a body of text. In the simplest case, a regular expression is simply a string of characters to search for; however by supplementing this simple idea with specially designed syntax, it is possible to express much more complex ways of searching for data.

The regex tool makes it possible to search within one or more texts for one or more regular expressions, listing matched text as well as aggregating counts of results per-text, per-chapter, or per-paragraph.

Exercise:
  • The simplest type of regular expression is simply a character string search – i.e. a list of characters in order which will match (only) that precise sequence of characters – one type of full-text search. Try searching the text of the Analects for something you would expect to appear in it (e.g. “君子”).
  • Examine the contents of the “Matched text” and “Summary” tabs.
  • Add a second search phrase (e.g. “小人”) to your search, and re-run the regex.
  • Re-run the same search again using the same two regular expressions, but changing “Group rows by” from the default “None” to “Paragraph”. When you do this, the “Summary” tab will show one row for every passage in the Analects. Try clicking on a numbered paragraph (these numbers are chosen automatically starting from the beginning of the text) – this will highlight the passage corresponding to that row.

Search results like these can be relational when grouped by a unit such as a paragraph or chapter: if two terms appear together in the same paragraph (or chapter), this can indicate some relationship between the two; if they repeatedly occur together in many paragraphs, this may indicate a stronger relationship between the two in that text. It is thus possible to use a network graph to visualize this information; you can do this in Text Tools by running regular expressions and setting “Group rows by” to “Paragraph”.
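A minimal sketch of this kind of co-occurrence counting in Python (the paragraphs below are short Analects quotations used purely for illustration, and the function name is my own; Text Tools performs this counting, plus the graph construction, internally):

```python
from collections import Counter
from itertools import combinations

def cooccurrence(paragraphs, terms):
    """Count how many paragraphs contain each pair of terms together."""
    counts = Counter()
    for para in paragraphs:
        present = [t for t in terms if t in para]
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

paragraphs = [
    "君子喻於義,小人喻於利",
    "君子坦蕩蕩,小人長戚戚",
    "不學禮,無以立",
]
result = cooccurrence(paragraphs, ["君子", "小人", "禮"])
print(result)
```

Each pair’s count becomes the weight of an edge in the resulting network graph; terms that never share a paragraph get no edge at all.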

Exercise:
  • Search for the terms 父母, 君子, 小人, 禮, and 樂 in the Analects, and construct a network graph based on their co-occurrence in the same paragraphs of text.
  • Double-clicking on an edge in this graph will reopen the Regex tool, with the specific matches underwriting the selected edge highlighted. Examine some of the edges using this function, including the thickest and thinnest edges, to see what data they actually represent.
  • Using the same method but specifying a list of character names (寶玉, 黛玉, 寶釵, etc. – you can get a list of more names from Wikipedia), map out how character names co-occur in paragraphs of the Hongloumeng. Note: you will need to make sure that you choose names frequently used in the actual text (e.g. “賈寶玉” is only infrequently used; “寶玉” is far more common – and will also match cases of “賈寶玉”). This is one example of social network analysis.
  • When you set “Group rows by” to “None”, you can temporarily add constraints to the “Matched text” view to show only those paragraphs which matched a particular search string. You can set or remove a constraint by clicking on a matched string in the “Matched text” view; you can also click the text label of an item in the “Summary” view to set that item as the constraint, and so see at a glance which paragraphs contained that particular string. Re-run your search with the same terms but in “None” mode, and use this to quickly see which passages the least-frequently occurring name from your list appeared in.

[A word of caution: when performing this type of search, it is important to examine the matched text to confirm whether "too much" may be matched, as well as whether other things may be missed. In the Hongloumeng example above, for instance, although the vast majority of string matches for "寶玉" in the text do indeed refer to 賈寶玉, another character who appears later in the novel is called "甄寶玉" – these occurrences will also match a simple search for the string "寶玉". In this particular example, this can be avoided by constructing a regular expression to exclude these other cases – such as the regex "(?<!甄)寶玉", which uses a negative lookbehind to match the string "寶玉" only when it does not come immediately after a "甄".]
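Python’s re module supports this kind of assertion directly; a quick check on an invented sentence (not a quotation from the novel):

```python
import re

sentence = "賈寶玉見了甄寶玉"  # invented illustration
# Negative lookbehind: match 寶玉 only when not immediately preceded by 甄
matches = [m.start() for m in re.finditer(r"(?<!甄)寶玉", sentence)]
print(matches)
```

Only the first 寶玉 (following 賈) is matched; the occurrence following 甄 is excluded.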

So far we have only used the simplest type of regular expressions. Regular expressions also allow for the specification of more complex patterns to be matched in the same way as the simple string searches we have just done – for example, the ability to specify a search for a pattern like “以[something]為[something]”, which would match things like “以和為量”, “以生為本”, or “以我為隱”. In order to do this, regular expressions are created by building on any fixed characters we want to match, with the addition of “special” characters that describe the patterns we are looking for.

Some of the most useful special syntax available in regular expressions is summarized in the following table:

. Matches any one character exactly once
[abcdef] Matches any one of the characters a,b,c,d,e,f exactly once
[^abcdef] Matches any one character other than a,b,c,d,e,f
(xyz) Matches xyz, and saves the result as a numbered group.
? After a character/group, makes that character/group optional (i.e. match zero or 1 times)
? After +, * or {…}, makes matching ungreedy (i.e. choose shortest match, not longest)
* After a character/group, makes that character/group match zero or more times
+ After a character/group, makes that character/group match one or more times
{2,5} After a character/group, makes that character/group match 2,3,4, or 5 times
{2,} After a character/group, makes that character/group match 2 or more times
{2} After a character/group, makes that character/group match exactly 2 times
\3 Matches whatever was matched into group number 3 (first group from left is numbered 1)

The syntax may seem complex, but it is quite easy to get started with. For instance, the first special syntax listed in the table above – a dot (“.”) – matches any one character. So the example above of “以[something]為[something]” can be expressed as the regular expression “以.為.”, read as “match the character ‘以’, followed by any one character, followed by ‘為’, followed by any one character”.
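This pattern behaves the same way in most regex implementations; a quick check in Python on a short constructed string (the first two phrases are the examples quoted above):

```python
import re

sample = "以和為量,以生為本,不以為然"  # constructed for illustration
print(re.findall(r"以.為.", sample))
```

Note that the final “以為然” is not matched: there the character after “以” is “為” itself, and no literal “為” follows two characters later.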

Exercise:
  • Try the regex “以.為.” from the example above in the Zhuangzi, using “Group by None”.
  • In the results of this regex search, you will notice that some matches may not correspond to exactly the type of expression we are really looking for. For example, the above regex will also match “以汝為?”, because punctuation characters are also counted as “characters” when matching regular expressions. One way to exclude these matches from the results is to use a negative character class (which matches everything except a specified list of characters) in the regex instead of the “.” operator (which simply matches any character). A corresponding regex for this example is “以[^。?]為[^。?]” – try this and confirm that it excludes these cases.
  • Because there are many possible punctuation characters, within Text Tools you can also use the shorthand “\W” (upper-case) to stand for any commonly used Chinese punctuation character, and “\w” (lower-case) for any character other than commonly used Chinese punctuation. You should get the same result if you try the previous regex written instead as “以\w為\w”. (Although “\w” and “\W” are a common convention in English regexes, they work slightly differently in different regex implementations, and many do not support them for Chinese.)
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Any four characters where the middle is “之不” – i.e. “視之不見”, “聽之不聞”, etc.

Repetition

Repetition can be accomplished using various repetition operators and modifiers listed in the table above.

  • We can ask that any part of our regular expression be repeated some number of times using the “{a,b}” operator. This modifies the immediately preceding item in the regex (e.g. a specification of a character, or a group), requiring it to be repeated at least a times and at most b times (or any number of times, if b is left blank). If we omit the comma and just write “{a}”, this means that the preceding item must be repeated exactly a times.
  • For example, “仁.{0,10}義” will match the character “仁”, followed by anything from 0 to 10 other characters, followed by the character “義” – it will therefore match things like “仁義”, “仁為之而無以為;上義”, “仁,託宿於義”, etc.
  • The same method works with groups, and requires that the pattern specified by the group (not its contents) be repeated the specified number of times. So for instance “(人.){2,}” will match “人來人往”, “人前人後”, and also “人做人配人疼”.
  • The “+”, “*”, and “?” operators work in exactly the same way as this after a character or group: “+” is equivalent to “{1,}”, “*” to “{0,}”, and “?” to “{0,1}”. (They are, however, frequently used because they are shorter to write.)
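Both repetition examples can be checked directly in Python; the sample string below simply joins together the illustrative phrases from the bullet points above:

```python
import re

sample = "仁義。上仁為之而無以為;上義。人來人往。人做人配人疼。"
print(re.findall(r"仁.{0,10}義", sample))
# findall returns only the captured group for patterns containing one,
# so use finditer and m.group() to see the full match of (人.){2,}
print([m.group() for m in re.finditer(r"(人.){2,}", sample)])
```

Note how the greedy “{0,10}” stretches the second match all the way to “仁為之而無以為;上義”, and how “(人.){2,}” matches two repetitions in one place and three in another.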

Exercise:
  • Try the two specific examples described above (i.e. “仁.{0,10}義” and “(人.){2,}”).
  • Write and test regular expressions to match the following in the Daodejing (ctp:dao-de-jing):
    • Each “phrase” (i.e. punctuated section) of text. In other words, the first match should be “道可道”, the second should be “非常道”, and so on.
    • Match each phrase which contains the term “之” in it.
    • Match each phrase which contains the term “之” in it, but neither as the first character nor as the last.
  • Write and test regular expressions to match the following in the Mozi (ctp:mozi):
    • Any occurrences of the character “君” followed anywhere later in the same sentence by “父” (e.g. “君父”, “…君臣父…”, “君臣上下長幼之節,父…”, etc.).

Groups

Aside from repetition, a lot of the power of regular expressions comes from the ability to divide parts of a match into what are called “groups”, and express further conditions using the matched contents of these groups. This makes it possible to express much more sophisticated patterns.

  • Suppose we want to look for expressions like “君不君”, “臣不臣”, “父不父”, etc. – cases where we have some character, followed by a “不”, then followed by that same character from before (i.e. we aren’t trying to match things like “人不知”).
  • We can do this by “capturing” the first character – whatever it may be – in a group, and then requiring later in our expression that we match the contents of that group again in another place.
  • Capturing something in a group is accomplished by putting parentheses around the part to capture – e.g. “(.)” matches any character and captures it in a group.
  • Groups are automatically numbered starting from 1, beginning with the leftmost opening bracket, and moving through our regex from left to right.
  • We can reference the contents of a matched group using the syntax “\1” to match group 1, “\2” to match group 2, etc.
  • So in our example, “(.).\1” matches any character, followed by any character, followed by the first character again (whatever it was). Try this on the text of the Analects, then try modifying the regex so that it only matches non-punctuation characters (i.e. does not match things like “本,本”).

Another example is a common type of patterned repetition such as “禮云禮云” and “已乎已乎”. In this case, we can use exactly the same approach. One way is to write “(..)\1” – match any two characters, then match those same two characters again; another (equivalent) way is to use two separate groups and write “(.)(.)\1\2” – match any character X, then any character Y, then match X again and then Y again.
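These backreference patterns can be verified in Python; the sample string below just joins together the phrases discussed above:

```python
import re

sample = "君不君,臣不臣,人不知,禮云禮云,已乎已乎"
# findall returns the captured group, i.e. the repeated character itself
print(re.findall(r"(.)不\1", sample))
# For the full matched text, use finditer and m.group()
print([m.group() for m in re.finditer(r"(..)\1", sample)])
```

As expected, “人不知” is excluded by the first pattern, because the character after “不” differs from the one before it.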

Exercise:
  • Write and test a regular expression which matches things like “委委佗佗”, “戰戰兢兢”, etc. in the Book of Poetry (ctp:book-of-poetry).
  • Write and test a regular expression which matches complex repetition of the style “XYZ,ZYX” in the Zhuangzi, where each of X, Y, and Z can be 1-5 characters long. Your regex should match things like “知者不言,言者不知”, “道無以興乎世,世無以興乎道”, and “安其所不安,不安其所安”.

Regex replace

The replace function works in a similar way to the regex search function: this function searches within one specified text for a specified regular expression, and replaces all occurrences of it with a specified value. Although the replacement can be a simple string of characters, it can also be designed to vary depending upon the contents of the regular expression. Specifically, anything that has been matched as a group within the search regex can be referenced in the replacement by using the syntax “$1” to include the text matched in group 1, “$2” for group 2, etc. One common use case for regex replacements is to “tidy up” data obtained from some external source, or to prepare it for use in some particular procedure.

For example:

  • Replacing “\W” with “” (an empty string) will delete all punctuation and line breaks from a text
  • Replacing “^(\w{1,20})$” with “*$1” will add title markers to any lines which contain between 1 and 20 characters, none of which are punctuation characters – this can be useful when importing non-ctext texts.
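Both replacements can be reproduced in Python’s re.sub, with one caveat: Text Tools writes group references in the replacement as $1, while Python writes them as \1. The sample lines below are invented for illustration, and the explicit punctuation class stands in for Text Tools’ Chinese-aware “\W”:

```python
import re

raw = "道可道,非常道。\n名可名,非常名。\n道經\n"  # invented sample lines

# Delete punctuation and line breaks (explicit class standing in for \W)
no_punct = re.sub(r"[,。\n]", "", raw)

# Mark short punctuation-free lines as titles: "*$1" in Text Tools
# is written r"*\1" in Python
titled = re.sub(r"^(\w{1,20})$", r"*\1", raw, flags=re.MULTILINE)
print(no_punct)
print(titled)
```

Python’s built-in “\w” happens to treat CJK ideographs as word characters and common Chinese punctuation as non-word, so only the punctuation-free line “道經” gains a marker here; as noted earlier, this behavior varies between regex implementations.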

Identifying differences between versions

The “Diff” tool provides a simple way of performing a character-by-character diff of two similar pieces of text. Unlike the Similarity tool, this tool works best on input texts which are almost (but not quite) identical to one another.

Try using the Diff tool to compare the contents of the 正統道藏 edition of the 太上靈寶天尊說禳災度厄經 (ctp:wb882781) with the 重刊道藏輯要 edition of the same text (ctp:wb524325).

Network graphs

When you create a graph using the regular expression or similarity tools, the data is exported into the Network tab. For navigation instructions, refer to the “Help” tab. Graphs in the Network tab are specified in a subset of the “GraphViz” format; the graphs created by the other tabs can all be downloaded in this same format. If you would like a more flexible way of creating publication-quality graphs, you can download and install Gephi (https://gephi.org/), which can also open these files.

Using other texts

Chinese texts from other sources besides ctext.org can be used with Text Tools. For instructions on how to prepare these, refer to the section on Loading texts on the Help page.
