Paper presented at Greek and Latin in an age of Open Data:
Phrase-based alignment of classical Chinese and English
Donald Sturgeon and John S. Y. Lee
Abstract
Aligned parallel corpora are useful for a variety of purposes, including machine translation and statistical studies, and they make possible new and innovative digital tools for use in pedagogy and research. Alignments can be made at various levels of granularity, a common type being alignment of sentences. In the case of classical Chinese in particular, databases containing such alignments are also of direct utility to scholars and linguists due to the complex semantics of individual terms of the language, the limited size of the extant body of writing, and a lack of sufficiently comprehensive bilingual dictionaries. Aligned corpora make possible automated extraction of relevant linguistic data for arbitrary terms, while avoiding the prohibitively high cost involved in manual construction of an adequate bilingual dictionary.
While in many modern languages sentences are delimited in the written form by the presence of certain punctuation marks, classical Chinese was for many centuries written without any punctuation marks whatsoever, and later with punctuation that delimited only boundaries between phrases. Modern editions of classical Chinese texts include punctuation marks corresponding closely to (and greatly influenced by) modern English punctuation, but often disagree on the precise details of such punctuation, highlighting the degree of freedom present in adding such marks. Due to the grammar of classical Chinese, this freedom often extends to choices determining apparent sentence boundaries. Similarly, and partly as a result of this, English translations of these texts often differ in how they delimit sentences in the source text.
As a result of these linguistic and historical factors, sentence-based alignment of classical Chinese texts and their modern translations is problematic: sentences of the source and target languages often fail to correspond exactly due to different choices made in punctuating the text, even where these do not correspond to significant differences in interpretation. By contrast, because of the much lower degree of freedom involved, different modern editions of early texts exhibit much less disagreement regarding the delimiting of phrases.
Motivated by these factors, this study investigates automated phrase-wise alignment of a corpus of classical Chinese texts and their English translations, comparing unsupervised machine-generated phrase-wise alignments with sentence-wise alignments by means of human-annotated results.