Paper presented at AAS 2016, Seattle, April 1, 2016
The classical Chinese corpus has long been recognized to contain a vast amount of text reuse: closely related textual content that, for a variety of reasons, occurs in multiple works that might otherwise be considered to be quite independent creations ascribed to entirely different authors. Although this reuse occasionally involves explicit citation of a particular work, or acknowledgment that what follows is a widely known saying as opposed to an original invention of the author, far more often no indication is given that a passage may have been borrowed from elsewhere. Identifying such instances of reuse can shed light upon difficult issues of authorship and textual history, as well as highlight textual variations that can provide clues to the interpretation of obscure or disputed passages.
Digital methods make it possible to explore and analyze text reuse not only in isolated instances, but systematically across a corpus of works as a whole. In this paper I propose methods of identifying two distinct types of text reuse in the classical Chinese corpus and evaluate the accuracy each achieves. The first type is overtly similar or “parallel” passages, which can be reliably located by defining and maximizing appropriate similarity metrics over regions of text. The second is less direct allusion to the content of earlier works, which is considerably more challenging to identify. For this second type I propose an approach that makes use of information retrieval and machine learning techniques, while also leveraging statistical data derived from the more easily identified “parallel” passages.
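To make the idea of maximizing a similarity metric over regions of text concrete, the following is a minimal sketch using character-level local alignment (the Smith-Waterman algorithm). It is offered only as an illustration and is not necessarily the metric used in this paper: the function name `local_align`, the scoring values (`match`, `mismatch`, `gap`), and the two short sample strings are all assumptions chosen for the example.

```python
def local_align(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, region_a, region_b) for the best-scoring pair of
    regions in strings a and b under a simple character-level scoring
    scheme. The scoring values are illustrative assumptions."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of any alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            step = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a fresh region here
                H[i - 1][j - 1] + step,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,    # character of a left unaligned
                H[i][j - 1] + gap,    # character of b left unaligned
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell to find where the region begins.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, a[i:end_i], b[j:end_j]


# Toy strings for illustration only: two short passages differing in one character.
text_a = "子曰學而時習之不亦說乎有朋"
text_b = "故曰學而時習之不亦悅乎也"
score, region_a, region_b = local_align(text_a, text_b)
print(score, region_a, region_b)
```

Because a single differing character costs less than the surrounding matches gain, the highest-scoring region spans the variant rather than stopping at it, which is the behavior one wants when parallel passages differ in small ways. A quadratic-time alignment like this is only practical for pairs of short passages; applying the idea across a whole corpus would presumably require first narrowing the candidate pairs, for example by indexing shared character n-grams.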