Publication

Development of natural language processing techniques and resources for Old Irish; with an application for the detection of authors in the Würzburg Glosses

Abstract
Old Irish lacks digital resources, even compared to other historical European languages, and relatively few attempts have been made to apply well-established natural language processing techniques to it. Where attempts have been made, either to create resources or to apply modern computational techniques, it has become apparent that certain roadblocks exist for Old Irish which do not obstruct similar efforts in other languages. These roadblocks are not clearly identified in the literature; however, this research suggests that issues relating to tokenisation, part-of-speech tagging, and associated grammatical implications are among the most significant. Little focus has been given to these factors until now, and a conclusive review detailing their impact on attempts to create digital resources for Old Irish has never been carried out. Moreover, no attempt has been made to demonstrate that removing these roadblocks can enable the successful application of established natural language processing techniques to Old Irish text. This research addresses the major factors limiting success in the digitisation of Old Irish text and in the application of established natural language processing techniques to it, and demonstrates that these factors can be overcome. This necessitates an assessment of common practice in text digitisation and natural language processing as applied to other languages, and an assessment of both the manuscript orthography and the grammatical tradition which has been built up around Old Irish. Where other languages have seen success in digitisation and natural language processing projects, the linguistic features which distinguish these languages from Old Irish are examined in an attempt to mitigate their effect on success rates when such projects are attempted for Old Irish text.
It is demonstrated that many of the factors limiting success rates can be alleviated by moving away from the conventions of Old Irish grammar which were formalised at the turn of the last century, at least at a sub-surface, computational level. That is to say, it is possible to process the text in a manner which deviates from the traditional grammar of Old Irish while still presenting it to an end user in a more conventional manner. Certain assumptions regarding the nature of written language are inherent in the formats of many of the most common frameworks for the collection of annotated digital text, and hence are inherent in natural language processing techniques which depend upon this type of text data. Many of these assumptions, though fundamental enough to have been overlooked in some cases, are shown to be mismatched with either the orthographic reality of Old Irish text or the grammatical tradition of the language. A tokenisation and part-of-speech tagging standard was developed for Old Irish in an attempt to overcome these mismatched expectations. In this thesis it is demonstrated that this new standard for word separation is more suitable for the digital representation of Old Irish text than any which has come before it. It is shown to enable the successful application of certain natural language processing techniques to the language for the first time. It also enables the creation of a machine-readable lexicon of Old Irish tokens, which has not been possible until now because scholars disagree on word boundaries, leaving the resulting resources inconsistent with one another. A case study in the suitability of this standard is detailed, in which it is shown that applying well-documented author recognition techniques, with a proven track record in other languages, to the text of the Würzburg Glosses enables the successful separation of the work of the three scribal hands.
The results of this experiment not only demonstrate the suitability of the tokenisation and part-of-speech tagging standard applied, but also add evidence to the debate on the authorship and composition of Old Irish gloss material.
Publisher
University of Galway
Rights
Attribution-NonCommercial-NoDerivatives 4.0 International