Abstract
In recent years, natural language processing (NLP) has proven a compelling tool for modelling various linguistic phenomena by learning from the language contained in text resources. However, although language develops continuously, both resources and models commonly treat it as synchronic. In this dissertation, we ask to what extent NLP models capture notions of language change when presented with tasks involving diachronic text corpora. In particular, we consider two tasks, namely (1) estimating when a text was written, and (2) detecting language change. While these tasks are traditionally treated within the disciplines of linguistics and philology, the growing number of digitised historical text corpora offers an exciting new opportunity to investigate how NLP methods can contribute to them.
This dissertation is article-based and includes six independent studies presented according to two main themes. The first collection of studies focuses on the task of date estimation. As several different types of phenomena can be treated when studying language change, a central question in computational approaches to date estimation is what type of phenomena they aim to capture, a choice that is ultimately realised in the features extracted from the text. In the literature review, we identify two main trends: one represents text as word sequences, while the other uses character sequences. Through a reproduction of approaches from the literature on a diverse set of diachronic text corpora, we find that character models tend to outperform their word-level counterparts. In particular, when evaluated on a novel dataset of medieval charters from Denmark, the predictions made by character models fall within the margin of error employed in philological research in 75 % of the test cases.
The second collection of studies focuses on the interpretation of character models to evaluate their ability to capture sound change. While character models can capture sublexical phenomena relating to morphology and phonology, the extent to which they do so (or whether they are merely efficient representations of lexical terms) is open to debate. Through an intrinsic evaluation of the models, we demonstrate that they can capture (morpho)phonological features relating to the alternation of the linking element between Danish compounds and phonological distinction. These results lead us to propose a novel approach to detecting sound change using diachronic character embeddings trained on historical text sources.
The dissertation contributes promising results demonstrating that NLP models can be used to capture language change. This motivates future research into how NLP methods can be used for data-driven testing of linguistic hypotheses about language change. A remaining challenge is to determine how well the behaviour observed during training generalises to test data from different text collections. We argue that the features captured by character models have the potential to generalise more broadly. However, it is still an open question to what extent these features are independent of corpus-inherent features such as the lexicon.
Original language | English |
---|---|
Place of publication | Copenhagen |
Publisher | Det Humanistiske Fakultet, Københavns Universitet |
Number of pages | 141 |
Status | Published - 14 Jan. 2022 |