Abstract
Grammatical Error Correction (GEC) is the research field concerned with computational methods for correcting grammatical errors in text. With the vast amounts of content currently being produced online, these methods hold the promise of improving human communication by enabling clear and error-free prose.
While GEC is a thoroughly studied field in academia, industrial adoption has been limited. Three obstacles in particular hold back widespread industrial adoption: current academic GEC systems 1) depend on large amounts of expensive training data; 2) are mostly evaluated on text written by English language learners, leaving the systems’ performance beyond this domain unclear; and 3) are mainly developed for the English language.
This thesis presents research into tackling these obstacles in order to bridge the gap between academic research and industrial use. In the first part of the thesis, we investigate two avenues for building low-resource GEC systems. Firstly, we show that leveraging artificially generated training data improves systems’ ability to detect subject-verb agreement errors, particularly improving robustness to challenging linguistic phenomena. Secondly, we show that language models trained by self-supervision can be used to build viable GEC systems that do not rely on annotated training data. In the second part of the thesis, we examine GEC systems’ ability to generalize beyond the English language learner domain: we release a new GEC benchmark, CWEB, consisting of website text annotated for correctness, and show that current GEC systems do not generalize well to this domain. In the final part, we focus on GEC for non-English languages and investigate strategies for leveraging available sources of noisy data. We show that GEC systems pre-trained on noisy data can be fine-tuned effectively on only small amounts of expert-annotated data, which opens up the possibility of creating inexpensive GEC systems for new languages.
| Original language | English |
|---|---|
| Publisher | Department of Computer Science, Faculty of Science, University of Copenhagen |
| Number of pages | 111 |
| Publication status | Published - 2021 |