Abstract
Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models on data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow others to replicate our results and to test other pre-trained representations.
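As an illustration of the kind of setup the paper evaluates, the sketch below applies a pre-trained multilingual language model to NER on an OCR-style noisy sentence. This is a minimal sketch only, assuming the Hugging Face `transformers` library and an off-the-shelf multilingual NER checkpoint (the model name is an assumption for illustration); it is not the authors' released code base.

```python
# Minimal illustrative sketch: NER with a pre-trained multilingual language model
# on OCR-style noisy text. Assumes the Hugging Face `transformers` library; the
# checkpoint name is an assumption for illustration, not the authors' released code.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",  # assumed multilingual NER model
    aggregation_strategy="simple",  # merge word-piece predictions into whole entities
)

# OCRed historical text often contains character-level noise ("Par1s" for "Paris").
ocr_sentence = "Le general Bonaparte est arrive a Par1s le 16 octobre 1799."
for entity in ner(ocr_sentence):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```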
Original language | English |
---|---|
Journal | CEUR Workshop Proceedings |
Volume | 2723 |
Pages (from-to) | 310-339 |
Number of pages | 30 |
ISSN | 1613-0073 |
Status | Published - 2020 |
Published externally | Yes |
Event | 1st Workshop on Computational Humanities Research, CHR 2020 - Virtual, Amsterdam, Netherlands. Duration: 18 Nov 2020 → 20 Nov 2020 |
Conference
Conference | 1st Workshop on Computational Humanities Research, CHR 2020 |
---|---|
Country/Territory | Netherlands |
City | Virtual, Amsterdam |
Period | 18/11/2020 → 20/11/2020 |
Bibliographical note
Publisher Copyright: © 2020 Copyright for this paper by its authors.