Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition

Konstantin Todorov*, Giovanni Colavizza

*Corresponding author for this work

Publication: Contribution to journal › Conference article › Research › peer review

8 Citations (Scopus)

Abstract

Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of the use of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess the use of pre-trained language models on data in French, German and English. We find that pre-trained language models help with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base to allow replication of our results and the testing of other pre-trained representations.
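The paper's released code base covers the full experimental pipeline; as a minimal, hypothetical sketch of the kind of setup being assessed, the Python snippet below applies an off-the-shelf pre-trained multilingual model to NER on OCR-like noisy text. The model checkpoint, example sentence, and OCR noise are illustrative assumptions, not the paper's actual pipeline.

# Minimal sketch: NER with a pre-trained multilingual model on noisy OCRed text.
# NOT the authors' released code base; checkpoint and text are assumptions.
from transformers import pipeline

# Load an assumed off-the-shelf multilingual NER checkpoint from the
# Hugging Face Hub; any comparable token-classification model would do.
ner = pipeline(
    "token-classification",
    model="Davlan/bert-base-multilingual-cased-ner-hrl",
    aggregation_strategy="simple",
)

# OCRed historical text often carries character-level noise
# (here "Pariis" stands in for "Paris").
noisy_sentence = "Napoléon Bonaparte entra dans Pariis en 1815."

# Print the recognized entity spans with their labels and scores.
for entity in ner(noisy_sentence):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))

A sketch like this illustrates why the paper's question matters: the model was pre-trained on clean modern text, so its robustness to OCR noise and historical spelling is exactly what needs assessing.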

Original language: English
Journal: CEUR Workshop Proceedings
Volume: 2723
Pages (from-to): 310-339
Number of pages: 30
ISSN: 1613-0073
Status: Published - 2020
Published externally: Yes
Event: 1st Workshop on Computational Humanities Research, CHR 2020 - Virtual, Amsterdam, Netherlands
Duration: 18 Nov 2020 - 20 Nov 2020

Conference

Conference: 1st Workshop on Computational Humanities Research, CHR 2020
Country/Territory: Netherlands
City: Virtual, Amsterdam
Period: 18/11/2020 - 20/11/2020

Bibliographical note

Publisher Copyright:
© 2020 Copyright for this paper by its authors.
