TY - GEN
T1 - The Danish Gigaword Corpus
AU - Strømberg-Derczynski, Leon
AU - Ciosici, Manuel Rafael
AU - Christiansen, Morten H.
AU - Baglini, Rebekah Brita
AU - Dalsgaard, Jacob Aarup
AU - Fusaroli, Riccardo
AU - Henrichsen, Peter Juel
AU - Hvingelby, Rasmus
AU - Kirkedal, Andreas
AU - Kjeldsen, Alex Speed
AU - Ladefoged, Claus
AU - Nielsen, Finn Årup
AU - Madsen, Jens
AU - Petersen, Malte Lau
AU - Rystrøm, Jonathan Hvithamar
AU - Varab, Daniel
PY - 2021
Y1 - 2021
AB - Danish language technology has been hindered by a lack of broad-coverage corpora at the scale modern NLP prefers. This paper describes the Danish Gigaword Corpus, the result of a focused effort to provide a diverse and freely-available one billion word corpus of Danish text. The Danish Gigaword corpus covers a wide array of time periods, domains, speakers’ socio-economic status, and Danish dialects.
M3 - Article in proceedings
SP - 413
EP - 421
BT - Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)
PB - Linköping University Electronic Press
ER -