Enhancing CLARIN-DK Resources While Building the Danish ParlaMint Corpus

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

Abstract

In this paper we describe the Danish CLARIN resources, corpora, tools and workflow that we used and enhanced in order to build the Danish ParlaMint corpus as part of the CLARIN-funded ParlaMint project. More specifically, the article accounts for the manual and automatic processes involved in preparing the Danish parliamentary speeches, with a focus on the CLARIN-DK tools and Text Tonsorium workflow management. The tools annotated the speeches with metadata and linguistic information in compliance with the common ParlaMint TEI P5 format. As a spin-off of the project, the CLARIN-DK sentence tokenizer and the CST Named Entity Recognizer were improved. These tools, together with the CST-lemmatiser, the Danish UD-Pipe software and several data transformation utilities, produced all the linguistic annotations in the correct format. We conclude the paper with a report on a pilot evaluation of the quality of some of the linguistic annotations in the Danish ParlaMint corpus.
Original language: English
Title of host publication: CLARIN Annual Conference 2021 Proceedings
Publisher: CLARIN ERIC
Publication date: 2021
Pages: 70-73
Publication status: Published - 2021
