ParlaMint: Comparable Corpora of European Parliamentary Data

Tomaž Erjavec, Maciej Ogrodniczuk, Petya Osenova, Andrej Pančur, Nikola Ljubešić, Tommaso Agnoloni, Starkaður Barkarson, María Calzada Pérez, Çağrı Çöltekin, Matthew Coole, Roberts Darģis, Luciana D. de Macedo, Jesse de Does, Katrien Depuydt, Sascha Diwersy, Dorte Haltrup Hansen, Matyáš Kopp, Tomas Krilavičius, Giancarlo Luxardo, Maarten Marx, Vaidas Morkevičius, Costanza Navarretta, Paul Rayson, Orsolya Ring, Michał Rudolf, Kiril Simov, Steinþór Steingrímsson, István Üveges, Ruben van Heusden, Giulia Venturi

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review

Abstract

This paper outlines the ParlaMint project from the perspective of its goals, tasks, participants, results and application potential. The project produced language corpora from the sessions of the national parliaments of 17 countries, almost half a billion words in total. The corpora are split into COVID-related subcorpora (from November 2019) and reference corpora (to October 2019). The corpora are uniformly encoded according to the ParlaMint schema with the same Universal Dependencies linguistic annotations. Samples of the corpora and conversion scripts are available from the project’s GitHub repository. The complete corpora are openly available for download via the CLARIN.SI repository, and for exploration and analysis through the NoSketch Engine and KonText concordancers as well as the Parlameter interface.
Original language: English
Title of host publication: Proceedings of CLARIN Annual Conference 2021
Publisher: CLARIN ERIC
Publication date: 2021
Pages: 19-24
Publication status: Published - 2021