MultiLegalPile: A 689GB Multilingual Legal Corpus

Joel Niklaus, Veton Matoshi, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review

4 Citations (Scopus)
8 Downloads (Pure)

Abstract

Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, few datasets are available for specialized critical domains such as law, and the available ones are often small and only in English. To fill this gap, we curate and release MULTILEGALPILE, a 689GB corpus in 24 languages from 17 jurisdictions. MULTILEGALPILE includes diverse legal data sources and allows for pretraining NLP models under fair use, with most of the dataset licensed very permissively. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, trained models, and all code under the most open licenses possible.

Original language: English
Title of host publication: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Lun-Wei Ku, Andre F. T. Martins, Vivek Srikumar
Number of pages: 18
Publisher: Association for Computational Linguistics (ACL)
Publication date: 2024
Pages: 15077–15094
ISBN (Electronic): 9798891760943
DOIs
Publication status: Published - 2024
Event: 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Bangkok, Thailand
Duration: 11 Aug 2024 – 16 Aug 2024

Conference

Conference: 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024
Country/Territory: Thailand
City: Bangkok
Period: 11/08/2024 – 16/08/2024
Sponsor: Apple, et al., Google DeepMind, LG AI Research, Meta AI, NewsBreak

Bibliographical note

Publisher Copyright:
© 2024 Association for Computational Linguistics.
