Abstract
Large, high-quality datasets are crucial for training Large Language Models (LLMs). However, few datasets are available so far for specialized, critical domains such as law, and those that exist are often small and English-only. To fill this gap, we curate and release MULTILEGALPILE, a 689GB corpus in 24 languages from 17 jurisdictions. MULTILEGALPILE includes diverse legal data sources and allows for pretraining NLP models under fair use, with most of the dataset licensed very permissively. We pretrain two multilingual RoBERTa models and one multilingual Longformer, as well as 24 monolingual models, one on each language-specific subset, and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new state of the art (SotA) on LEXTREME, and our English models do the same on LexGLUE. We release the dataset, trained models, and all code under the most open licenses possible.
Original language | English |
---|---|
Title of host publication | Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) |
Editors | Lun-Wei Ku, Andre F. T. Martins, Vivek Srikumar |
Number of pages | 18 |
Publisher | Association for Computational Linguistics (ACL) |
Publication date | 2024 |
Pages | 15077-15094 |
ISBN (Electronic) | 9798891760943 |
Publication status | Published - 2024 |
Event | 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 - Bangkok, Thailand; Duration: 11 Aug 2024 → 16 Aug 2024 |
Conference
Conference | 62nd Annual Meeting of the Association for Computational Linguistics, ACL 2024 |
---|---|
Country/Territory | Thailand |
City | Bangkok |
Period | 11/08/2024 → 16/08/2024 |
Sponsor | Apple, et al., Google DeepMind, LG AI Research, Meta AI, NewsBreak |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.