MultiLegalPile: A 689GB Multilingual Legal Corpus

Joel Niklaus; Veton Matoshi; Matthias Stürmer; Ilias Chalkidis; Daniel Ho

2024 ACL ACL 2024

MultiLegalPile: A 689GB Multilingual Legal Corpus

Abstract

AbstractLarge, high-quality datasets are crucial for training Large Language Models (LLMs). However, so far, few datasets are available for specialized critical domains such as law and the available ones are often small and only in English. To fill this gap, we curate and release MultiLegalPile, a 689GB corpus in 24 languages from 17 jurisdictions. MultiLegalPile includes diverse legal data sources and allows for pretraining NLP models under fair use, with most of the dataset licensed very permissively. We pretrain two RoBERTa models and one Longformer multilingually, and 24 monolingual models on each of the language-specific subsets and evaluate them on LEXTREME. Additionally, we evaluate the English and multilingual models on LexGLUE. Our multilingual models set a new SotA on LEXTREME and our English models on LexGLUE. We release the dataset, trained models, and all code under the most open licenses possible.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — legal corpus

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Joel Niklaus , Veton Matoshi , Matthias Stürmer , Ilias Chalkidis , Daniel Ho

Topics

Artificial Intelligence > Core AI > Foundation Models Natural Language Processing > Resources & Methods > Multilingual NLP Machine Learning > Learning Types > Transfer Learning Deep Learning > Models > Large Language Models Machine Learning > Learning Types > Multi-Lingual Learning

Keywords

domain adaptation natural language understanding pretrained language model multilingual dataset multilingual language model legal nlp legal natural language processing legal corpus

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024