Community OSCAR: A Community Effort for Multilingual Web Data

Manuel Brack; Malte Ostendorff; Pedro Ortiz Suarez; Jose Javier Saiz; Iñaki Lacunza Castilla; Jorge Palomar-Giner; Alexander Shvets; Patrick Schramowski; Georg Rehm; Marta Villegas; Kristian Kersting

EMNLP 2024

Abstract

The development of large language models (LLMs) relies heavily on extensive, high-quality datasets. Publicly available datasets focus predominantly on English, leaving other language communities behind. To address this issue, we introduce Community OSCAR, a multilingual dataset initiative designed to narrow the gap between English and non-English data availability. Through a collective effort, Community OSCAR covers over 150 languages with 45 billion documents, totaling over 345 TiB of data. Initial results indicate that Community OSCAR provides valuable raw data for training LLMs and for enhancing the performance of multilingual models. This work aims to contribute to ongoing advances in multilingual NLP and to support a more inclusive AI ecosystem by making high-quality multilingual data more accessible to those working with low-resource languages.

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning
Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Resources & Methods > Text Representation

Keywords

corpus construction; low-resource language; multilingual dataset; data preprocessing; large language model; web data
