An Open Dataset and Model for Language Identification

Laurie Burchell; Alexandra Birch; Nikolay Bogoychev; Kenneth Heafield

2023 ACL ACL 2023

An Open Dataset and Model for Language Identification

Abstract

AbstractLanguage identification (LID) is a fundamental step in many natural language processing pipelines. However, current LID systems are far from perfect, particularly on lower-resource languages. We present a LID model which achieves a macro-average F1 score of 0.93 and a false positive rate of 0.033% across 201 languages, outperforming previous work. We achieve this by training on a curated dataset of monolingual data, which we audit manually to ensure reliability. We make both the model and the dataset available to the research community. Finally, we carry out detailed analysis into our model’s performance, both in comparison to existing open models and by language class.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing and Speech & Audio

🧭 Keyword Pioneer — macro-average f1

🐣 Hot Topic Early Bird — multilingual classification

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio