EACL 2026

GlobLingDiv: A global dataset linking linguistic diversity and digital support to reveal landscapes with under-resourced languages for NLP

Abstract

Linguistic diversity is increasingly under pressure globally and is becoming ever more relevant in digital contexts, where many languages remain structurally under-resourced, limiting access to language technologies and inhibiting equitable NLP development. To support linguistic diversity, publicly available data are needed that capture both the number of languages spoken and the distribution of speakers across them. We introduce GlobLingDiv, a database that uses country-level speaker distributions to derive language richness and entropy-based diversity measures, alongside a population-weighted digital language support measure. Applying these metrics globally, we examine the association between linguistic diversity and digital support conditions. The results reveal a substantial imbalance: highly diverse linguistic landscapes show comparatively low digital support, underscoring the need for more inclusive NLP environments.
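The abstract does not give the exact formulas, so the following is only a minimal sketch of how such country-level measures could be computed from a speaker distribution. It assumes that richness is the count of languages with at least one speaker, that the entropy-based measure is Shannon entropy over speaker shares, and that digital support is a speaker-weighted mean of per-language support scores in [0, 1]. The function names, the input dictionaries, and the choice of natural logarithm are all hypothetical and are not taken from GlobLingDiv itself.

```python
import math


def language_richness(speakers):
    """Number of languages with at least one speaker (assumed definition)."""
    return sum(1 for n in speakers.values() if n > 0)


def shannon_entropy(speakers):
    """Shannon entropy (natural log) of the speaker shares.

    Assumption: the paper's "entropy-based diversity" is treated here as
    plain Shannon entropy; the actual measure may differ.
    """
    total = sum(speakers.values())
    if total <= 0:
        return 0.0
    h = 0.0
    for n in speakers.values():
        if n > 0:
            p = n / total
            h -= p * math.log(p)
    return h


def weighted_digital_support(speakers, support):
    """Speaker-weighted mean of per-language support scores.

    `support` maps language -> score in [0, 1] (hypothetical scale);
    languages missing from `support` are counted as 0.0.
    """
    total = sum(speakers.values())
    if total <= 0:
        return 0.0
    return sum(n * support.get(lang, 0.0) for lang, n in speakers.items()) / total


# Illustrative, made-up country with two equally sized language communities.
example_speakers = {"lang_a": 50, "lang_b": 50}
example_support = {"lang_a": 1.0, "lang_b": 0.0}

richness = language_richness(example_speakers)
entropy = shannon_entropy(example_speakers)
support_score = weighted_digital_support(example_speakers, example_support)
```

In this toy case the richness is 2, the entropy equals ln 2 (the maximum for two languages), and the weighted support is 0.5, because only half of the speakers use a language with any digital support. A country dominated by a single well-supported language would instead show entropy close to 0 and support close to 1, which is the kind of contrast the abstract describes.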
