Enhancing Tuvan Language Resources through the FLORES Dataset

Ali Kuzhuget; Airana Mongush; Nachyn-Enkhedorzhu Oorzhak

EMNLP 2024

Abstract

FLORES is a benchmark dataset designed for evaluating machine translation systems, particularly for low-resource languages. This paper, prepared as part of the Open Language Data Initiative (OLDI) shared task, presents our contribution to expanding the FLORES dataset with high-quality translations from Russian into Tuvan, an endangered Turkic language. Our approach drew on the linguistic expertise of native speakers to ensure both accuracy and cultural relevance in the translations. This project represents a significant step forward in supporting Tuvan as a low-resource language in natural language processing (NLP) and machine translation (MT).

Topics

Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

machine translation, low-resource language, translation dataset, FLORES benchmark, Tuvan language
