Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian

Juan Antonio Pérez-Ortiz; Felipe Sánchez-Martínez; Víctor M. Sánchez-Cartagena; Miquel Esplà-Gomis; Aaron Galiano Jimenez; Antoni Oliver; Claudi Aventín-Boya; Alejandro Pardos; Cristina Valdés; Jusèp Loís Sans Socasau; Juan Pablo Martínez

EMNLP 2024

Abstract

In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.

Topics

Machine Learning > Application Areas > Domain Adaptation; Natural Language Processing > Applications > Machine Translation; Natural Language Processing > Resources & Methods > Multilingual NLP; Natural Language Processing > Resources & Methods > Language Modeling; Machine Learning > Application Areas > Evaluation

Keywords

dataset creation; machine translation; parallel corpus; low-resource language; multilingual benchmark; Romance language; translation dataset
