PARME: Parallel Corpora for Low-Resourced Middle Eastern Languages

Sina Ahmadi; Rico Sennrich; Erfan Karami; Ako Marani; Parviz Fekrazad; Gholamreza Akbarzadeh Baghban; Hanah Hadi; Semko Heidari; Mahîr Dogan; Pedram Asadi; Dashne Bashir; Mohammad Amin Ghodrati; Kourosh Amini; Zeynab Ashourinezhad; Mana Baladi; Farshid Ezzati; Alireza Ghasemifar; Daryoush Hosseinpour; Behrooz Abbaszadeh; Amin Hassanpour; Bahaddin Jalal Hamaamin; Saya Kamal Hama; Ardeshir Mousavi; Sarko Nazir Hussein; Isar Nejadgholi; Mehmet Ölmez; Horam Osmanpour; Rashid Roshan Ramezani; Aryan Sediq Aziz; Ali Salehi Sheikhalikelayeh; Mohammadreza Yadegari; Kewyar Yadegari; Sedighe Zamani Roodsari

ACL 2025

Abstract

The Middle East is characterized by remarkable linguistic diversity, with over 400 million inhabitants speaking more than 60 languages across multiple language families. This study presents PARME, the first parallel corpora for eight severely under-resourced varieties in the region, and addresses fundamental challenges of low-resource scenarios, including non-standardized writing and dialectal complexity. Through an extensive community-driven initiative, volunteers contributed over 36,000 translated sentences, marking a significant milestone in resource development. We evaluate machine translation capabilities through zero-shot approaches and fine-tuning experiments with pretrained machine translation models, and provide a comprehensive analysis of limitations. Our findings reveal significant gaps in existing technologies for processing the selected languages and highlight critical areas for improvement in language technology for Middle Eastern languages.
