2025 NAACL NAACL 2025

M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models

Abstract

AbstractCollecting instruction fine-tuning (IFT) data is a resource and time intensive task especially in multilingual setting where finding proficient native speakers is challenging. Moreover, traditional data collection is prone to privacy risks, toxicity and lacks scalability. While, fully synthetic datasets are a promising alternative, research on their use in multilingual domain is limited as existing approaches still rely on machine translation to improve multilingual performance. To bridge this gap we introduce M2Lingual, the first fully synthetic, multi-turn multilingual dataset having 175K conversations across 70 languages with a balanced mix of high, low and mid-resourced languages. M2Lingual is constructed using a cost-efficient and scalable method that uses our novel two-step Evol prompt taxonomy to transform a small set of human written instructions to complex and challenging conversations. Results across three model families, six baseline datasets and evaluation spanning 31 languages demonstrates the effectiveness of M2Lingual over other datasets.

🧭 Keyword Pioneer — multilingual dialogue generation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio