2024 COLING COLING 2024

Multilinguality or Back-translation? A Case Study with Estonian

Abstract

AbstractMachine translation quality is highly reliant on large amounts of training data, and, when a limited amount of parallel data is available, synthetic back-translated or multilingual data can be used in addition. In this work, we introduce SynEst, a synthetic corpus of translations from 11 languages into Estonian which totals over 1 billion sentence pairs. Using this corpus, we investigate whether adding synthetic or English-centric additional data yields better translation quality for translation directions that do not include English. Our results show that while both strategies are effective, synthetic data gives better results. Our final models improve the performance of the baseline No Language Left Behind model while retaining its source-side multilinguality.

โ“ The Questioner
๐ŸŒ‰ Interdisciplinary Bridge โ€” Artificial Intelligence and Machine Learning and Natural Language Processing
๐Ÿ Cross-Pollinator โ€” Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio