Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Aloka Fernando; Nisansa de Silva; Menan Velayuthan; Charitha Rathnayake; Surangika Ranathunga

2025 EMNLP EMNLP 2025

Improving the Quality of Web-mined Parallel Corpora of Low-Resource Languages using Debiasing Heuristics

Abstract

AbstractParallel Data Curation (PDC) techniques aim to filter out noisy parallel sentences from web-mined corpora. Ranking sentence pairs using similarity scores on sentence embeddings derived from Pre-trained Multilingual Language Models (multiPLMs) is the most common PDC technique. However, previous research has shown that the choice of the multiPLM significantly impacts the quality of the filtered parallel corpus, and the Neural Machine Translation (NMT) models trained using such data show a disparity across multiPLMs. This paper shows that this disparity is due to different multiPLMs being biased towards certain types of sentence pairs, which are treated as noise from an NMT point of view. We show that such noisy parallel sentences can be removed to a certain extent by employing a series of heuristics. The NMT models, trained using the curated corpus, lead to producing better results while minimizing the disparity across multiPLMs. We publicly release the source code and the curated datasets

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — debiasing heuristics

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Aloka Fernando , Nisansa de Silva , Menan Velayuthan , Charitha Rathnayake , Surangika Ranathunga

Topics

Machine Learning > Application Areas > Domain Adaptation Natural Language Processing > Applications > Machine Translation Natural Language Processing > Resources & Methods > Multilingual NLP

Keywords

neural machine translation parallel corpus low-resource language data curation multilingual language model corpus filtering debiasing heuristics

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025