An Unsupervised System for Parallel Corpus Filtering

Viktor Hangya; Alexander Fraser

2018 EMNLP EMNLP 2018

An Unsupervised System for Parallel Corpus Filtering

Abstract

AbstractIn this paper we describe LMU Munich’s submission for the WMT 2018 Parallel Corpus Filtering shared task which addresses the problem of cleaning noisy parallel corpora. The task of mining and cleaning parallel sentences is important for improving the quality of machine translation systems, especially for low-resource languages. We tackle this problem in a fully unsupervised fashion relying on bilingual word embeddings created without any bilingual signal. After pre-filtering noisy data we rank sentence pairs by calculating bilingual sentence-level similarities and then remove redundant data by employing monolingual similarity as well. Our unsupervised system achieved good performance during the official evaluation of the shared task, scoring only a few BLEU points behind the best systems, while not requiring any parallel training data.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — monolingual similarity

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Viktor Hangya , Alexander Fraser

Topics

Machine Learning > Learning Types > Unsupervised Learning Natural Language Processing > Applications > Machine Translation Natural Language Processing > Resources & Methods > Text Representation Machine Learning > Learning Types > Representation Learning

Keywords

unsupervised learning machine translation parallel corpus bilingual word embedding sentence pair monolingual similarity bilingual sentence similarity

Download PDF

Related papers

Speeding Up Neural Machine Translation Decoding by Cube Pruning 2018

Limitations in learning an interpreted language with recurrent models 2018

Results of the sixth edition of the BioASQ Challenge 2018

Neural Segmental Hypergraphs for Overlapping Mention Recognition 2018

Hybrid Neural Attention for Agreement/Disagreement Inference in Online Debates 2018