Data Selection for Unsupervised Translation of German–Upper Sorbian

Lukas Edman; Antonio Toral; Gertjan van Noord

EMNLP 2020

Abstract

This paper describes the methods behind the systems submitted by the University of Groningen for the WMT 2020 Unsupervised Machine Translation task for German–Upper Sorbian. We investigate the usefulness of data selection in the unsupervised setting. We find that we can perform data selection using a pretrained model and show that the quality of a set of sentences or documents can have a great impact on the performance of the UNMT system trained on it. Furthermore, we show that document-level data selection should be preferred for training the XLM model when possible. Finally, we show that there is a trade-off between quality and quantity of the data used to train UNMT systems.

Topics

Machine Learning > Learning Types > Unsupervised Learning; Natural Language Processing > Applications > Machine Translation; Deep Learning > Learning Types > Unsupervised Learning

Keywords

parallel corpus; data selection; cross-lingual model; unsupervised machine translation; document-level training
