Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering

Philipp Koehn; Huda Khayrallah; Kenneth Heafield; Mikel L. Forcada

EMNLP 2018

Abstract

We posed the shared task of assigning sentence-level quality scores for a very noisy corpus of sentence pairs crawled from the web, with the goal of sub-selecting 1% and 10% of high-quality data to be used to train machine translation systems. Seventeen participants from companies, national research labs, and universities participated in this task.
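
The sub-selection step described in the abstract can be illustrated with a minimal sketch. It assumes each sentence pair already has a quality score (higher is better) and that the 1% or 10% budget is measured in whitespace-separated source-side tokens; both the budget unit and the function name `subselect` are assumptions made for illustration, not details taken from the abstract.

```python
from typing import List, Sequence, Tuple


def subselect(
    pairs: Sequence[Tuple[str, str]],
    scores: Sequence[float],
    fraction: float,
) -> List[int]:
    """Return indices of the highest-scoring pairs that fit the budget.

    Assumption: the budget is `fraction` of the total number of
    whitespace-separated tokens on the source side of the corpus.
    """
    if len(pairs) != len(scores):
        raise ValueError("pairs and scores must have the same length")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")

    # Source-side token count for every pair, and the resulting budget.
    lengths = [len(src.split()) for src, _ in pairs]
    budget = fraction * sum(lengths)

    # Visit pairs from best score to worst (ties keep corpus order).
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)

    kept: List[int] = []
    used = 0
    for i in order:
        # Stop at the first pair that would exceed the token budget.
        if used + lengths[i] > budget:
            break
        kept.append(i)
        used += lengths[i]

    # Report the selection in original corpus order.
    return sorted(kept)


if __name__ == "__main__":
    toy_pairs = [("a b", "x"), ("c", "y"), ("d e f", "z"), ("g", "w")]
    toy_scores = [0.9, 0.1, 0.5, 0.8]
    print(subselect(toy_pairs, toy_scores, 0.5))
```

Calling the function once with `fraction=0.01` and once with `fraction=0.10` would yield the two subsets under these assumptions. Stopping at the first pair that overflows the budget keeps the selection strictly score-ordered, at the cost of possibly leaving a small part of the budget unused.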

Authors

Philipp Koehn, Huda Khayrallah, Kenneth Heafield, Mikel L. Forcada

Topics

Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Text Representation

Keywords

parallel corpus filtering, sentence-level quality, corpus quality, quality scoring, machine translation training
