The JHU Parallel Corpus Filtering Systems for WMT 2018

Huda Khayrallah; Hainan Xu; Philipp Koehn

EMNLP 2018

Abstract

This work describes our submission to the WMT18 Parallel Corpus Filtering shared task. We use a slightly modified version of the Zipporah Corpus Filtering toolkit (Xu and Koehn, 2017), which computes an adequacy score and a fluency score for each sentence pair, and we use a weighted sum of the two scores as the selection criterion. Our approach differs from Zipporah in that we experiment with computing the combination weights from the noisy corpus to be filtered, and thus avoid generating synthetic data as standard Zipporah does.
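The selection criterion described in the abstract can be illustrated with a minimal sketch. It assumes the adequacy and fluency scores have already been computed for each sentence pair and that the combination weights are given. The names `ScoredPair`, `combined_score`, and `select_top`, along with the example weights and scores, are hypothetical and are not taken from the Zipporah toolkit.

```python
from typing import List, NamedTuple


class ScoredPair(NamedTuple):
    """A sentence pair with precomputed scores (hypothetical container)."""
    source: str
    target: str
    adequacy: float
    fluency: float


def combined_score(pair: ScoredPair, w_adequacy: float, w_fluency: float) -> float:
    """Weighted sum of the adequacy and fluency scores."""
    return w_adequacy * pair.adequacy + w_fluency * pair.fluency


def select_top(pairs: List[ScoredPair], w_adequacy: float, w_fluency: float,
               keep: int) -> List[ScoredPair]:
    """Rank pairs by combined score, best first, and keep the first `keep`."""
    ranked = sorted(
        pairs,
        key=lambda p: combined_score(p, w_adequacy, w_fluency),
        reverse=True,
    )
    return ranked[:keep]


# Illustrative values only; real scores and weights come from the trained system.
pairs = [
    ScoredPair("src one", "tgt one", adequacy=0.9, fluency=0.8),
    ScoredPair("src two", "tgt two", adequacy=0.1, fluency=0.2),
    ScoredPair("src three", "tgt three", adequacy=0.5, fluency=0.9),
]
selected = select_top(pairs, w_adequacy=0.5, w_fluency=0.5, keep=2)
```

In this sketch the weights are passed in as plain numbers. How they are estimated (from synthetic data in standard Zipporah, or from the noisy corpus itself in this submission) is outside what the sketch covers.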

Topics

Natural Language Processing > Applications > Machine Translation
Machine Learning > Learning Types > Representation Learning
Machine Learning > Learning Types > Classification

Keywords

parallel corpus; data quality; sentence scoring; corpus filtering; sentence pair; adequacy score; fluency score
