Improving Evaluation of Document-level Machine Translation Quality Estimation

Yvette Graham; Qingsong Ma; Timothy Baldwin; Qun Liu; Carla Parra; Carolina Scarton

EACL 2017

Abstract

Meaningful conclusions about the relative performance of NLP systems are only possible if the gold standard employed in a given evaluation is both valid and reliable. In this paper, we explore the validity of human annotations currently employed in the evaluation of document-level quality estimation for machine translation (MT). We demonstrate the degree to which MT system rankings are dependent on weights employed in the construction of the gold standard, before proposing direct human assessment as a valid alternative. Experiments show direct assessment (DA) scores for documents to be highly reliable, achieving a correlation of above 0.9 in a self-replication experiment, in addition to a substantial estimated cost reduction through quality-controlled crowd-sourcing. The original gold standard based on post-edits incurs a 10–20 times greater cost than DA.
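The self-replication check mentioned in the abstract can be illustrated with a minimal sketch: score the same documents in two independent collection runs and correlate the per-document means. The sketch assumes Pearson's r (the abstract says only "correlation"), and the function name `pearson_r` and all score values are hypothetical, made up purely for illustration; they are not data from the paper.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x, mean_y = mean(xs), mean(ys)
    # Sum of cross-products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each list.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Hypothetical mean DA scores (0-100 scale assumed) for the same five
# documents, collected in two independent runs. Illustrative values only.
run_a = [62.0, 48.5, 75.0, 55.5, 81.0]
run_b = [60.0, 50.0, 73.5, 58.0, 79.0]

r = pearson_r(run_a, run_b)
print(f"self-replication correlation: r = {r:.3f}")
```

A value of r close to 1 would indicate that the second run reproduces the document ordering and spacing of the first, which is the sense of reliability the abstract refers to.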

Topics

Natural Language Processing > Applications > Machine Translation; Natural Language Processing > Resources & Methods > Language Modeling; Machine Learning > Optimization & Theory > Evaluation; Machine Learning > Learning Types > Evaluation

Keywords

machine translation; document-level translation; evaluation methodology; quality estimation; evaluation metrics; inter-rater reliability; human assessment; document-level evaluation; gold standard
