MS-COMET: More and Better Human Judgements Improve Metric Performance

Tom Kocmi; Hitokazu Matsushita; Christian Federmann

2022 EMNLP EMNLP 2022

MS-COMET: More and Better Human Judgements Improve Metric Performance

Abstract

AbstractWe develop two new metrics that build on top of the COMET architecture. The main contribution is collecting a ten-times larger corpus of human judgements than COMET and investigating how to filter out problematic human judgements. We propose filtering human judgements where human reference is statistically worse than machine translation. Furthermore, we average scores of all equal segments evaluated multiple times. The results comparing automatic metrics on source-based DA and MQM-style human judgement show state-of-the-art performance on a system-level pair-wise system ranking. We release both of our metrics for public use.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Mathematics & Optimization and Natural Language Processing

🧭 Keyword Pioneer — metric performance

🐣 Hot Topic Early Bird — human judgment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio