2024 EMNLP EMNLP 2024

Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems

Abstract

AbstractThis year’s MT metrics challenge set submission by DFKI expands on previous years’ linguistically motivated challenge sets. It includes 137,000 items extracted from 100 MT systems for the two language directions (English to German, English to Russian), covering more than 100 linguistically motivated phenomena organized into 14 linguistic categories. The metrics with the statistically significant best performance in our linguistically motivated analysis are MetricX-24-Hybrid and MetricX-24 for English to German, and MetricX-24 for English to Russian. Metametrics and XCOMET are in the next ranking positions in both language pairs. Metrics are more accurate in detecting linguistic errors in translations by large language models (LLMs) than in translations based on the encoder-decoder neural machine translation (NMT) architecture. Some of the most difficult phenomena for the metrics to score are the transitive past progressive, multiple connectors, and the ditransitive simple future I for English to German, and pseudogapping, contact clauses, and cleft sentences for English to Russian. Despite its overall low performance, the LLM-based metric Gemba performs best in scoring German negation errors.

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio