Automating Behavioral Testing in Machine Translation

Javier Ferrando; Matthias Sperber; Hendra Setiawan; Dominic Telaar; Saša Hasan

2023 EMNLP EMNLP 2023

Automating Behavioral Testing in Machine Translation

Abstract

AbstractBehavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is currently restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose to use Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected behavior through matching candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply our proposed evaluation framework to assess multiple available MT systems, revealing that while in general pass-rates follow the trends observable from traditional accuracy-based metrics, our method was able to uncover several important differences and potential bugs that go unnoticed when relying only on accuracy.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🐣 Hot Topic Early Bird — evaluation framework

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Javier Ferrando , Matthias Sperber , Hendra Setiawan , Dominic Telaar , Saša Hasan

Topics

Artificial Intelligence > Core AI > Interpretability Natural Language Processing > Applications > Machine Translation Artificial Intelligence > Core AI > Large Language Models

Keywords

machine translation evaluation framework behavioral testing automatic evaluation linguistic capability large language model

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023