How Reliable is Multilingual LLM-as-a-Judge?

Xiyan Fu; Wei Liu

2025 EMNLP EMNLP 2025

How Reliable is Multilingual LLM-as-a-Judge?

Abstract

AbstractLLM-as-a-Judge has emerged as a popular evaluation strategy, where advanced large language models assess generation results in alignment with human instructions. While these models serve as a promising alternative to human annotators, their reliability in multilingual evaluation remains uncertain. To bridge this gap, we conduct a comprehensive analysis of multilingual LLM-as-a-Judge. Specifically, we evaluate five models from different model families across five diverse tasks involving 25 languages. Our findings reveal that LLMs struggle to achieve consistent judgment results across languages, with an average Fleiss’ Kappa of approximately 0.3, and some models performing even worse. To investigate the cause of inconsistency, we analyze various influencing factors. We observe that consistency varies significantly across languages, with particularly poor performance in low-resource languages. Additionally, we find that neither training on multilingual data nor increasing model scale directly improves judgment consistency. These findings suggest that LLMs are not yet reliable for evaluating multilingual predictions. Our work provides valuable insights into the limitations of multilingual LLM-as-a-Judge, and sheds light on future research.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — evaluation consistency

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors

Xiyan Fu , Wei Liu

Topics

Artificial Intelligence > Core AI > Interpretability Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Natural Language Processing > Applications > Text Generation Machine Learning > Learning Types > Evaluation

Keywords

low-resource language multilingual evaluation judgment consistency evaluation reliability evaluation consistency

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025