M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models
Abstract
Multilingual capability is a crucial requirement for large multimodal models, which are increasingly deployed across diverse countries and languages. However, most existing benchmarks for multilingual multimodal reasoning fail to effectively distinguish models of different strengths; in fact, even text-only language models without visual capabilities can often achieve high scores. As a result, the comprehensive evaluation of state-of-the-art multilingual multimodal models remains underexplored. In this work, we present M4U, a novel and challenging benchmark designed to evaluate multilingual, multi-discipline multimodal understanding and reasoning. M4U comprises 10k samples spanning 64 disciplines across 16 subfields in Science, Engineering, and Healthcare, covering six languages. Using this benchmark, we conduct extensive evaluations of leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) augmented with external tools. Our results reveal that even the strongest LMMs exhibit pronounced language preferences and struggle with reasoning tasks that require integrating multilingual information across visual and textual modalities. In particular, performance drops markedly when models are prompted with cross-lingual multimodal questions, highlighting significant gaps in current multilingual multimodal reasoning capabilities.