MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX

Liuyue Xie; Avik Kuthiala; George Z Wei; Ce Zheng; Ananya Bal; Mosam Dabhi; Liting Wen; Taru Rustagi; Ethan Lai; Sushil Khyalia; Rohan Choudhury; Morteza Ziyadi; Xu Zhang; Hao Yang; László A. Jeni

AAAI 2026

Abstract

We introduce MAVERIX (Multimodal Audio-Visual Evaluation and Recognition IndeX), a unified benchmark that probes video understanding in multimodal LLMs across video, audio, and text inputs and includes human performance baselines. Although recent audiovisual models have made substantial progress, the field lacks a standardized evaluation framework for thoroughly assessing their cross-modality comprehension. MAVERIX curates 2,556 questions from 700 videos, in both multiple-choice and open-ended formats. The questions are explicitly designed to require tight integration of video and audio information, and they span a broad spectrum of agentic scenarios. MAVERIX poses questions that closely mimic the multimodal experience available to humans during decision-making. To our knowledge, MAVERIX is the first benchmark aimed explicitly at assessing comprehensive audiovisual integration at this granularity. Experiments with state-of-the-art models, including Qwen 2.5 Omni and Gemini 2.5 Flash-Lite, show accuracy of around 64%, while human experts reach near-ceiling performance of 92.8%, exposing a substantial gap to human-level comprehension. With standardized evaluation protocols, a rigorous annotation pipeline, and a public toolkit, MAVERIX establishes a challenging testbed for advancing audiovisual multimodal intelligence. The project website is publicly available.
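To make the headline numbers concrete, the following is a minimal sketch of how multiple-choice accuracy and the gap to the human baseline could be computed. It is not the MAVERIX toolkit: the `MCQItem` record, its field names, and both functions are hypothetical, and only the 92.8% human baseline and the roughly 64% model accuracy come from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MCQItem:
    """Hypothetical record for one multiple-choice question (not the official schema)."""
    question_id: str
    answer: str      # gold option letter, e.g. "B"
    prediction: str  # option letter chosen by the model


def accuracy(items: Sequence[MCQItem]) -> float:
    """Return the percentage of items whose prediction matches the gold answer."""
    if not items:
        raise ValueError("no items to score")
    correct = sum(
        1 for it in items
        if it.prediction.strip().upper() == it.answer.strip().upper()
    )
    return 100.0 * correct / len(items)


def gap_to_human(model_acc: float, human_acc: float = 92.8) -> float:
    """Percentage-point gap between the human baseline and a model's accuracy."""
    return human_acc - model_acc


if __name__ == "__main__":
    # Toy data for illustration only; these are not MAVERIX questions.
    toy = [
        MCQItem("q1", "A", "A"),
        MCQItem("q2", "C", "b"),
    ]
    print(f"toy accuracy: {accuracy(toy):.1f}%")
    print(f"gap at 64% model accuracy: {gap_to_human(64.0):.1f} points")
```

Under these assumptions, a model at about 64% accuracy sits roughly 29 percentage points below the reported human expert baseline. Open-ended questions would need a different scoring scheme, which this sketch does not cover.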

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Video Understanding

Keywords

benchmark evaluation, multimodal learning, video understanding, audiovisual integration
