A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models

Iwona Christop; Mateusz Czyżnikiewicz; Paweł Skórzewski; Łukasz Bondaruk; Jakub Kubiak; Marcin Lewandowski; Marek Kubis

EACL 2026

Abstract

Existing benchmarks for the audio modality of multimodal large language models test individual audio tasks, such as speaker diarization or gender identification, in isolation. They cannot verify whether a multimodal model can answer questions that require reasoning to combine audio tasks of different categories. To address this issue, we propose Audio Reasoning Tasks (ART), a new benchmark for assessing the ability of multimodal models to solve problems that require reasoning over audio signals.

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning

Keywords

benchmark evaluation, question answering, multimodal learning, large language model, audio reasoning
