OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Yuxuan Wang; Yueqian Wang; Bo Chen; Tong Wu; Dongyan Zhao; Zilong Zheng

2025 CVPR CVPR 2025

OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses over 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see, listen while generating.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — proactive reasoning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuxuan Wang , Yueqian Wang , Bo Chen , Tong Wu , Dongyan Zhao , Zilong Zheng

Topics

Artificial Intelligence > Core AI > Multimodal Learning Computer Vision > Processing > Video Understanding Artificial Intelligence > Core AI > Large Language Models Deep Learning > Models > Large Language Models Computer Vision > Analysis > Video Understanding Deep Learning > Learning Types > Multi-Modal Learning

Keywords

benchmark evaluation multimodal learning video understanding multimodal language model multi-modal interaction multi-modal language model streaming video large language model proactive reasoning

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025