Bringing RNNs Back to Efficient Open-Ended Video Understanding

Weili Xu; Enxin Song; Wenhao Chai; Xuexiang Wen; Tian Ye; Gaoang Wang

ICCV 2025

Abstract

The challenge of long video understanding lies in its high computational complexity and prohibitive memory cost, since the memory and computation required by Transformer-based LLMs scale quadratically with input sequence length. We propose AuroraLong to address this challenge by replacing the LLM component in MLLMs with a linear RNN language model that handles input sequences of arbitrary length with constant-size hidden states. To further increase throughput and efficiency, we combine visual token merge with linear RNN models by reordering the visual tokens by size in ascending order. Despite having only 2B parameters and being trained exclusively on public data, AuroraLong achieves performance comparable to Transformer-based models of similar size trained on private datasets across multiple video benchmarks. This demonstrates the potential of efficient linear RNNs to democratize long video understanding by lowering its computational entry barrier. To the best of our knowledge, we are the first to use a linear-RNN-based LLM backbone in a LLaVA-like model for open-ended video understanding.
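The two mechanisms named in the abstract can be illustrated with a toy sketch. This is not AuroraLong's implementation: the `Token` layout, the function names `reorder_by_size` and `linear_rnn_scan`, the `decay` parameter, and the reading of a token's "size" as the number of patches merged into it are all assumptions made for illustration. The sketch only shows that sorting merged tokens by size in ascending order is a simple stable sort, and that a linear recurrence keeps a state whose size does not grow with the number of tokens.

```python
from typing import List, Sequence, Tuple

# Hypothetical layout: (feature vector, size), where size is assumed to be
# the number of original patches merged into this token.
Token = Tuple[List[float], int]


def reorder_by_size(tokens: Sequence[Token]) -> List[Token]:
    """Sort merged visual tokens by size, smallest first (ties keep their order)."""
    return sorted(tokens, key=lambda t: t[1])


def linear_rnn_scan(features: Sequence[Sequence[float]], decay: float = 0.9) -> List[float]:
    """Toy element-wise linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    The state h has the length of one feature vector, however many tokens are scanned.
    """
    if not features:
        return []
    h = [0.0] * len(features[0])
    for x in features:
        h = [decay * h_i + (1.0 - decay) * x_i for h_i, x_i in zip(h, x)]
    return h


tokens = [([1.0, 0.0], 4), ([0.0, 1.0], 1), ([1.0, 1.0], 2)]
ordered = reorder_by_size(tokens)
state = linear_rnn_scan([feat for feat, _ in ordered])
print([size for _, size in ordered])  # [1, 2, 4]
print(len(state))  # 2
```

Scanning 3 tokens or 3,000 tokens leaves `state` at the same length, which is the constant-memory property the abstract contrasts with the quadratic cost of attention.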

Topics

Machine Learning > Application Areas > Efficient Computing
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks
Computer Vision > Processing > Video Understanding
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

multimodal learning; video understanding; multimodal large language model; linear RNN; model efficiency; large language model; long video processing; visual token merge
