MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

Haoran Tang; Meng Cao; Jinfa Huang; Ruyang Liu; Peng Jin; Ge Li; Xiaodan Liang

2025 AAAI AAAI 2025

MUSE: Mamba Is Efficient Multi-scale Learner for Text-video Retrieval

Abstract

Abstract Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to CLIP's inherent plain structure, few TVR methods explore the multi-scale representations which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — cross-resolution modeling

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Speech & Audio

Authors

Haoran Tang , Meng Cao , Jinfa Huang , Ruyang Liu , Peng Jin , Ge Li , Xiaodan Liang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Natural Language Processing > Applications > Information Retrieval

Keywords

multi-scale representation text-video retrieval feature pyramid cross-resolution modeling

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025