MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

Bo He; Hengduo Li; Young Kyun Jang; Menglin Jia; Xuefei Cao; Ashish Shah; Abhinav Shrivastava; Ser-Nam Lim

CVPR 2024

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, which restricts them to short video understanding. In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets.
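The online, memory-bank style of processing described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the memory bank stays within a fixed size, so the code below assumes one possible strategy: when the bank exceeds its capacity, the two most similar adjacent entries are averaged into one. The names `MemoryBank`, `capacity`, `add`, and `cosine` are hypothetical, and per-frame features are represented as plain lists of floats rather than real visual encoder outputs.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryBank:
    """Bounded store of past frame features (illustrative assumption only)."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.entries = []  # oldest first, temporal order preserved

    def add(self, feature):
        """Append the newest frame feature, compressing if over capacity."""
        self.entries.append(list(feature))
        if len(self.entries) > self.capacity:
            self._compress()

    def _compress(self):
        # Assumed strategy: merge the most similar pair of temporally
        # adjacent entries by averaging, so the bank shrinks by one.
        sims = [
            cosine(self.entries[i], self.entries[i + 1])
            for i in range(len(self.entries) - 1)
        ]
        i = max(range(len(sims)), key=sims.__getitem__)
        merged = [(x + y) / 2 for x, y in zip(self.entries[i], self.entries[i + 1])]
        self.entries[i:i + 2] = [merged]


# Frames arrive one at a time; the bank never holds more than `capacity` entries.
bank = MemoryBank(capacity=2)
for frame_feature in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    bank.add(frame_feature)
```

Because the bank size is fixed, the amount of historical context passed to later stages stays constant regardless of video length, which is the property the abstract highlights with respect to context length and GPU memory.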

Topics

Artificial Intelligence > Core AI > Memory
Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning; video understanding; large multimodal model; vision-language model; memory bank; large language model; long-term video
