Streaming VideoLLMs for Real-Time Procedural Video Understanding

Dibyadip Chatterjee; Edoardo Remelli; Yale Song; Bugra Tekin; Abhay Mittal; Bharat Bhatnagar; Necati Cihan Camgoz; Shreyas Hampali; Eric Sauser; Shugao Ma; Angela Yao; Fadime Sener

ICCV 2025

Abstract

We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens: verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces the token count by 22x over existing methods when representing one hour of long-term observations, while still effectively encoding the fine granularity of the present. By interleaving these tokens in our multimodal cache, ProVideLLM achieves sub-linear scaling of memory and compute with video length, ensuring per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.
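
To make the cache idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a cache that keeps long-term history as short text summaries and only a small window of recent frames as visual tokens. The class name `MultimodalCache`, its methods, the window size, and the string placeholders standing in for tokens are all hypothetical; no real encoder (such as DETR-QFormer) or language model is involved.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MultimodalCache:
    """Toy cache: long-term text summaries plus a short-term visual window.

    Hypothetical illustration only; tokens are plain strings, not embeddings.
    """
    max_visual_frames: int = 4  # assumed short-term window size
    text_tokens: list = field(default_factory=list)
    visual_frames: deque = field(default_factory=deque)

    def add_frame(self, frame_tokens):
        """Append one frame's visual tokens; evict the oldest frame if the window is full."""
        self.visual_frames.append(list(frame_tokens))
        while len(self.visual_frames) > self.max_visual_frames:
            self.visual_frames.popleft()

    def verbalize(self, summary_tokens):
        """Replace the current visual window with a compact text summary."""
        self.text_tokens.extend(summary_tokens)
        self.visual_frames.clear()

    def context(self):
        """Return text summaries followed by the recent visual tokens."""
        visual = [tok for frame in self.visual_frames for tok in frame]
        return self.text_tokens + visual

    def num_tokens(self):
        """Total number of tokens currently held in the cache."""
        return len(self.context())


# Usage: stream five frames of two tokens each through a two-frame window.
cache = MultimodalCache(max_visual_frames=2)
for i in range(5):
    cache.add_frame([f"v{i}_0", f"v{i}_1"])
tokens_before = cache.num_tokens()   # only the last two frames are kept

cache.verbalize(["crack", "egg"])    # summarize the finished step as text
cache.add_frame(["v5_0", "v5_1"])    # continue streaming
tokens_after = cache.num_tokens()
```

In this sketch the visual part of the cache is bounded by the window size, so the total token count grows only with the length of the accumulated text summaries rather than with the number of frames seen. That is the intuition behind the sub-linear scaling the abstract describes, under the assumptions stated above.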

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Video Understanding
Deep Learning > Models > Large Language Models
Machine Learning > Learning Types > Multimodal Learning

Keywords

multimodal learning, video understanding, streaming inference, real-time inference, large language model, procedural video

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025