UnLoc: A Unified Framework for Video Localization Tasks

Shen Yan; Xuehan Xiong; Arsha Nagrani; Anurag Arnab; Zhonghao Wang; Weina Ge; David Ross; Cordelia Schmid

2023 ICCV ICCV 2023

UnLoc: A Unified Framework for Video Localization Tasks

Abstract

While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code is available at: https://github.com/google-research/scenic.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — video-text fusion

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Shen Yan , Xuehan Xiong , Arsha Nagrani , Anurag Arnab , Zhonghao Wang , Weina Ge , David Ross , Cordelia Schmid

Topics

Computer Vision > Analysis > Action Recognition Computer Vision > Analysis > Video Understanding Deep Learning > Learning Types > Multimodal Learning Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

video understanding moment retrieval action segmentation temporal localization video localization video-text fusion

Download PDF

Related papers

PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework 2023

Periodically Exchange Teacher-Student for Source-Free Object Detection 2023

Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations 2023

Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles 2023

3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation 2023