VicTR: Video-conditioned Text Representations for Activity Recognition

Kumara Kahatapitiya; Anurag Arnab; Arsha Nagrani; Michael S. Ryoo

CVPR 2024

Abstract

Vision-Language models (VLMs) have excelled in the image domain, especially in zero-shot settings, thanks to the availability of vast pretraining data (i.e., paired image-text samples). For videos, however, such paired data is not as abundant. Video-VLMs are therefore usually designed by adapting pretrained image-VLMs to the video domain instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image → video), often keeping text embeddings unchanged or even discarding them. In this paper, we argue the contrary: better video-VLMs can be designed by focusing more on augmenting text rather than visual information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely available semantic information in the form of visually grounded auxiliary text (e.g., object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400), and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
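
To make the idea of text embeddings "optimized w.r.t. visual embeddings" more concrete, the following is a minimal illustrative sketch, not the paper's architecture. The abstract does not specify how the conditioning is done, so the attention-style mixing, the `alpha` weight, the mean-pooled video embedding, and all function names here are assumptions chosen only for illustration. The sketch shifts each class-text embedding toward a similarity-weighted summary of the frame embeddings, then scores classes by cosine similarity in the shared space.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condition_text_on_video(text_emb, frame_embs, alpha=0.5):
    """Hypothetical conditioning step (an assumption, not the paper's method).

    The text embedding attends over the frame embeddings, and the resulting
    visual context is mixed back into the text embedding with weight alpha.
    """
    t = l2_normalize(text_emb)
    frames = [l2_normalize(f) for f in frame_embs]
    weights = softmax([dot(t, f) for f in frames])
    context = [sum(w * f[i] for w, f in zip(weights, frames)) for i in range(len(t))]
    mixed = [(1 - alpha) * t[i] + alpha * context[i] for i in range(len(t))]
    return l2_normalize(mixed)


def classify(frame_embs, class_text_embs):
    """Pick the class whose video-conditioned text embedding best matches the video."""
    dim = len(frame_embs[0])
    # Assumed video embedding: mean of the raw frame embeddings, then normalized.
    video = l2_normalize(
        [sum(f[i] for f in frame_embs) / len(frame_embs) for i in range(dim)]
    )
    scores = {
        name: dot(video, condition_text_on_video(emb, frame_embs))
        for name, emb in class_text_embs.items()
    }
    return max(scores, key=scores.get), scores
```

With toy inputs such as `classify([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]], {"run": [1.0, 0.0, 0.0], "swim": [0.0, 1.0, 0.0]})`, the text embedding for each class is no longer fixed: it depends on the video being classified, which is the property the abstract emphasizes. In a real system the embeddings would come from pretrained image-text encoders and the conditioning would be learned rather than hand-set.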

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Machine Learning > Core Methods > Representation Learning

Machine Learning > Learning Types > Zero-Shot Learning

Keywords

contrastive learning; text representation; video understanding; activity recognition; vision-language model
