GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization

Onkar Kishor Susladkar; Gayatri Sudhir Deshmukh; Vandan Gorade; Sparsh Mittal

EMNLP 2024

Abstract

Zero-shot temporal action localization (TAL) aims to localize actions in time within videos without any training examples of those actions. To address the challenges of TAL, we propose GRIZAL, a model that uses multimodal embeddings and dynamic motion cues to localize actions effectively. GRIZAL achieves sample diversity by using large-scale generative models: GPT-4 generates textual augmentations and DALL-E generates image augmentations. Our model integrates vision-language embeddings with optical-flow cues and is optimized with a combination of supervised and self-supervised loss functions. On the ActivityNet, THUMOS14, and Charades-STA datasets, GRIZAL substantially outperforms state-of-the-art zero-shot TAL models, demonstrating its robustness and adaptability across a wide range of video content. We will release all models and code publicly.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning
Computer Vision > Analysis > Action Recognition
Deep Learning > Learning Types > Zero-Shot Learning
Deep Learning > Models > Generative Models
Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning; video understanding; optical flow; generative model; vision-language model; temporal action localization; multimodal embedding
