LoGAN: Latent Graph Co-Attention Network for Weakly-Supervised Video Moment Retrieval

Reuben Tan; Huijuan Xu; Kate Saenko; Bryan A. Plummer

WACV 2021

Abstract

The goal of weakly-supervised video moment retrieval is to localize the video segment most relevant to a description without access to temporal annotations during training. Prior work uses co-attention mechanisms to model relationships between the vision and language data, but it lacks contextual information between video frames that can help determine how well a segment relates to the query. To address this, we propose an efficient Latent Graph Co-Attention Network (LoGAN) that exploits fine-grained frame-by-word interactions to jointly reason about the correspondences between all possible pairs of frames, providing context cues absent in prior work. Experiments on the DiDeMo and Charades-STA datasets demonstrate the effectiveness of our approach: we improve Recall@1 by 5-20% over prior weakly-supervised methods, and even achieve an 11% gain over strongly-supervised methods on DiDeMo, while using significantly fewer model parameters than other co-attention mechanisms.
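The frame-by-word interaction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the function name `frame_by_word_attention`, the dot-product similarity, and the toy 2-dimensional features are all assumptions made for illustration, and the latent graph reasoning over pairs of frames is not modeled here. The sketch only shows how each frame might attend over the words of a query to obtain a word-conditioned representation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def frame_by_word_attention(frames, words):
    """Hypothetical helper: for each frame, attend over all query words.

    Returns one (weights, context) pair per frame, where `weights` is the
    attention distribution over words and `context` is the attention-weighted
    sum of word vectors. Dot-product similarity is an assumption.
    """
    dim = len(words[0])
    attended = []
    for frame in frames:
        scores = [dot(frame, word) for word in words]
        weights = softmax(scores)
        context = [
            sum(wt * word[d] for wt, word in zip(weights, words))
            for d in range(dim)
        ]
        attended.append((weights, context))
    return attended


# Toy features (assumed): three frames and two words in a shared 2-D space.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
result = frame_by_word_attention(frames, words)
```

In this toy setup, the first frame places more attention on the first word, the second frame on the second word, and the third frame splits its attention evenly, since it is equally similar to both words.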

Authors

Reuben Tan, Huijuan Xu, Kate Saenko, Bryan A. Plummer

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Weakly Supervised Learning

Keywords

weakly supervised learning, multimodal learning, latent graph co-attention mechanism, video moment retrieval
