Relation-aware Video Reading Comprehension for Temporal Language Grounding

Jialin Gao; Xin Sun; Mengmeng Xu; Xi Zhou; Bernard Ghanem

EMNLP 2021


Abstract

Temporal language grounding in videos aims to localize the temporal span relevant to a given query sentence. Previous methods treat it either as a boundary regression task or as a span extraction task. This paper formulates temporal language grounding as video reading comprehension and proposes a Relation-aware Network (RaNet) to address it. The framework selects a video moment choice from a predefined answer set with the aid of coarse-and-fine choice-query interaction and choice-choice relation construction. A choice-query interactor is proposed to match the visual and textual information simultaneously at the sentence-moment and token-moment levels, leading to a coarse-and-fine cross-modal interaction. Moreover, a novel multi-choice relation constructor is introduced that leverages graph convolution to capture the dependencies among video moment choices for the best choice selection. Extensive experiments on ActivityNet-Captions, TACoS, and Charades-STA demonstrate the effectiveness of our solution. Code will be available at https://github.com/Huntersxsx/RaNet.
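The multi-choice formulation described in the abstract can be illustrated with a toy sketch. This is not the RaNet implementation: every function name here is hypothetical, the features are plain Python lists, the choice-query interaction is reduced to a sentence-level element-wise product (the token-level branch is omitted), and the relation step is a parameter-free neighbor average over temporally overlapping choices, standing in for a learned graph convolution.

```python
def enumerate_choices(num_clips):
    """Answer set (assumed form): every (start, end) clip span with start <= end."""
    return [(s, e) for s in range(num_clips) for e in range(s, num_clips)]


def mean_pool(clip_feats, span):
    """Represent a moment choice by the mean of its clip features."""
    s, e = span
    n = e - s + 1
    dim = len(clip_feats[0])
    return [sum(clip_feats[t][d] for t in range(s, e + 1)) / n for d in range(dim)]


def overlaps(a, b):
    """Two spans are neighbors in the choice graph if they share a clip."""
    return a[0] <= b[1] and b[0] <= a[1]


def relation_step(choice_feats, choices):
    """Graph-convolution-style step without learned weights (an assumption):
    each choice takes the average feature of all overlapping choices,
    itself included."""
    dim = len(choice_feats[0])
    out = []
    for ci in choices:
        nbrs = [j for j, cj in enumerate(choices) if overlaps(ci, cj)]
        out.append([sum(choice_feats[j][d] for j in nbrs) / len(nbrs)
                    for d in range(dim)])
    return out


def select_moment(clip_feats, query_feat):
    """Score every choice against the query and return the best span."""
    choices = enumerate_choices(len(clip_feats))
    pooled = [mean_pool(clip_feats, c) for c in choices]
    # Coarse (sentence-level) choice-query interaction: element-wise product.
    fused = [[f * q for f, q in zip(feat, query_feat)] for feat in pooled]
    related = relation_step(fused, choices)
    scores = [sum(r) for r in related]
    best = max(range(len(choices)), key=scores.__getitem__)
    return choices[best], scores


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]   # two clips, 2-d toy features
    query = [0.0, 1.0]                 # query aligned with the second clip
    print(select_moment(clips, query)[0])
```

In this toy setup the answer set grows quadratically with the number of clips, which is why the sketch builds the overlap graph explicitly: the relation step lets each candidate's score be informed by the candidates around it before the final selection.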



Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Architectures > Graph Neural Networks; Computer Vision > Analysis > Video Understanding

Keywords

graph convolution; graph convolution network; video moment localization; cross-modal interaction; temporal language grounding; video reading comprehension; moment selection
