Localizing Natural Language in Videos

Jingyuan Chen; Lin Ma; Xinpeng Chen; Zequn Jie; Jiebo Luo

AAAI 2019

Abstract

In this paper, we consider the task of natural language video localization (NLVL): given an untrimmed video and a natural language description, the goal is to localize the segment of the video that semantically corresponds to the description. We propose a localizing network (LNet), which works in an end-to-end fashion, to tackle the NLVL task. We first match the natural sentence and the video sequence with cross-gated attended recurrent networks to exploit their fine-grained interactions and generate a sentence-aware video representation. A self interactor is then proposed to perform cross-frame matching, which dynamically encodes and aggregates the matching evidence. Finally, a boundary model is proposed to locate the video segment corresponding to the natural sentence description by predicting the starting and ending points of the segment. Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model is both effective and efficient compared with state-of-the-art approaches.
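The last step of the pipeline, predicting a starting and an ending point, can be illustrated with a minimal sketch. The abstract does not say how the two predictions are combined, so the decoding rule below (pick the pair that maximizes the product of start and end scores, subject to start <= end) is an assumption borrowed from common span-prediction practice, not the paper's stated method. The function name `best_segment` and the optional `max_len` limit are likewise hypothetical.

```python
from typing import List, Optional, Tuple


def best_segment(
    start_scores: List[float],
    end_scores: List[float],
    max_len: Optional[int] = None,
) -> Tuple[int, int]:
    """Return the (start, end) frame indices with the highest joint score.

    Assumption: the joint score of a segment is start_scores[s] * end_scores[e],
    and only segments with s <= e are valid. `max_len`, if given, caps the
    number of frames in a segment. Ties keep the earliest segment found.
    """
    if not start_scores or len(start_scores) != len(end_scores):
        raise ValueError("score lists must be non-empty and of equal length")

    n = len(start_scores)
    best = (0, 0)
    best_score = float("-inf")
    for s in range(n):
        # Last candidate end index (exclusive) for this start position.
        stop = n if max_len is None else min(n, s + max_len)
        for e in range(s, stop):
            score = start_scores[s] * end_scores[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best


# Toy per-frame scores: the independent argmax of each list would give
# start=1, end=0, which is not a valid segment.
start = [0.1, 0.6, 0.2, 0.1]
end = [0.5, 0.1, 0.1, 0.3]
print(best_segment(start, end))  # (1, 3)
```

Searching over pairs jointly, rather than taking the argmax of each score list independently, guarantees that the returned segment never ends before it starts. The exhaustive search is quadratic in the number of frames, which is acceptable for a sketch but may need a length cap such as `max_len` on long videos.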

Authors

Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, Jiebo Luo

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Machine Learning > Core Methods > Representation Learning

Deep Learning > Architectures > Neural Networks

Natural Language Processing > Applications > Natural Language Inference

Machine Learning > Learning Types > Multi-Modal Learning

Computer Vision > Analysis > Video Understanding

Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

attention mechanism, video understanding, natural language understanding, boundary detection, natural language, cross-modal matching, temporal localization, video localization, natural language video localization, sentence-aware representation, boundary prediction, sentence-aware video representation, cross-gated attention
