Streamlined Dense Video Captioning

Jonghwan Mun; Linjie Yang; Zhou Ren; Ning Xu; Bohyung Han

CVPR 2019

Abstract

Dense video captioning is an extremely challenging task, since an accurate and coherent description of the events in a video requires a holistic understanding of the video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent, because these approaches fail to consider the temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework that explicitly models temporal dependency across events in a video and leverages visual and linguistic context from prior events for coherent storytelling. This objective is achieved by 1) integrating an event sequence generation network to select a sequence of event proposals adaptively, and 2) feeding the sequence of event proposals to our sequential video captioning network, which is trained by reinforcement learning with two-level rewards (at both the event and episode levels) for better context modeling. The proposed technique achieves outstanding performance on the ActivityNet Captions dataset in most metrics.
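The abstract does not spell out how the event-level and episode-level rewards are combined, so the following is only a minimal sketch under stated assumptions, not the authors' formulation. It assumes each event caption has a scalar score (for example, a captioning metric), the whole episode has one shared score, and the two are mixed with a hypothetical weight `episode_weight`. The function names `two_level_rewards` and `policy_gradient_loss` are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def two_level_rewards(
    event_scores: Sequence[float],
    episode_score: float,
    episode_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> List[float]:
    """Blend each event's own score with one score shared by the whole episode."""
    if not 0.0 <= episode_weight <= 1.0:
        raise ValueError("episode_weight must lie in [0, 1]")
    return [
        (1.0 - episode_weight) * score + episode_weight * episode_score
        for score in event_scores
    ]


def policy_gradient_loss(
    log_probs: Sequence[float],
    rewards: Sequence[float],
    baseline: float = 0.0,  # assumed constant baseline for variance reduction
) -> float:
    """REINFORCE-style surrogate loss: -sum((reward - baseline) * log_prob)."""
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    return -sum((r - baseline) * lp for lp, r in zip(log_probs, rewards))


if __name__ == "__main__":
    # Two captioned events: the first scored well on its own, the second did not.
    rewards = two_level_rewards([1.0, 0.0], episode_score=0.5)
    # Log-probabilities of the sampled captions under the captioning policy.
    loss = policy_gradient_loss([-1.0, -2.0], rewards)
    print(rewards, loss)
```

Under this sketch, a caption that scores poorly on its own still receives credit when the episode as a whole reads well, which is one plausible way for an episode-level term to encourage coherence across events.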

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Generation > Video Generation
Natural Language Processing > Generation > Text Generation
Reinforcement Learning > Methods > Deep RL
Deep Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning, context modeling, sequential modeling, temporal dependency, dense video captioning, event proposal, event proposal generation, sequential captioning
