TVT: Two-View Transformer Network for Video Captioning

Ming Chen; Yingming Li; Zhongfei Zhang; Siyu Huang

ACML 2018

Abstract

Video captioning is the task of automatically generating a natural-language description of a given video. Within an encoder-decoder framework, it poses two main challenges: 1) how to model the sequential information; and 2) how to combine the modalities, including video and text. For challenge 1), methods based on recurrent neural networks (RNNs) are currently the most common approach for learning temporal representations of videos, but they suffer from a high computational cost. For challenge 2), the features of different modalities are often simply concatenated without further analysis. In this paper, we introduce a novel video captioning framework, the Two-View Transformer (TVT). TVT comprises a Transformer backbone for sequential representation and two types of fusion blocks in the decoder layers for combining different modalities effectively. Empirical study shows that TVT outperforms state-of-the-art methods on the MSVD dataset and achieves competitive performance on the MSR-VTT dataset under four common metrics.
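To make the idea of attention-based fusion concrete, the following is a minimal sketch in plain Python, not the paper's implementation. It assumes two feature sequences ("views") of a video and a single decoder state; the function names `attend` and `fuse_two_views`, the fixed mixing weight `alpha`, and the toy feature values are all hypothetical. The abstract does not specify how the two fusion blocks are built, so the weighted sum here is only an illustrative stand-in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def fuse_two_views(query, view_a, view_b, alpha=0.5):
    """Hypothetical fusion: attend to each view separately, then mix the
    two context vectors with a fixed weight alpha (an assumption, not the
    fusion block described in the paper)."""
    ctx_a = attend(query, view_a, view_a)
    ctx_b = attend(query, view_b, view_b)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(ctx_a, ctx_b)]


# Toy inputs: three vectors for one view, two for the other (made-up values).
view_one = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
view_two = [[0.5, 0.5], [0.2, 0.8]]
decoder_state = [1.0, 0.0]

fused = fuse_two_views(decoder_state, view_one, view_two)
print(fused)
```

Because each context vector is a convex combination of the input features, the fused vector stays within the range of those features. Unlike an RNN, nothing here depends on processing the sequence step by step, which is the property that lets attention-based models avoid recurrent computation.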

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Video Generation
Natural Language Processing > Generation > Text Generation
Deep Learning > Models > Transformers

Keywords

video captioning, natural language generation, multimodal learning, sequence representation, video understanding, transformer network, encoder-decoder architecture, transformer model
