Generating Natural Video Descriptions via Multimodal Processing

Qin Jin; Junwei Liang; Xiaozhu Lin

INTERSPEECH 2016

Abstract

Generating natural language descriptions of visual content is an intriguing task with wide applications, such as assisting blind people. Recent advances in image captioning have stimulated deeper study of this task, including generating natural descriptions for videos. Most work on video description generation focuses on the visual information in the video. However, audio also provides rich information for describing video content. In this paper, we propose to generate video descriptions in natural sentences via multimodal processing, that is, by using both audio and visual cues in unified deep neural networks with both convolutional and recurrent structures. Experimental results on the Microsoft Research Video Description (MSVD) corpus show that fusing audio information greatly improves video description performance. We also investigate the impact of the number of images versus the number of captions on image captioning performance and observe that, when only a limited amount of training data is available, the number of distinct captions matters more than the number of distinct images. This finding will guide our future investigation of how to improve the video description system by increasing the amount of training data.
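The abstract does not say how the audio and visual cues are combined, so the following is only a minimal sketch of one common option, feature-level (early) fusion, and not the authors' method. It assumes each video has already been reduced to one visual feature vector and one audio feature vector; the function names, feature sizes, and values are hypothetical and chosen for illustration only.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def fuse_features(visual: Sequence[float], audio: Sequence[float]) -> List[float]:
    """Early fusion (an assumption, not confirmed by the paper):
    normalize each modality separately so neither dominates by scale,
    then concatenate them into a single vector."""
    return l2_normalize(visual) + l2_normalize(audio)


# Toy stand-ins for per-video features (hypothetical values and sizes).
visual_feat = [3.0, 4.0]        # e.g. pooled convolutional-network activations
audio_feat = [0.0, 2.0, 0.0]    # e.g. pooled acoustic descriptors

fused = fuse_features(visual_feat, audio_feat)
print(len(fused))  # dimensionality is the sum of the two modalities
```

In a full system, a fused vector like this would typically condition a recurrent sentence decoder; that part is omitted here because the abstract gives no details about it.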
