Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Ke Lin; Zhuoxin Gan; Liwei Wang

2021 AAAI AAAI 2021

Augmented Partial Mutual Learning with Frame Masking for Video Captioning

Abstract

Abstract Recent video captioning work improves greatly due to the invention of various elaborate model architectures. If multiple captioning models are combined into a unified framework not only by simple more ensemble, and each model can benefit from each other, the final captioning might be boosted further. Jointly training of multiple model have not been explored in previous works. In this paper, we propose a novel Augmented Partial Mutual Learning (APML) training method where multiple decoders are trained jointly with mimicry losses between different decoders and different input variations. Another problem of training captioning model is the "one-to-many" mapping problem which means that one identical video input is mapped to multiple caption annotations. To address this problem, we propose an annotation-wise frame masking approach to convert the "one-to-many" mapping to "one-to-one" mapping. The experiments performed on MSR-VTT and MSVD datasets demonstrate our proposed algorithm achieves the state-of-the-art performance.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — frame masking

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ke Lin , Zhuoxin Gan , Liwei Wang

Topics

Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Learning Types > Semi-Supervised Learning Computer Vision > Generation > Image Captioning Computer Vision > Generation > Video Generation Natural Language Processing > Generation > Text Generation Deep Learning > Learning Types > Multi-Task Learning

Keywords

ensemble learning video captioning self-supervised learning mutual learning neural network frame masking

Download PDF

Related papers

Contextual Conditional Reasoning 2021

Attention Beam: An Image Captioning Approach (Student Abstract) 2021

Movie Summarization via Sparse Graph Construction 2021

Text Analysis for Understanding Symptoms of Social Anxiety in Student Veterans 2021

ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs 2021