An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos

Sicheng Zhao; Yunsheng Ma; Yang Gu; Jufeng Yang; Tengfei Xing; Pengfei Xu; Runbo Hu; Hua Chai; Kurt Keutzer

AAAI 2020

Abstract

Emotion recognition in user-generated videos plays an important role in human-centered computing. Existing methods mainly employ a traditional two-stage shallow pipeline, i.e., extracting visual and/or audio features and then training classifiers. In this paper, we propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs). Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN. Further, we design a special classification loss, i.e., a polarity-consistent cross-entropy loss, based on the polarity-emotion hierarchy constraint to guide the attention generation. Extensive experiments conducted on the challenging VideoEmotion-8 and Ekman-6 datasets demonstrate that the proposed VAANet outperforms the state-of-the-art approaches for video emotion recognition. Our source code is released at: https://github.com/maysonma/VAANet.
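The abstract names a polarity-consistent cross-entropy loss but does not give its form. The following is only a minimal sketch of one plausible reading, assuming the ordinary cross-entropy is scaled up when the predicted emotion falls in a different polarity than the ground truth. The emotion-to-polarity map, the `penalty` weight, and the function names are hypothetical and are not taken from the released code.

```python
import math

# Hypothetical emotion-to-polarity map (an assumption; the abstract lists none).
POLARITY = {
    "joy": "positive",
    "trust": "positive",
    "anger": "negative",
    "fear": "negative",
}
CLASSES = list(POLARITY)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def polarity_consistent_ce(logits, target, penalty=0.5):
    """Cross-entropy, scaled by (1 + penalty) on a polarity mismatch.

    `penalty` is an assumed hyperparameter; penalty=0 gives plain cross-entropy.
    """
    probs = softmax(logits)
    ce = -math.log(probs[CLASSES.index(target)])
    predicted = CLASSES[probs.index(max(probs))]
    mismatch = POLARITY[predicted] != POLARITY[target]
    return ce * (1.0 + penalty) if mismatch else ce


if __name__ == "__main__":
    logits = [2.0, 0.0, 0.0, 0.0]  # the model favors "joy"
    print(polarity_consistent_ce(logits, "trust"))  # same polarity, no penalty
    print(polarity_consistent_ce(logits, "anger"))  # opposite polarity, penalized
```

Under this sketch, a wrong prediction within the correct polarity costs less than one that crosses polarities, which is the kind of hierarchy constraint the abstract describes.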

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Architectures > Neural Networks
Computer Vision > Processing > Video Understanding
Interdisciplinary > Social > Affective Computing
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Multimodal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

attention mechanism, multimodal learning, emotion recognition, video understanding, convolutional neural network, cross-entropy loss, visual-audio processing, video emotion recognition, visual-audio attention, video emotion
