Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes

Koichiro Ito; Takuya Fujioka; Qinghua Sun; Kenji Nagamatsu

INTERSPEECH 2021

Abstract

In this paper, we propose an audio-visual speech emotion recognition (AV-SER) model that suppresses the disturbance from the identity attribute by disentangling the emotion attribute from the identity attribute. The model first disentangles the two attributes for each modality using a co-attention module. It extracts the emotion attribute by giving the identity attribute to the module as conditional features; conversely, it extracts the identity attribute with the emotion attribute as the condition. The model then predicts each attribute from these disentangled features by considering both modalities. In addition, to ensure the disentanglement capacity of the model, we train it alternately on an identification task (auxiliary) and an SER task (primary), updating only the parameters responsible for the current task. Experimental results on the in-the-wild CMU-MOSEI dataset show the effectiveness of our method.
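The alternating training scheme described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the two parameter groups ("emotion" and "identity"), the quadratic loss, the learning rate, and the names ToyModel and train_alternately are all assumptions made for illustration. It shows only the scheduling idea, namely that each step serves one task and touches only that task's parameters.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    # Hypothetical parameter groups; the abstract does not specify
    # how the real model partitions its parameters.
    params: dict = field(
        default_factory=lambda: {"emotion": [0.0, 0.0], "identity": [0.0, 0.0]}
    )


def gradient(values, target):
    """Gradient of the toy loss sum((v - t)^2) with respect to each value."""
    return [2.0 * (v - t) for v, t in zip(values, target)]


def train_alternately(model, targets, steps, lr=0.1):
    """Alternate between the two tasks, updating only the active group."""
    schedule = ["emotion", "identity"]  # assumed order: primary, then auxiliary
    history = []
    for step in range(steps):
        task = schedule[step % 2]
        grads = gradient(model.params[task], targets[task])
        # Only the active task's parameters change; the other group is frozen.
        model.params[task] = [
            v - lr * g for v, g in zip(model.params[task], grads)
        ]
        history.append(task)
    return history


if __name__ == "__main__":
    model = ToyModel()
    targets = {"emotion": [1.0, -1.0], "identity": [2.0, 0.5]}
    train_alternately(model, targets, steps=40)
    print(model.params)
```

In a real deep learning framework, the same effect is typically obtained by keeping a separate optimizer per parameter group or by toggling which parameters receive gradients at each step.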

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Keywords

feature disentanglement; emotion classification; cross-modal attention; audio-visual emotion recognition; identity attribute
