Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition

Yuhua Wang; Guang Shen; Yuezhu Xu; Jiahang Li; Zhengdao Zhao

INTERSPEECH 2021

Abstract

Various studies have confirmed the necessity and benefits of leveraging multimodal features for speech emotion recognition (SER), and recent results show that the temporal information captured by the transformer is very useful for improving multimodal SER. However, the dependency between different modalities and high-level temporal-feature learning with a deeper transformer have yet to be investigated. Thus, we propose a multimodal transformer with shared weights for speech emotion recognition. The proposed network shares weights across the modalities in each transformer layer to learn the correlation among multiple modalities. In addition, since the emotion contained in speech generally involves audio and text features, both of which have not only internal dependence but also mutual dependence, we design a deep multimodal attention mechanism to capture these two kinds of emotional dependence. We evaluated our model on the publicly available IEMOCAP dataset. The experimental results demonstrate that the proposed model yields promising results.
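
The abstract does not spell out the layer design, so the following is only a minimal sketch of the two ideas it names: one set of attention weights reused for both modalities, and attention applied both within a modality (internal dependence) and across modalities (mutual dependence). Everything here is an assumption made for illustration: the class `SharedAttention`, the function `shared_layer`, the single attention head, the residual sum, and the premise that audio and text features have already been projected to the same dimension. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class SharedAttention:
    """Single-head scaled dot-product attention (hypothetical helper).

    The same projection matrices are used no matter which modality
    supplies the queries or the context: that is the weight sharing.
    """

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.w_q = rng.normal(0.0, scale, size=(dim, dim))
        self.w_k = rng.normal(0.0, scale, size=(dim, dim))
        self.w_v = rng.normal(0.0, scale, size=(dim, dim))

    def __call__(self, query_seq, context_seq):
        q = query_seq @ self.w_q
        k = context_seq @ self.w_k
        v = context_seq @ self.w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = softmax(scores, axis=-1)  # each row sums to 1
        return weights @ v, weights


def shared_layer(attn, audio, text):
    """One illustrative layer: intra- and inter-modal attention, shared weights."""
    # Internal dependence: each modality attends to itself.
    audio_self, _ = attn(audio, audio)
    text_self, _ = attn(text, text)
    # Mutual dependence: each modality attends to the other.
    audio_cross, _ = attn(audio, text)
    text_cross, _ = attn(text, audio)
    # Assumed combination rule: residual sum of both attention outputs.
    return audio + audio_self + audio_cross, text + text_self + text_cross


rng = np.random.default_rng(0)
attn = SharedAttention(dim=16, rng=rng)
audio = rng.normal(size=(50, 16))  # 50 audio frames, 16-dim features (made up)
text = rng.normal(size=(12, 16))   # 12 word tokens, 16-dim features (made up)
audio_out, text_out = shared_layer(attn, audio, text)
print(audio_out.shape, text_out.shape)
```

Because the two sequences keep their own lengths, layers of this kind can be stacked, which is one plausible reading of the "deeper transformer" mentioned in the abstract.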

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

attention mechanism, multimodal learning, multimodal transformer, speech emotion recognition, mutual correlation
