End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model

Han Feng; Sei Ueno; Tatsuya Kawahara

INTERSPEECH 2020

Abstract

In this paper, we propose a speech emotion recognition (SER) model combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are primarily used for SER, textual features are also useful, but they are error-prone, especially for emotional speech. To solve this problem, we integrate the ASR model and the SER model in an end-to-end manner by using an acoustic-to-word model. Specifically, we combine the decoder states of the ASR model with the acoustic features and feed them into the SER model. On top of a recurrent network that learns features from this input, we adopt a self-attention mechanism to focus on important feature frames. Finally, we fine-tune the ASR model on the new dataset using multi-task learning to jointly optimize the ASR and SER tasks. Our model achieves 68.63% weighted accuracy (WA) and 69.67% unweighted accuracy (UA) on the IEMOCAP database, which is state-of-the-art performance.
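As an illustration of two ideas mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It shows a simplified attention pooling over feature frames (a learned scoring vector stands in for the self-attention mechanism, whose exact form the abstract does not specify) and the standard way WA and UA are computed: WA as overall accuracy, UA as the mean of per-class recalls. The function names and the scoring vector `w` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frames, w):
    """Pool T feature frames into one utterance-level vector.

    frames: list of T vectors, each of length D (e.g. recurrent outputs).
    w: hypothetical scoring vector of length D (assumed to be learned).
    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per frame.
    scores = [sum(x * wi for x, wi in zip(frame, w)) for frame in frames]
    alphas = softmax(scores)
    dim = len(frames[0])
    # Weighted sum of frames, so high-scoring frames dominate the summary.
    pooled = [
        sum(a * frame[d] for a, frame in zip(alphas, frames))
        for d in range(dim)
    ]
    return pooled, alphas


def weighted_and_unweighted_accuracy(y_true, y_pred):
    """Return (WA, UA): overall accuracy and mean per-class recall."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    wa = correct / len(y_true)
    recalls = []
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        recalls.append(sum(y_pred[i] == label for i in idx) / len(idx))
    ua = sum(recalls) / len(recalls)
    return wa, ua
```

On class-imbalanced data the two metrics can diverge: a classifier that always predicts the majority emotion scores well on WA but poorly on UA, which is why both figures are reported.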

Authors

Han Feng, Sei Ueno, Tatsuya Kawahara

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Speech & Audio > Recognition > Speech Recognition
Machine Learning > Learning Types > Multi-Task Learning
Speech & Audio > Analysis > Speech Analysis

Keywords

multi-task learning, self-attention mechanism, automatic speech recognition, end-to-end learning, end-to-end model, speech emotion recognition, acoustic-to-word model
