INTERSPEECH 2023

Two-stage Finetuning of Wav2vec 2.0 for Speech Emotion Recognition with ASR and Gender Pretraining

Abstract

This paper investigates how pretraining on automatic speech recognition (ASR) and gender recognition can improve wav2vec 2.0 embeddings for speech emotion recognition (SER). Specifically, we propose a two-stage finetuning method that first pretrains the self-supervised learning (SSL) model on ASR, both to learn linguistic information and to address the gradient conflict problem of conventional multi-task learning (MTL). Experimental results on the IEMOCAP dataset show that ASR pretraining significantly outperforms simple MTL with ASR, demonstrating the effectiveness of the two-stage finetuning method. We also investigate how to combine gender recognition with ASR pretraining to derive more effective embeddings for SER. Because the upper layers of the SSL model are focused on ASR, incorporating a skip connection can effectively embed the gender information. Compared with the single-task learning baseline, our method achieves a UA of 76.10%, an absolute improvement of 3.97%.
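The abstract reports UA without defining it. In SER work on IEMOCAP, UA conventionally denotes unweighted accuracy, the mean of the per-class recalls, so that rare emotion classes count as much as frequent ones. Assuming that convention applies here, the metric can be sketched as follows. The function names and the toy labels are illustrative only and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: every class counts equally, whatever its size."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")
    total = defaultdict(int)    # samples per reference class
    correct = defaultdict(int)  # correctly predicted samples per reference class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy over all samples, shown for contrast with UA."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, hypothetical labels with an imbalanced class distribution.
y_true = ["neu"] * 4 + ["hap"] * 2 + ["sad"] + ["ang"]
y_pred = ["neu", "neu", "neu", "hap", "hap", "sad", "sad", "neu"]

print(unweighted_accuracy(y_true, y_pred))  # 0.5625
print(weighted_accuracy(y_true, y_pred))    # 0.625
```

On this toy data the two metrics differ because the single misclassified "ang" sample lowers its class recall to zero, which UA weights as heavily as the four "neu" samples.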
