Speech Emotion Recognition via Multi-Level Cross-Modal Distillation

Ruichen Li; Jinming Zhao; Qin Jin

2021 INTERSPEECH INTERSPEECH 2021

Speech Emotion Recognition via Multi-Level Cross-Modal Distillation

Abstract

Speech emotion recognition faces the problem that most of the existing speech corpora are limited in scale and diversity due to the high annotation cost and label ambiguity. In this work, we explore the task of learning robust speech emotion representations based on large unlabeled speech data. Under a simple assumption that the internal emotional states across different modalities are similar, we propose a method called Multi-level Cross-modal Emotion Distillation (MCED), which trains the speech emotion model without any labeled speech emotion data by transferring emotion knowledge from a pretrained text emotion model. Extensive experiments on two benchmark datasets, IEMOCAP and MELD, show that our proposed MCED can help learn effective speech emotion representations which generalize well on downstream speech emotion recognition tasks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — pretrained text model

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Deep Learning, Knowledge & Reasoning, Machine Learning, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors

Ruichen Li , Jinming Zhao , Qin Jin

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Self-Supervised Learning Machine Learning > Application Areas > Knowledge Distillation Speech & Audio > Analysis > Speech Analysis Deep Learning > Learning Types > Knowledge Distillation

Keywords

representation learning knowledge transfer unlabeled datum cross-modal distillation speech representation speech emotion recognition pretrained text model

Download PDF

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021