2024 INTERSPEECH INTERSPEECH 2024

An Effective Local Prototypical Mapping Network for Speech Emotion Recognition

Abstract

Speech emotion recognition (SER) systems are generally optimized through utterance-level supervision, but emotion is complex and often varies within an utterance. This paper propose a local prototypical mapping network (LPMN) to model frame-level emotional variance and better exploit within-frame dynamics to improve performance. Specifically, a codebook of prototypes is first constructed to characterize complex frame-level features output from a pre-trained backbone network. An utterance-level embedding is obtained by selecting the most emotion-related mappings via a similarity measure between features and prototypes, motivated by multiple instance learning algorithms. Prototypes can be jointly optimized with quantization loss and CE loss. A prototype selection scheme is further proposed to select emotion-aware prototypes to reduce bias caused by irrelevant factors. Evaluations on IEMOCAP and MER2023 benchmarks demonstrate the effectiveness of LPMN.

🌉 Interdisciplinary Bridge — Computer Vision and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio