INTERSPEECH 2024

Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction

Abstract

Speech emotion recognition (SER) systems can learn linguistic information by integrating automatic speech recognition (ASR). However, existing SER systems fall short of explicitly learning semantic emotional information from ASR predictions. Our proposed system addresses this problem by incorporating a semantic feature extractor that extracts emotional information explicitly. Furthermore, we propose a cross-attention-based information interaction module to learn the complementary emotional information in the embeddings from the acoustic and semantic feature extractors. Within the interaction module, a temporal-aware gate fusion network dynamically integrates the embeddings from the two extractors and mitigates the impact of ASR errors on SER. Experimental results on IEMOCAP show that our system outperforms existing SER systems, improving the unweighted accuracy by 3.32%.
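The interaction described above can be illustrated with a minimal NumPy sketch. The abstract does not specify the layer design, so everything here is an assumption made for illustration: single-head scaled dot-product cross-attention without learned projections, a per-frame sigmoid gate as one possible reading of "temporal-aware", and the hypothetical names `cross_attention`, `gate_fusion`, `w_gate`, and `b_gate`. The inputs are random stand-ins for the acoustic and semantic embeddings, not real features.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context):
    # Assumed form: each query frame attends over all context frames
    # (single head, no learned projections).
    d = query.shape[-1]
    scores = query @ context.T / np.sqrt(d)   # (T_query, T_context)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ context                  # (T_query, d)

def gate_fusion(acoustic, semantic, w_gate, b_gate):
    # Assumed form: a sigmoid gate computed separately for every frame,
    # so the acoustic/semantic mix can change over time.
    concat = np.concatenate([acoustic, semantic], axis=-1)     # (T, 2d)
    gate = 1.0 / (1.0 + np.exp(-(concat @ w_gate + b_gate)))   # (T, d)
    fused = gate * acoustic + (1.0 - gate) * semantic
    return fused, gate

rng = np.random.default_rng(0)
T_a, T_s, d = 6, 4, 8                         # illustrative sizes only
acoustic = rng.standard_normal((T_a, d))      # stand-in acoustic embeddings
semantic = rng.standard_normal((T_s, d))      # stand-in semantic embeddings

# Align the semantic sequence to the acoustic time axis, then fuse.
semantic_aligned = cross_attention(acoustic, semantic)
w_gate = 0.1 * rng.standard_normal((2 * d, d))  # hypothetical gate weights
b_gate = np.zeros(d)
fused, gate = gate_fusion(acoustic, semantic_aligned, w_gate, b_gate)
print(fused.shape, gate.shape)
```

In this sketch, a gate value near 1 keeps the acoustic embedding for that frame and a value near 0 keeps the semantic one. This is one way a model could lean on acoustics where the ASR-derived semantics are unreliable, though the abstract does not say how the actual network achieves this.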
