Speech Emotion Recognition with Multi-level Acoustic and Semantic Information Extraction and Interaction

Yuan Gao; Hao Shi; Chenhui Chu; Tatsuya Kawahara

INTERSPEECH 2024

Abstract

Speech emotion recognition (SER) systems can learn linguistic information by integrating automatic speech recognition (ASR). However, existing SER systems do not explicitly learn semantic emotional information from ASR predictions. Our proposed system addresses this problem by incorporating a semantic feature extractor for explicit emotional information extraction. Furthermore, we propose a cross-attention-based information interaction module that learns the complementary emotional information in the embeddings from both feature extractors. Within the interaction module, a temporal-aware gate fusion network dynamically integrates the embeddings from the acoustic and semantic feature extractors and mitigates the impact of ASR errors on SER. Experimental results on IEMOCAP show that our system outperforms existing SER systems, improving unweighted accuracy by 3.32%.
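The interaction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the formulation, so the code assumes a projection-free scaled dot-product cross attention (acoustic frames as queries, semantic token embeddings as keys and values) followed by a scalar sigmoid gate computed independently at each time step. The function names, the gate parameters `w_a`, `w_s`, `bias`, and the toy inputs are all hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_attention(queries, keys_values):
    """Scaled dot-product attention without learned projections (assumption).

    Each query frame attends over all key/value vectors, so the output
    has one vector per query frame.
    """
    scale = math.sqrt(len(queries[0]))
    dim = len(keys_values[0])
    attended = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        attended.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values)) for d in range(dim)]
        )
    return attended


def gate_fusion(acoustic, semantic, w_a, w_s, bias):
    """Per-frame gated fusion (assumed form of a temporal-aware gate).

    g_t = sigmoid(w_a . a_t + w_s . s_t + bias)
    fused_t = g_t * a_t + (1 - g_t) * s_t
    """
    fused, gates = [], []
    for a_t, s_t in zip(acoustic, semantic):
        g = sigmoid(dot(w_a, a_t) + dot(w_s, s_t) + bias)
        gates.append(g)
        fused.append([g * a + (1.0 - g) * s for a, s in zip(a_t, s_t)])
    return fused, gates


# Toy inputs (hypothetical): 3 acoustic frames and 2 token embeddings, dimension 2.
acoustic = [[0.2, -0.1], [0.5, 0.3], [-0.4, 0.8]]
semantic = [[1.0, 0.0], [0.0, 1.0]]

# Align the semantic embeddings to the acoustic time axis, then fuse frame by frame.
aligned = cross_attention(acoustic, semantic)
fused, gates = gate_fusion(acoustic, aligned, w_a=[0.5, -0.5], w_s=[0.1, 0.1], bias=0.0)
```

Because the gate in this sketch is recomputed at every frame, it can lean on the acoustic embedding where the semantic one is unreliable (for example, around an ASR error) and on the semantic embedding elsewhere. In a trained system the gate parameters would be learned rather than fixed by hand.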

Topics

Machine Learning > Learning Types > Contrastive Learning; Speech & Audio > Analysis > Clinical Speech Analysis

Keywords

acoustic feature extraction; cross attention; semantic feature extraction; speech emotion recognition; temporal-aware gate fusion
