INTERSPEECH 2024

Cross-modal Features Interaction-and-Aggregation Network with Self-consistency Training for Speech Emotion Recognition

Abstract

In recent years, much research has been devoted to speech emotion recognition (SER) using multimodal data. Selective fusion of the features from different modalities is critical for multimodal SER. In this paper, we propose a cross-modal features interaction-and-aggregation network (CFIA-Net) with self-consistency training for SER. Specifically, we design a cross-modal features interaction-and-aggregation (CFIA) module that adaptively models the interaction between the features of the audio and text modalities and integrates them. Moreover, we introduce a self-consistency training strategy, which uses the features from deeper layers to supervise those from shallower layers so that the latter capture SER task-related information. Experimental results show that, compared with other bimodal SER methods, CFIA-Net achieves state-of-the-art performance on the IEMOCAP dataset, with a weighted accuracy (WA) of 83.37% and an unweighted accuracy (UA) of 83.67%.
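The two reported metrics can be illustrated with a minimal sketch. It assumes the convention common in SER work, where WA is the overall fraction of correctly classified utterances and UA is the mean of the per-class recalls; the abstract itself does not spell out these definitions. The function names and the toy labels below are hypothetical and are not taken from the paper or from IEMOCAP.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all utterances."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean per-class recall: every emotion class counts equally."""
    total = defaultdict(int)  # utterances per true class
    hit = defaultdict(int)    # correctly recognized utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hit[t] += int(t == p)
    return sum(hit[c] / total[c] for c in total) / len(total)


# Toy, imbalanced example (hypothetical labels, for illustration only).
y_true = ["angry", "angry", "angry", "angry", "happy", "sad"]
y_pred = ["angry", "angry", "angry", "angry", "angry", "sad"]

wa = weighted_accuracy(y_true, y_pred)
ua = unweighted_accuracy(y_true, y_pred)
print(f"WA = {wa:.4f}, UA = {ua:.4f}")
```

In this toy case the majority class is recognized perfectly while one minority class is missed entirely, so WA is higher than UA. This is why both values are usually reported on class-imbalanced emotion datasets.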
