Speaker-aware Cross-modal Fusion Architecture for Conversational Emotion Recognition

Huan Zhao; Bo Li; Zixing Zhang

INTERSPEECH 2023

Abstract

Conversational Emotion Recognition (CER) is an important topic in building intelligent human-machine interaction systems. The emotion expressed in an utterance is influenced mainly by the conversational context and by the speakers. In addition, making full use of the relevant features of both the speech and text modalities is crucial to CER performance. Based on these considerations, we propose a novel Speaker-aware Cross-modal Fusion Architecture (SCFA). Within each modality, we design a conversation encoder consisting of a context encoder and a speaker-aware encoder, which model the conversational content and the intra- and inter-speaker influence, respectively. On this basis, we introduce cross-modal fusion attention to extract the cross-modal characteristics of the conversation and thereby better detect the emotions it contains. We conduct experiments on the IEMOCAP and MELD datasets. Compared with state-of-the-art baselines, SCFA achieves better performance on average.
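To make the idea of cross-modal fusion attention concrete, the following is a minimal sketch and not the SCFA implementation. The abstract does not specify the form of the attention, so this example assumes plain scaled dot-product attention in which each utterance's text features attend over the audio features of the conversation and vice versa. The residual addition, the final concatenation, and all names (`cross_modal_attention`, `fuse`, the toy feature sizes) are illustrative assumptions.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, contexts):
    """Each query utterance (one modality) attends over all context
    utterances (the other modality) with scaled dot-product attention.
    This attention form is an assumption, not taken from the paper."""
    scale = math.sqrt(len(queries[0]))
    dim = len(contexts[0])
    attended, all_weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, c)) / scale for c in contexts]
        weights = softmax(scores)
        # Weighted sum of the other modality's utterance features.
        attended.append(
            [sum(w * c[i] for w, c in zip(weights, contexts)) for i in range(dim)]
        )
        all_weights.append(weights)
    return attended, all_weights


def fuse(text_feats, audio_feats):
    """Hypothetical fusion: residual-add the attended features to each
    modality, then concatenate the two modalities per utterance."""
    text_from_audio, _ = cross_modal_attention(text_feats, audio_feats)
    audio_from_text, _ = cross_modal_attention(audio_feats, text_feats)
    fused = []
    for t, ta, a, at in zip(text_feats, text_from_audio, audio_feats, audio_from_text):
        fused.append([x + y for x, y in zip(t, ta)] + [x + y for x, y in zip(a, at)])
    return fused


# Toy conversation: 5 utterances, 8-dimensional features per modality.
random.seed(0)
n_utterances, dim = 5, 8
text_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_utterances)]
audio_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_utterances)]

fused = fuse(text_feats, audio_feats)
print(len(fused), len(fused[0]))  # 5 16
```

In a full system, the inputs to such a fusion step would presumably be the outputs of the per-modality conversation encoders (context plus speaker-aware encoding) rather than random vectors, and the fused features would feed an emotion classifier; both of those stages are outside this sketch.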
