Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion

Li Xiao; Kotaro Funakoshi; Manabu Okumura

2026 EACL EACL 2026

Emotion Recognition in Multi-Speaker Conversations through Speaker Identification, Knowledge Distillation, and Hierarchical Fusion

Abstract

AbstractEmotion recognition in multi-speaker conversations faces significant challenges due to speaker ambiguity and severe class imbalance. We propose a novel framework that addresses these issues through three key innovations: (1) a speaker identification module that leverages audio-visual synchronization to accurately identify the active speaker, (2) a knowledge distillation strategy that transfers superior textual emotion understanding to audio and visual modalities, and (3) hierarchical attention fusion with composite loss functions to handle class imbalance. Comprehensive evaluations on MELD and IEMOCAP datasets demonstrate superior performance, achieving 67.75% and 72.44% weighted F1 scores respectively, with particularly notable improvements on minority emotion classes.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Li Xiao , Kotaro Funakoshi , Manabu Okumura

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Application Areas > Knowledge Distillation Deep Learning > Techniques > Model Architecture

Keywords

knowledge distillation emotion recognition multi-speaker conversation hierarchical attention speaker identification

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026