MERMAID: Multi-perspective Self-reflective Agents with Generative Augmentation for Emotion Recognition

Zhongyu Yang; Junhao Song; Siyang Song; Wei Pang; Yingfang Yuan

EMNLP 2025

Abstract

Multimodal large language models (MLLMs) have demonstrated strong performance across diverse multimodal tasks. However, their application to emotion recognition in natural images remains underexplored. MLLMs struggle to handle ambiguous emotional expressions and implicit affective cues, a capability that is crucial for affective understanding but largely overlooked. To address these challenges, we propose MERMAID, a novel multi-agent framework that integrates a multi-perspective self-reflection module, an emotion-guided visual augmentation module, and a cross-modal verification module. These components enable agents to interact across modalities and reinforce subtle emotional semantics, thereby enhancing emotion recognition and supporting autonomous performance. Extensive experiments show that MERMAID outperforms existing methods, achieving absolute accuracy gains of 8.70%–27.90% across diverse benchmarks and exhibiting greater robustness in emotionally diverse scenarios.

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Multi-Agent Systems
Artificial Intelligence > Core AI > Multimodal Learning
Interdisciplinary > Social > Affective Computing
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

emotion recognition; generative augmentation; multimodal large language model; multi-agent system; visual augmentation; cross-modal verification
