Leveraging CLIP Encoder for Multimodal Emotion Recognition

Yehun Song; Sunyoung Cho

WACV 2025

Abstract

Multimodal emotion recognition (MER) aims to identify human emotions by combining data from various modalities, such as language, audio, and vision. Despite recent advances in MER approaches, the difficulty of obtaining extensive datasets impedes further performance improvement. To mitigate this issue, we leverage a Contrastive Language-Image Pre-training (CLIP)-based architecture, together with the semantic knowledge it has acquired from massive datasets, to enhance discriminative multimodal representations. We propose a label encoder-guided MER framework based on CLIP (MER-CLIP) to learn emotion-related representations across modalities. Our approach introduces a label encoder that treats labels as text embeddings to incorporate their semantic information, leading to the learning of more representative emotional features. To further exploit label semantics, we devise a cross-modal decoder that aligns each modality to a shared embedding space by sequentially fusing modality features based on emotion-related input from the label encoder. Finally, the label encoder-guided prediction enables generalization across diverse labels by embedding their semantic information as well as the word labels. Experimental results show that our method outperforms state-of-the-art MER methods on the benchmark datasets CMU-MOSI and CMU-MOSEI.
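The idea of treating labels as text embeddings and predicting through them can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a fused multimodal feature and one embedding per label already exist in a shared space, and it scores labels by temperature-scaled cosine similarity, as is common in CLIP-style models. The function names, the three-label set, the toy vectors, and the temperature value of 0.07 are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(fused_feature, label_embeddings, temperature=0.07):
    """Pick the label whose embedding is most similar to the fused feature.

    `label_embeddings` maps a label word to a vector. In a real system these
    vectors would come from a text encoder; here they are supplied by hand.
    """
    names = list(label_embeddings)
    sims = [cosine_similarity(fused_feature, label_embeddings[n]) for n in names]
    probs = softmax([s / temperature for s in sims])
    best = max(range(len(names)), key=lambda i: probs[i])
    return names[best], dict(zip(names, probs))


# Toy, hand-written vectors standing in for encoder outputs (assumption).
label_embeddings = {
    "negative": [1.0, 0.0, 0.0],
    "neutral": [0.0, 1.0, 0.0],
    "positive": [0.0, 0.0, 1.0],
}
fused_feature = [0.1, 0.2, 0.9]

label, probs = predict_label(fused_feature, label_embeddings)
print(label, probs)
```

Because prediction is a similarity lookup against label embeddings rather than a fixed classification head, adding a new label in this sketch only requires adding another entry to `label_embeddings`.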

Authors

Yehun Song, Sunyoung Cho

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Classification

Keywords

semantic information, multimodal emotion recognition, contrastive language-image pre-training, label encoder, cross-modal decoder
