INTERSPEECH 2022

Exploiting Fine-tuning of Self-supervised Learning Models for Improving Bi-modal Sentiment Analysis and Emotion Recognition

Abstract

Speech-based multimodal affective computing has recently attracted significant research attention. Previous experimental results have shown that the audio-only approach performs worse than the text-only approach in sentiment analysis and emotion recognition tasks. In this paper, we propose a new strategy to improve the performance of uni-modal and bi-modal affective computing systems via fine-tuning of two pre-trained self-supervised learning models (Text-RoBERTa and Speech-RoBERTa). We fine-tune the models on sentiment analysis and emotion recognition tasks using a shallow architecture, and apply cross-modal attention fusion to the models for further learning and final prediction or classification. We evaluate our proposed method on the CMU-MOSI, CMU-MOSEI and IEMOCAP datasets. The experimental results demonstrate that our approach outperforms existing state-of-the-art results on all benchmarks, establishing the effectiveness of the proposed method.
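
To make the fusion step concrete, the following is a minimal sketch of bidirectional cross-modal attention between a text feature sequence and a speech feature sequence. It is an illustration only, not the architecture used in this work: it omits learned query/key/value projections and multiple heads, assumes both modalities already share the same feature dimension, and all names (`cross_attention`, `mean_pool`, `fuse`) and the toy feature values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention where one modality queries the other.

    Assumption: no learned projections, so the other modality's vectors
    serve as both keys and values.
    """
    d = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(d)]
        )
    return attended


def mean_pool(seq):
    """Average a sequence of vectors into one vector."""
    n = len(seq)
    return [sum(vec[j] for vec in seq) / n for j in range(len(seq[0]))]


def fuse(text_feats, speech_feats):
    """Attend in both directions, pool each result, and concatenate."""
    text_to_speech = cross_attention(text_feats, speech_feats)
    speech_to_text = cross_attention(speech_feats, text_feats)
    return mean_pool(text_to_speech) + mean_pool(speech_to_text)


# Toy features (hypothetical): 3 text tokens and 2 speech frames, dimension 2.
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
speech_feats = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(text_feats, speech_feats)
```

In a setup of this kind, the fused vector would be passed to a small prediction head for sentiment regression or emotion classification; the sketch stops before that step.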
