INTERSPEECH 2023

Towards Paralinguistic-Only Speech Representations for End-to-End Speech Emotion Recognition

Abstract

We propose a methodology for aggregating information from the transformer layer outputs of a generic speech encoder (e.g., WavLM, HuBERT) for the downstream task of Speech Emotion Recognition (SER). The proposed methodology significantly reduces the dependency of model predictions on linguistic content while achieving competitive performance, without requiring costly encoder re-training. We evaluate the proposed paradigm via accuracy, Positive Pointwise Mutual Information (PPMI), and visualization of the learned attention weights. The methodology generalizes well to a multi-language SER setting in addition to single-language SER, suggesting that different languages share cultural commonalities in the paralinguistic domain. Experimental results demonstrate this ability by testing our model on unseen languages in a zero-shot fashion, suggesting that the proposed method is inclusive with respect to speech and language and therefore applicable to a wide audience of speakers.
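
The abstract names Positive Pointwise Mutual Information (PPMI) as an evaluation measure without defining it. For reference, the standard definition is PPMI(x, y) = max(0, log2(p(x, y) / (p(x) p(y)))). The following is a minimal sketch of that definition only, not the paper's evaluation code. It assumes the measure is computed over co-occurring (word, predicted emotion) pairs; that pairing, the function name `ppmi_table`, and the toy data are illustrative assumptions that do not come from the paper.

```python
import math
from collections import Counter


def ppmi_table(pairs):
    """Return {(x, y): PPMI} for an iterable of co-occurring (x, y) pairs.

    Probabilities are estimated as relative frequencies over the pairs.
    (Hypothetical helper for illustration; not from the paper.)
    """
    pairs = list(pairs)
    total = len(pairs)
    joint = Counter(pairs)                      # counts of (x, y)
    count_x = Counter(x for x, _ in pairs)      # marginal counts of x
    count_y = Counter(y for _, y in pairs)      # marginal counts of y

    table = {}
    for (x, y), n_xy in joint.items():
        # PMI = log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        pmi = math.log2(n_xy * total / (count_x[x] * count_y[y]))
        # PPMI clips negative associations to zero
        table[(x, y)] = max(0.0, pmi)
    return table


# Toy, made-up (word, predicted emotion) pairs; an assumption for illustration
toy_pairs = [
    ("great", "happy"),
    ("great", "happy"),
    ("great", "sad"),
    ("okay", "sad"),
]
scores = ppmi_table(toy_pairs)
for pair, value in sorted(scores.items()):
    print(pair, round(value, 3))
```

Under this assumed setup, values near zero for most word and emotion pairs would be consistent with predictions that depend little on which words were spoken.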
