Multi-Modal Learning for Speech Emotion Recognition: An Analysis and Comparison of ASR Outputs with Ground Truth Transcription

Saurabh Sahu; Vikramjit Mitra; Nadee Seneviratne; Carol Espy-Wilson

INTERSPEECH 2019

Abstract

In this paper we leverage multi-modal learning and automatic speech recognition (ASR) systems toward building a speech-only emotion recognition model. Previous studies have shown that emotion recognition models using only acoustic features do not perform satisfactorily in detecting valence level, whereas text analysis has been shown to be helpful for sentiment classification. We compare the classification accuracies of an audio-only model, a text-only model, and a multi-modal system leveraging both by performing a cross-validation analysis on the IEMOCAP dataset. Confusion matrices show that it is valence-level detection that improves when textual information is incorporated. In the second stage of experiments, we use two ASR application programming interfaces (APIs) to obtain the transcriptions. We compare the performance of the multi-modal systems using ASR transcriptions with each other and with that of a system using ground truth transcriptions. We analyze the confusion matrices to determine how using ASR transcriptions instead of ground truth ones affects class-wise accuracies. Finally, we investigate the generalisability of such a model by performing a cross-corpus study.
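The class-wise comparison described in the abstract can be illustrated with a minimal sketch. It assumes that class-wise accuracy means per-class recall (the diagonal of the confusion matrix divided by the row total). The four emotion labels, the toy predictions, and the names `confusion_matrix` and `class_wise_accuracy` are hypothetical and chosen only for illustration; they are not the paper's code or results.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, labels):
    """Build a matrix whose rows are true classes and columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def class_wise_accuracy(matrix):
    """Per-class recall: diagonal count divided by the row total (0.0 for empty rows)."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


# Toy data invented for illustration only (NOT results from the paper).
labels = ["angry", "happy", "neutral", "sad"]  # assumed label set
y_true = ["angry", "angry", "happy", "happy", "neutral", "neutral", "sad", "sad"]
y_pred_audio = ["angry", "happy", "angry", "happy", "neutral", "sad", "sad", "sad"]
y_pred_fused = ["angry", "angry", "happy", "happy", "neutral", "sad", "sad", "sad"]

acc_audio = class_wise_accuracy(confusion_matrix(y_true, y_pred_audio, labels))
acc_fused = class_wise_accuracy(confusion_matrix(y_true, y_pred_fused, labels))

# Print the per-class accuracies of the two hypothetical systems side by side.
for name, a, f in zip(labels, acc_audio, acc_fused):
    print(f"{name:8s} audio-only={a:.2f} multi-modal={f:.2f}")
```

Comparing the two lists entry by entry shows which classes gain from the added modality, which is the kind of reading the confusion-matrix analysis relies on.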

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Classification
Speech & Audio > Analysis > Clinical Speech Analysis

Keywords

multi-modal learning; automatic speech recognition; speech emotion recognition; ground truth transcription; valence detection; cross-validation analysis

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019