Emotion Identification from Raw Speech Signals Using DNNs

Mousmita Sarma; Pegah Ghahremani; Daniel Povey; Nagendra Kumar Goel; Kandarpa Kumar Sarma; Najim Dehak

INTERSPEECH 2018

Abstract

We investigate a number of Deep Neural Network (DNN) architectures for emotion identification with the IEMOCAP database. First, we compare different feature extraction front-ends: high-dimensional MFCC input (equivalent to filterbanks) versus frequency-domain and time-domain approaches to learning filters as part of the network. We obtain the best results with the time-domain filter-learning approach. Next, we investigated different ways to aggregate information over the duration of an utterance. We tried approaches with a single label per utterance and time aggregation inside the network, and approaches where the label is repeated for each frame. Repeating the label for each frame seemed to work best, and the best architecture that we tried interleaves TDNN-LSTM with time-restricted self-attention, achieving a weighted accuracy of 70.6%, versus 61.8% for the best previously published system, which used 257-dimensional Fourier log-energies as input.
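The per-frame labelling idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how frame-level outputs are combined at test time, so averaging log-posteriors over frames is an assumption made here. Weighted accuracy is taken to mean the overall fraction of correctly classified utterances, a common convention in emotion recognition that the abstract does not define. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def repeat_label(utterance_label: int, num_frames: int) -> List[int]:
    """Build frame-level targets by copying the utterance label to every frame."""
    return [utterance_label] * num_frames


def utterance_decision(frame_log_posteriors: Sequence[Sequence[float]]) -> int:
    """Pick one class per utterance from per-frame network outputs.

    Assumption: frames are combined by averaging log-posteriors over time.
    """
    if not frame_log_posteriors:
        raise ValueError("need at least one frame")
    num_classes = len(frame_log_posteriors[0])
    totals = [0.0] * num_classes
    for frame in frame_log_posteriors:
        for c, value in enumerate(frame):
            totals[c] += value
    means = [t / len(frame_log_posteriors) for t in totals]
    return max(range(num_classes), key=lambda c: means[c])


def weighted_accuracy(predictions: Sequence[int], references: Sequence[int]) -> float:
    """Fraction of utterances classified correctly (assumed definition)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Toy example: three frames, four classes (class indices are illustrative only).
frames = [
    [math.log(p) for p in (0.1, 0.6, 0.2, 0.1)],
    [math.log(p) for p in (0.3, 0.4, 0.2, 0.1)],
    [math.log(p) for p in (0.2, 0.5, 0.2, 0.1)],
]
targets = repeat_label(1, len(frames))
predicted = utterance_decision(frames)
```

Averaging in the log domain is equivalent to taking a geometric mean of the frame posteriors. Other poolings, such as a majority vote over frames, would fit the same interface.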

Topics

Machine Learning > Core Methods > Classification

Keywords

temporal modeling, feature extraction, speaker verification, speech emotion recognition
