Using the Bag-of-Audio-Word Feature Representation of ASR DNN Posteriors for Paralinguistic Classification

Gábor Gosztolya

INTERSPEECH 2019

Abstract

The Bag-of-Audio-Word (BoAW) representation is an utterance-level feature representation that has been applied successfully to various computational paralinguistic tasks. Here, we extend the BoAW feature extraction process with Deep Neural Networks (DNNs): first we train a DNN acoustic model for phoneme identification on an acoustic dataset consisting of 22 hours of speech, then we evaluate this DNN on a standard paralinguistic dataset. To construct utterance-level features from the resulting frame-level posterior vectors, we calculate their BoAW representation. We found that this approach can be used on its own, although its results lag behind those of the standard paralinguistic approach, and the optimal size of the extracted feature vectors tends to be large. Our approach can, however, be combined easily and efficiently with the standard paralinguistic one, resulting in the highest Unweighted Average Recall (UAR) score achieved so far for a general paralinguistic dataset.
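
To illustrate the step from frame-level posterior vectors to one utterance-level BoAW vector, here is a minimal sketch. It is not the paper's implementation: the abstract does not state how the codebook is built, so plain k-means is assumed here, each frame is assigned to a single nearest codeword, and the function names (`build_codebook`, `nearest_word`, `boaw`) and the synthetic data are hypothetical.

```python
import numpy as np

def nearest_word(frames, codebook):
    """Index of the closest codeword (squared Euclidean) for each frame."""
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def build_codebook(frames, n_words, n_iter=20, seed=0):
    """Assumed codebook construction: plain k-means over training frames."""
    rng = np.random.default_rng(seed)
    # Initialise with randomly chosen, distinct training frames.
    start = rng.choice(len(frames), size=n_words, replace=False)
    codebook = frames[start].copy()
    for _ in range(n_iter):
        assign = nearest_word(frames, codebook)
        for k in range(n_words):
            members = frames[assign == k]
            if len(members) > 0:  # leave empty clusters unchanged
                codebook[k] = members.mean(axis=0)
    return codebook

def boaw(frames, codebook):
    """Normalised histogram of codeword counts for one utterance."""
    counts = np.bincount(nearest_word(frames, codebook),
                         minlength=len(codebook))
    return counts / counts.sum()

# Synthetic stand-in for DNN posteriors: 300 frames over 10 phoneme classes.
rng = np.random.default_rng(42)
logits = rng.normal(size=(300, 10))
posteriors = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

codebook = build_codebook(posteriors, n_words=16)
utterance_vector = boaw(posteriors[:80], codebook)  # one 80-frame utterance
```

The length of the utterance-level vector equals the codebook size, which is why the feature vector size is a tunable parameter of the approach; the resulting vector can then be passed to any utterance-level classifier.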

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Architectures > Neural Networks

Keywords

feature representation, acoustic model, deep neural network, paralinguistic classification
