INTERSPEECH 2019

Using the Bag-of-Audio-Word Feature Representation of ASR DNN Posteriors for Paralinguistic Classification

Abstract

The Bag-of-Audio-Word (BoAW) representation is an utterance-level feature representation that has been applied successfully to various computational paralinguistic tasks. Here, we extend the BoAW feature extraction process with Deep Neural Networks: first we train a DNN acoustic model for phoneme identification on an acoustic dataset consisting of 22 hours of speech, then we evaluate this DNN on a standard paralinguistic dataset. To construct utterance-level features from the resulting frame-level posterior vectors, we calculate their BoAW representation. We found that this approach can be used on its own, although its results lag behind those of the standard paralinguistic approach, and the optimal size of the extracted feature vectors tends to be large. Our approach can, however, be combined easily and efficiently with the standard paralinguistic one, resulting in the highest Unweighted Average Recall (UAR) score achieved so far for a general paralinguistic dataset.
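
To illustrate how frame-level posterior vectors can be turned into a fixed-length utterance-level feature vector, the following is a minimal sketch of a BoAW pipeline. It is not the configuration used in this work: the k-means codebook, the single nearest-codeword assignment, the frame-count normalization, the toy posterior values, and all function names (`train_codebook`, `nearest`, `boaw`) are assumptions made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, frame):
    """Index of the codeword closest to the given frame."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], frame))


def train_codebook(frames, k, iters=10, seed=0):
    """Assumed codebook training: plain k-means over pooled frames."""
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            clusters[nearest(codebook, f)].append(f)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous codeword
                codebook[i] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def boaw(utterance, codebook):
    """Histogram of codeword assignments, normalized by the frame count."""
    counts = [0] * len(codebook)
    for frame in utterance:
        counts[nearest(codebook, frame)] += 1
    total = len(utterance)
    return [c / total for c in counts] if total else [0.0] * len(codebook)


# Toy posteriors over three hypothetical phoneme classes (made-up values).
utt_a = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
utt_b = [[0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]

codebook = train_codebook(utt_a + utt_b, k=2)
features = [boaw(u, codebook) for u in (utt_a, utt_b)]
```

Each utterance, whatever its length, maps to a vector with one entry per codeword, so the codebook size directly sets the feature vector size. Such vectors could then be passed to any utterance-level classifier, alone or concatenated with other utterance-level features.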
