Extending an Acoustic Data-Driven Phone Set for Spontaneous Speech Recognition

Jeong-Uk Bang; Mu-Yeol Choi; Sang-Hun Kim; Oh-Wook Kwon

INTERSPEECH 2019

Abstract

In this paper, we propose a method for extending a phone set using a large amount of Korean broadcast data, with the aim of improving spontaneous speech recognition. The proposed method first extracts variable-length phoneme-level segments from the broadcast data and converts them into fixed-length latent vectors using an LSTM architecture. We then use the k-means algorithm to cluster acoustically similar latent vectors and build a new phone set from the resulting clusters. To update the lexicon of the speech recognizer, we choose, for each word, the pronunciation sequence with the highest conditional probability. To examine the proposed units, we visualize the spectral patterns and segment durations of the new phone set. In both spontaneous and read speech recognition tasks, the proposed units are shown to outperform phoneme-based and grapheme-based units.
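
The clustering and lexicon-update steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the LSTM encoder is omitted and the latent vectors are taken as given, the farthest-point initialization is a simplification chosen to keep the example deterministic, and the conditional probability of a pronunciation given a word is assumed to be estimated by relative frequency. The names `kmeans`, `pick_pronunciations`, and the unit labels such as `u3` are hypothetical.

```python
from collections import Counter, defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20):
    """Plain k-means over fixed-length vectors; returns one cluster id per vector.

    Assumption: centroids are initialized by farthest-point selection
    (a simplification for determinism, not taken from the paper).
    """
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))

    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        for i, v in enumerate(vectors):
            labels[i] = min(range(k), key=lambda c: sq_dist(v, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pick_pronunciations(observations):
    """For each word, keep the unit sequence observed most often.

    Assumption: relative frequency stands in for P(sequence | word).
    """
    counts = defaultdict(Counter)
    for word, units in observations:
        counts[word][tuple(units)] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}
```

In this sketch, each cluster id would play the role of one unit in the extended phone set, and `pick_pronunciations` would be applied to word-level sequences of those units collected from the training data.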

Topics

Machine Learning > Core Methods > Clustering
Speech & Audio > Recognition > Speech Recognition

Keywords

k-means clustering, speech recognition, phone recognition, LSTM architecture, spontaneous speech, phoneme segmentation, spontaneous speech recognition, phone set, latent vector clustering
