INTERSPEECH 2022

Improving Phonetic Transcriptions of Children’s Speech by Pronunciation Modelling with Constrained CTC-Decoding

Abstract

Language sample analysis (LSA) is a powerful tool for both therapeutic applications and research on child speech and language development. Nevertheless, it is not routinely used because manual transcription and analysis are costly. Assistance from automatic speech recognition for children has the potential to enable widespread use of LSA. However, the development of modern speech recognition systems relies heavily on large-scale datasets, so it faces the same obstacle of high transcription cost as LSA itself. In this paper, we study how cheaply transcribed child speech, i.e., speech with only an orthographic transcription, can be improved on a phonetic level by leveraging a CTC-based automatic speech recognition model trained on a small phonetically transcribed dataset. We constrain the CTC decoding by modeling pronunciation variation, given the orthographic transcription, using weighted finite state automata. Our experiments show that our method improves the transcription by 14% relative in terms of phone error rate. Additionally, we show how the improved transcript can in turn be leveraged to improve the training of a new model.
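The idea of constraining CTC decoding to pronunciations licensed by the orthographic transcription can be illustrated with a minimal sketch. It is a simplification and not the method of the paper: instead of composing the CTC output with a weighted finite state automaton, it enumerates a small, explicit list of candidate pronunciations, treats them as equally weighted, and picks the one with the highest CTC likelihood, computed with the standard CTC forward algorithm. The names `ctc_log_likelihood` and `constrained_decode`, the phone ids, and the toy posteriors are all hypothetical.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) over the given values."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs, labels, blank=0):
    """CTC forward algorithm: log P(labels | frames).

    log_probs: list of T frames, each a list of per-symbol log posteriors.
    labels:    phone ids of one candidate pronunciation (no blanks).
    """
    # Interleave blanks: [b, l1, b, l2, ..., b]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S = len(ext)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            cands = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                cands.append(alpha[s - 1])          # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[s - 2])          # skip the blank in between
            new[s] = logsumexp(*cands) + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the trailing blank.
    return alpha[0] if S == 1 else logsumexp(alpha[-1], alpha[-2])


def constrained_decode(log_probs, variants, blank=0):
    """Return the candidate pronunciation with the highest CTC likelihood."""
    return max(variants, key=lambda v: ctc_log_likelihood(log_probs, v, blank))


# Toy example (assumed values): symbol 0 is blank, 1 and 2 are two phones.
frames = [
    [math.log(p) for p in row]
    for row in ([0.1, 0.7, 0.2], [0.6, 0.2, 0.2], [0.1, 0.2, 0.7])
]
variants = [[1, 2], [2, 1], [1]]  # hypothetical pronunciation variants
best = constrained_decode(frames, variants)
```

In this toy setup the variant `[1, 2]` wins because the frame posteriors favor phone 1 first and phone 2 last. A weighted finite state automaton generalizes this in two ways: it represents many variants compactly instead of listing them, and it attaches a weight to each variant so that likely pronunciations are preferred over unlikely ones.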
