INTERSPEECH 2020

Exploiting Cross-Domain Visual Feature Generation for Disordered Speech Recognition

Abstract

Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech, which is characterized by severe degradation of voice quality and a large mismatch against normal speech, it is difficult to record large amounts of high-quality audio-visual data. To address this issue, a cross-domain visual feature generation approach is proposed in this paper. An audio-visual inversion DNN system constructed using widely available out-of-domain audio-visual data was used to generate visual features for disordered speakers for whom video data is either very limited or unavailable. Experiments conducted on the UASpeech corpus suggest that the AVSR system based on the proposed cross-domain visual feature generation consistently outperformed both the baseline ASR system and the AVSR system using the original visual features. An overall word error rate reduction of 3.6% absolute (14% relative) was obtained over the previously published best system on the 8 UASpeech dysarthric speakers with audio-visual data of the same task.
