Probing Speech Quality Information in ASR Systems

Bao Thang Ta; Minh Tu Le; Nhat Minh Le; Van Hai Do

INTERSPEECH 2023

Abstract

This paper investigates how intermediate speech representations in a state-of-the-art automatic speech recognition (ASR) system encode multi-dimensional speech quality, including MOS, Noisiness, Coloration, Discontinuity, and Loudness. We find that the ASR encoder layers encode speech quality information to varying degrees, yet their representations remain much richer in quality information than the Mel-spectrogram, an input widely used in previous work. This finding motivates the Attentive Conformer with ASR pretraining, a novel deep learning model that exploits the rich information in pretrained ASR models and can focus on specific layers. Experiments on the NISQA speech quality assessment dataset demonstrate that the proposed model achieves state-of-the-art performance with significantly less training data.
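
The abstract does not specify how the model focuses on specific layers. One common way to realise that idea, shown here purely as an illustrative sketch and not as the authors' method, is to combine per-layer representations with softmax-normalised weights. The function names, the toy feature vectors, and the per-layer scores below are all hypothetical; in a real model the scores would be learnable parameters and the features would come from a pretrained ASR encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_feats, layer_scores):
    """Combine per-layer feature vectors using softmax-normalised weights.

    layer_feats:  list of L vectors, one pooled feature vector per encoder
                  layer, all of the same length D (toy stand-ins here).
    layer_scores: list of L scalars (hypothetical learnable parameters).
    Returns the layer weights and the combined D-dimensional vector.
    """
    if len(layer_feats) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_feats[0])
    combined = [0.0] * dim
    for w, feat in zip(weights, layer_feats):
        for i in range(dim):
            combined[i] += w * feat[i]
    return weights, combined


# Toy example: three "layers", two-dimensional features (assumed values).
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Equal scores give a plain average over layers.
uniform_weights, uniform_combined = weighted_layer_sum(feats, [0.0, 0.0, 0.0])

# A large score on the first layer makes the combination follow that layer.
peaked_weights, peaked_combined = weighted_layer_sum(feats, [10.0, 0.0, 0.0])
```

With equal scores the combination is a plain average of the layers; raising one score shifts the result toward that layer, which is the sense in which such a weighting can "focus" on the layers that carry the most quality information.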
