2021 INTERSPEECH INTERSPEECH 2021

Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance

Abstract

Humans have a sophisticated capability to robustly handle incomplete sensory input, as often happens in real environments. In earlier studies, the robustness of human speech perception was observed qualitatively by spectrally and temporally degraded stimuli. The current study investigates how machine speech recognition, especially end-to-end automatic speech recognition (E2E-ASR), can yield similar robustness against distorted acoustic cues. To evaluate the performance of E2E-ASR, we employ four types of distorted speech based on previous studies: locally time-reversed speech, noise-vocoded speech, phonemic restoration, and modulation-filtered speech. Those stimuli are synthesized by spectral and/or temporal manipulation from original speech samples whose human speech intelligibility scores have been well-reported. An experiment was conducted on the TED-LIUM2 for English and the Corpus of Spontaneous Japanese (CSJ) for Japanese. We found that while there is a tendency to exhibit similar robustness in some experiments, full recovery from the harmful effect of the severe spectral degradation is not achieved.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Speech & Audio
🧭 Keyword Pioneer — spectral degradation
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio