Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance

Takanori Ashihara; Takafumi Moriya; Makio Kashino

INTERSPEECH 2021

Abstract

Humans have a sophisticated capability to robustly handle incomplete sensory input, which often occurs in real environments. In earlier studies, the robustness of human speech perception was observed qualitatively using spectrally and temporally degraded stimuli. The current study investigates to what extent machine speech recognition, especially end-to-end automatic speech recognition (E2E-ASR), can achieve similar robustness against distorted acoustic cues. To evaluate the performance of E2E-ASR, we employ four types of distorted speech based on previous studies: locally time-reversed speech, noise-vocoded speech, phonemic restoration, and modulation-filtered speech. These stimuli are synthesized by spectral and/or temporal manipulation of original speech samples, and human speech intelligibility scores for them have been well reported. Experiments were conducted on TED-LIUM2 for English and the Corpus of Spontaneous Japanese (CSJ) for Japanese. We found that while E2E-ASR tends to exhibit similar robustness in some experiments, it does not fully recover from the harmful effects of severe spectral degradation.
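As an illustration of one of the four distortions named in the abstract, locally time-reversed speech is commonly described as cutting the waveform into consecutive fixed-length segments and reversing the samples inside each segment while keeping the segment order. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `locally_time_reverse` and the parameters `sample_rate` and `segment_ms` are hypothetical, and the segment durations used in the paper are not stated in the abstract.

```python
import numpy as np


def locally_time_reverse(signal, sample_rate, segment_ms):
    """Reverse the samples inside each consecutive fixed-length segment.

    Hypothetical helper for illustration only: `segment_ms` is the assumed
    segment duration in milliseconds. A final segment shorter than the
    others is reversed as-is.
    """
    signal = np.asarray(signal)
    # Segment length in samples, at least one sample.
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000.0)))
    out = np.empty_like(signal)
    for start in range(0, len(signal), seg_len):
        # Segment order is preserved; only the inside of each segment flips.
        out[start:start + seg_len] = signal[start:start + seg_len][::-1]
    return out


# Toy input: 10 samples at 1 kHz with 4 ms segments (4 samples each).
toy = np.arange(10)
print(locally_time_reverse(toy, sample_rate=1000, segment_ms=4))
# prints [3 2 1 0 7 6 5 4 9 8]
```

With a real recording, `signal` would be a one-dimensional array of audio samples. Applying the function twice with the same settings returns the original signal, which is a quick sanity check on the segmentation.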

Topics

Speech & Audio > Recognition > Automatic Speech Recognition
Artificial Intelligence > Core AI > Efficient Computing

Keywords

automatic speech recognition, speech intelligibility, end-to-end model, end-to-end speech recognition, speech robustness, spectral degradation, temporal degradation

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021