INTERSPEECH 2017

NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition

Abstract

Although audio-visual speech is well known to improve the noise robustness of automatic speech recognition (ASR) systems, the field of audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT database and baseline, which are intended to overcome these two barriers and attract more research interest to AV-ASR. The NTCD-TIMIT corpus was created by adding six noise types at a range of signal-to-noise ratios to the speech material of the recently published TCD-TIMIT corpus. NTCD-TIMIT also includes visual features extracted from the TCD-TIMIT video recordings using the visual front-end presented in this paper. In addition, the database contains Kaldi scripts for training and decoding audio-only, video-only, and audio-visual ASR models. The baseline experiments and the results obtained using these scripts are detailed in this paper.
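
As an illustration of what adding noise at a given signal-to-noise ratio involves, the following is a minimal sketch in plain Python. It is not the procedure or tooling used to build NTCD-TIMIT, which the abstract does not specify; the function name `mix_at_snr`, the synthetic signals, and the 5 dB target are all assumptions made for this example.

```python
import math
import random


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the mixture has the requested SNR in dB.

    Hypothetical helper for illustration only; not part of the NTCD-TIMIT scripts.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]

    # Average power of each signal.
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both have non-zero power")

    # SNR_dB = 10 * log10(P_speech / (gain^2 * P_noise)), solved for gain.
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Synthetic stand-ins for a speech recording and a noise recording.
rng = random.Random(0)
speech = [math.sin(0.01 * i) for i in range(1000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mixed = mix_at_snr(speech, noise, snr_db=5.0)
```

Scaling the noise rather than the speech keeps the speech level unchanged across SNR conditions, which is one common design choice; a real corpus pipeline may differ.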
