NTCD-TIMIT: A New Database and Baseline for Noise-Robust Audio-Visual Speech Recognition

Ahmed Hussen Abdelaziz

INTERSPEECH 2017

Abstract

Although audio-visual speech is well known to improve the noise robustness of automatic speech recognition (ASR) systems, audio-visual ASR (AV-ASR) has not gathered the research momentum it deserves. This is mainly due to the lack of audio-visual corpora and the need to combine two fields of knowledge: ASR and computer vision. This paper describes the NTCD-TIMIT database and baseline, which can overcome these two barriers and attract more research interest to AV-ASR. The NTCD-TIMIT corpus was created by adding six noise types at a range of signal-to-noise ratios (SNRs) to the speech material of the recently published TCD-TIMIT corpus. NTCD-TIMIT also includes visual features extracted from the TCD-TIMIT video recordings using the visual front-end presented in this paper. In addition, the database contains Kaldi scripts for training and decoding audio-only, video-only, and audio-visual ASR models. The baseline experiments and the results obtained using these scripts are detailed in this paper.
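To make the corpus construction concrete, the following is a minimal sketch of how a noise recording can be mixed into clean speech at a target SNR. It is not the authors' procedure, which the abstract does not specify. The function names `mix_at_snr` and `measured_snr_db`, the synthetic signals, and the 5 dB target are all hypothetical choices made for illustration.

```python
import math
import random


def _power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR (dB).

    Hypothetical helper: the abstract only says noise was added at a range
    of SNRs, not how the scaling was done.
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = _power(speech)
    p_noise = _power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """SNR of a mixture, given the clean speech it was built from."""
    residual = [m - s for m, s in zip(mixture, speech)]
    return 10 * math.log10(_power(speech) / _power(residual))


# Synthetic stand-ins for real recordings (assumption: 16 kHz, 1 second).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

noisy = mix_at_snr(speech, noise, snr_db=5.0)
check = measured_snr_db(speech, noisy)
```

In this sketch the noise gain is derived from the average power of the whole utterance; a real pipeline might instead measure power only over speech-active segments, which would shift the effective SNR.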

Authors

Ahmed Hussen Abdelaziz

Topics

Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition

Keywords

multimodal learning, automatic speech recognition, audio-visual speech recognition, noise-robust speech recognition, visual front-end

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017