End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition

Suyoun Kim; Ian Lane

INTERSPEECH 2017

Abstract

End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End speech recognition methods including Connectionist Temporal Classification and Attention-based Encoder Decoder Networks have been shown to obtain state-of-the-art performance on a number of tasks and significantly simplify the modeling, training and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources. We evaluate our proposed model on the CHiME-4 task, and show substantial improvement compared to a model optimized for a single microphone input.
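The abstract does not spell out how Auditory Attention is computed, so the following is only an illustrative sketch of the general idea of attention over input sources: each microphone receives a relevance score, the scores are normalized with a softmax, and the per-microphone feature vectors are combined as a weighted sum. The function names `softmax` and `fuse_channels`, the hand-set scores, and the toy feature vectors are all assumptions made for illustration; in a trained model the scores would come from learned parameters, and the paper's actual formulation may differ.

```python
import math


def softmax(scores):
    """Normalize raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(frames, scores):
    """Combine one time step of multi-microphone features into one vector.

    frames: list of feature vectors, one per microphone, all the same length.
    scores: one relevance score per microphone (hypothetical here; a trained
            model would predict these from the inputs).
    Returns the fused feature vector and the attention weights.
    """
    if len(frames) != len(scores):
        raise ValueError("need exactly one score per microphone")
    weights = softmax(scores)
    dim = len(frames[0])
    fused = [
        sum(w * frame[d] for w, frame in zip(weights, frames))
        for d in range(dim)
    ]
    return fused, weights


# Toy example: two microphones, two-dimensional features.
frames = [[1.0, 2.0], [3.0, 4.0]]
fused_equal, weights_equal = fuse_channels(frames, [0.0, 0.0])
fused_skewed, weights_skewed = fuse_channels(frames, [10.0, 0.0])
```

With equal scores the fused vector is the plain average of the channels; with a much higher score for the first microphone, the fused vector moves close to that microphone's features, which is the sense in which attention can favor the more reliable input.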

Topics

Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Architectures > Transformers
Speech & Audio > Recognition > Speech Recognition

Keywords

connectionist temporal classification; end-to-end speech recognition; far-field speech recognition; auditory attention
