INTERSPEECH 2017

End-to-End Speech Recognition with Auditory Attention for Multi-Microphone Distance Speech Recognition

Abstract

End-to-End speech recognition is a recently proposed approach that directly transcribes input speech to text using a single model. End-to-End speech recognition methods, including Connectionist Temporal Classification and Attention-based Encoder-Decoder Networks, have been shown to obtain state-of-the-art performance on a number of tasks and to significantly simplify the modeling, training, and decoding procedures for speech recognition. In this paper, we extend our prior work on End-to-End speech recognition, focusing on the effectiveness of these models in far-field environments. Specifically, we propose introducing Auditory Attention to integrate input from multiple microphones directly within an End-to-End speech recognition model, leveraging the attention mechanism to dynamically tune the model’s attention to the most reliable input sources. We evaluate our proposed model on the CHiME-4 task and show substantial improvement compared to a model optimized for a single-microphone input.
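
As a rough illustration of attention over input sources, the sketch below fuses one time frame from several microphones by weighting each channel with a softmax over per-channel reliability scores. It is a minimal sketch under stated assumptions, not the model described in the paper: the function names `softmax` and `fuse_channels`, the hand-set scores, and the toy features are all hypothetical, and in a trained system the scores would presumably come from a learned scoring network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_channels(frames, scores):
    """Combine one time frame from several microphones into one feature vector.

    frames: list of per-microphone feature vectors, all the same length.
    scores: one unnormalized reliability score per microphone
            (hypothetical; set by hand here rather than learned).
    Returns the weighted-sum feature vector and the attention weights.
    """
    weights = softmax(scores)
    dim = len(frames[0])
    fused = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    return fused, weights


# Toy example: three microphones, two-dimensional features.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [2.0, 0.0, 0.0]  # the first microphone is treated as most reliable
fused, weights = fuse_channels(frames, scores)
```

Because the weights are non-negative and sum to one, the fused frame is a convex combination of the channel frames, so a channel with a low score contributes little to the result.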

Authors