End-to-End Speech Recognition from the Raw Waveform

Neil Zeghidour; Nicolas Usunier; Gabriel Synnaeve; Ronan Collobert; Emmanuel Dupoux

2018 INTERSPEECH INTERSPEECH 2018

End-to-End Speech Recognition from the Raw Waveform

Abstract

State-of-the-art speech recognition systems rely on fixed, hand-crafted features such as mel-filterbanks to preprocess the waveform before the training pipeline. In this paper, we study end-to-end systems trained directly from the raw waveform, building on two alternatives for trainable replacements of mel-filterbanks that use a convolutional architecture. The first one is inspired by gammatone filterbanks (Hoshen et al., 2015; Sainath et al, 2015) and the second one by the scattering transform (Zeghidour et al., 2017). We propose two modifications to these architectures and systematically compare them to mel-filterbanks, on the Wall Street Journal dataset. The first modification is the addition of an instance normalization layer, which greatly improves on the gammatone-based trainable filterbanks and speeds up the training of the scattering-based filterbanks. The second one relates to the low-pass filter used in these approaches. These modifications consistently improve performances for both approaches and remove the need for a careful initialization in scattering-based trainable filterbanks. In particular, we show a consistent improvement in word error rate of the trainable filterbanks relatively to comparable mel-filterbanks. It is the first time end-to-end models trained from the raw signal significantly outperform mel-filterbanks on a large vocabulary task with clean recording conditions.

🌉 Interdisciplinary Bridge — Deep Learning and Speech & Audio

🧭 Keyword Pioneer — trainable filterbank

🐣 Hot Topic Early Bird — end-to-end speech recognition

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing, Security & Privacy, Speech & Audio

Authors

Neil Zeghidour , Nicolas Usunier , Gabriel Synnaeve , Ronan Collobert , Emmanuel Dupoux

Topics

Deep Learning > Techniques > Model Architecture Speech & Audio > Recognition > Automatic Speech Recognition

Keywords

instance normalization end-to-end speech recognition raw waveform gammatone filterbank trainable filterbank

Download PDF

Related papers

HoloCompanion: An MR Friend for EveryOne 2018

Estimation of the Vocal Tract Length of Vowel Sounds Based on the Frequency of the Significant Spectral Valley 2018

Deep Learning Techniques for Koala Activity Detection 2018

An Exploration of Local Speaking Rate Variations in Mandarin Read Speech 2018

Acoustic Analysis of Whispery Voice Disguise in Mandarin Chinese 2018