INTERSPEECH 2020

BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example

Abstract

Optimal fusion of multiple streams for automatic speech recognition (ASR) is a nontrivial problem. Recently, so-called posterior-in-posterior-out (PIPO-)BLSTMs have been proposed, which serve as state sequence enhancers and have highly attractive training properties. In this work, we adopt PIPO-BLSTMs and employ them for stream fusion in ASR. Our contributions are the following: First, we show the positive effect of a PIPO-BLSTM as a state sequence enhancer for various stream fusion approaches. Second, we confirm the advantageous context-free (CF) training property of the PIPO-BLSTM for all investigated fusion approaches. Third, using a fusion example with two streams that stem from different short-time Fourier transform window lengths, we show that all investigated fusion approaches benefit. Finally, the turbo fusion approach turns out to perform best, employing a CF-type PIPO-BLSTM with a novel iterative augmentation in training.
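To make the two-stream setting concrete, the following is a minimal sketch of how two feature streams with different short-time Fourier transform window lengths could be extracted from the same signal and how two frame-wise posterior streams might be combined. Everything in it is an assumption for illustration: the function names, the window lengths (25 ms and 50 ms at 16 kHz), the shared 10 ms hop, and the log-linear combination rule are not taken from this work, and the combination shown is a generic baseline rather than any of the fusion approaches or the PIPO-BLSTM investigated here.

```python
import numpy as np


def stft_magnitude(signal, win_len, hop, n_fft):
    """Magnitude STFT with a Hann window; frames are zero-padded to n_fft."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + win_len] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1))


def two_window_streams(signal, short_win=400, long_win=800, hop=160, n_fft=1024):
    """Hypothetical helper: two streams from different window lengths.

    A shared hop and FFT size keep both streams on the same frame rate and
    frequency grid; the longer window yields fewer frames, so both streams
    are truncated to the common number of frames.
    """
    short = stft_magnitude(signal, short_win, hop, n_fft)
    long_ = stft_magnitude(signal, long_win, hop, n_fft)
    n = min(len(short), len(long_))
    return short[:n], long_[:n]


def loglinear_fusion(post_a, post_b, weight=0.5, eps=1e-12):
    """Generic baseline (assumption): weighted log-linear posterior fusion.

    post_a, post_b: arrays of shape (frames, states) with rows summing to 1.
    Returns renormalized fused posteriors of the same shape.
    """
    log_p = weight * np.log(post_a + eps) + (1.0 - weight) * np.log(post_b + eps)
    log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
    fused = np.exp(log_p)
    return fused / fused.sum(axis=1, keepdims=True)
```

In such a setup, each stream would typically feed its own acoustic model, and the resulting state posteriors would be the inputs to the fusion step. The fixed `weight` is only a placeholder for whatever combination a given fusion approach actually uses.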
