INTERSPEECH 2019

End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network

Abstract

Monaural speech separation remains a challenging task, and current performance is far from satisfactory, because of interference among multiple sound sources. Deep dilated temporal convolutional networks (TCNs) have proved very effective in sequence modeling, and this work investigates how to extend the TCN into a new state-of-the-art approach for monaural speech separation. First, a novel gating mechanism is introduced, yielding a gated TCN in which the gated activation controls the flow of information. Further, to remedy the temporal scale variation caused by differences in word length and pronunciation characteristics across speakers, a multi-scale dynamic weighted gated TCN pyramid is proposed, in which a “weightor” network dynamically determines the weights of the different gated TCNs for each utterance. Because the strengths of the branches, which have different temporal receptive fields, appear complementary, their combination outperforms a single-branch system. For the objective, we propose training the network by directly optimizing the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the WSJ0-2mix corpus yield an 18.4 dB SDR improvement, which shows that the proposed networks improve performance on the speaker separation task.
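The utterance-level SDR objective in a PIT style can be illustrated with a minimal sketch. It assumes the plain energy-ratio form of SDR, 10 log10(||s||^2 / ||s - ŝ||^2), which the abstract does not spell out. The function names `sdr_db` and `pit_sdr` and the `eps` stabilizer are hypothetical and are not taken from the paper.

```python
import math
from itertools import permutations


def sdr_db(reference, estimate, eps=1e-8):
    """Utterance-level SDR in dB, assuming the plain energy-ratio definition."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    # eps (an assumption) avoids division by zero and log of zero.
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def pit_sdr(references, estimates):
    """Try every speaker-to-output assignment and keep the best mean SDR.

    Returns (best_mean_sdr_db, best_permutation), where best_permutation[i]
    is the index of the estimate assigned to reference speaker i.
    """
    num_speakers = len(references)
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(num_speakers)):
        score = sum(
            sdr_db(references[i], estimates[perm[i]]) for i in range(num_speakers)
        ) / num_speakers
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


if __name__ == "__main__":
    # Toy two-speaker utterance; the network outputs arrive in swapped order.
    refs = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
    ests = [[0.0, 0.9, 0.0, -1.1], [1.1, 0.0, -0.9, 0.0]]
    score, perm = pit_sdr(refs, ests)
    print(f"best mean SDR: {score:.2f} dB, assignment: {perm}")
```

In training, the negative of the best mean SDR would serve as the loss, so the network is not penalized for the arbitrary order of its output channels. The exhaustive search over permutations is cheap for the two-speaker case of WSJ0-2mix, although it grows factorially with the number of speakers.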
