End-to-End Monaural Speech Separation with Multi-Scale Dynamic Weighted Gated Dilated Convolutional Pyramid Network

Ziqiang Shi; Huibin Lin; Liu Liu; Rujie Liu; Shoji Hayakawa; Shouji Harada; Jiqing Han

INTERSPEECH 2019

Abstract

Monaural speech separation remains far from satisfactory and is a challenging task because of interference among multiple sound sources. While deep dilated temporal convolutional networks (TCNs) have proved very effective in sequence modeling, this work investigates how to extend the TCN into a new state-of-the-art approach for monaural speech separation. First, a novel gating mechanism is introduced to form a gated TCN; the gated activation controls the flow of information. Further, to remedy the temporal scale variation caused by word length and the pronunciation characteristics of different speakers, a multi-scale dynamic weighted pyramid of gated TCNs is proposed, in which a “weightor” network dynamically determines the weights of the different gated TCNs for each utterance. Since the strengths of branches with different temporal receptive fields appear complementary, the combination outperforms a single-branch system. For the objective, we propose to train the network by directly optimizing the utterance-level signal-to-distortion ratio (SDR) in a permutation invariant training (PIT) style. Our experiments on the WSJ0-2mix corpus yield an 18.4 dB SDR improvement, which shows that the proposed network can improve performance on the speaker separation task.
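The training objective named in the abstract, utterance-level SDR optimized in a PIT style, can be illustrated with a minimal sketch. It assumes the plain energy-ratio definition of SDR, 10·log10(‖s‖² / ‖s − ŝ‖²); the abstract does not say which SDR variant the authors use, and the function names `sdr_db` and `pit_sdr` are hypothetical, chosen only for this illustration.

```python
import itertools
import math


def sdr_db(reference, estimate, eps=1e-8):
    """SDR in dB under the ASSUMED plain definition:
    10 * log10(||s||^2 / ||s - s_hat||^2)."""
    signal_energy = sum(s * s for s in reference)
    error_energy = sum((s - e) ** 2 for s, e in zip(reference, estimate))
    return 10.0 * math.log10((signal_energy + eps) / (error_energy + eps))


def pit_sdr(references, estimates):
    """Try every assignment of estimates to reference speakers and
    return (best mean SDR over the utterance, best permutation).
    perm[i] is the index of the estimate matched to reference i."""
    best_score, best_perm = -math.inf, None
    for perm in itertools.permutations(range(len(references))):
        score = sum(
            sdr_db(references[i], estimates[p]) for i, p in enumerate(perm)
        ) / len(perm)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm


# Toy two-speaker utterance where the network outputs come out swapped.
refs = [[1.0, -1.0, 1.0, -1.0], [0.5, 0.5, -0.5, -0.5]]
ests = [[0.5, 0.5, -0.5, -0.4], [1.0, -1.0, 0.9, -1.0]]
score, perm = pit_sdr(refs, ests)
loss = -score  # a training loop would minimize the negative best-permutation SDR
```

In this sketch the permutation search is exhaustive, which is cheap for the two-speaker case of WSJ0-2mix but grows factorially with the number of speakers.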

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Optimization & Theory > Neural Network Optimization

Keywords

speech separation, gated mechanism, dilated convolution, temporal convolutional network, multi-scale pyramid

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019