INTERSPEECH 2020

Sparse Mixture of Local Experts for Efficient Speech Enhancement

Abstract

This work proposes a novel approach for reducing the computational complexity of speech denoising neural networks by using a sparsely active ensemble topology. In our ensemble networks, a gating module classifies an input noisy speech signal, either by identifying speaker gender or by estimating signal degradation, and exclusively assigns it to a best-case specialist module, which is optimized to denoise a particular subset of the training data. This approach extends the hypothesis that speech denoising can be simplified if it is split into non-overlapping subproblems, in contrast to earlier approaches that train large generalist neural networks to address a wide range of noisy speech data. We compare a baseline recurrent network against an ensemble of similarly designed but smaller networks. The network modules are trained independently and then combined to form a naïve ensemble, which can be further fine-tuned using a sparsity parameter to improve performance. Our experiments on noisy speech data, generated by mixing the LibriSpeech and MUSAN datasets, demonstrate that a fine-tuned sparsely active ensemble can outperform a generalist while using significantly fewer calculations. The key insight of this paper, leveraging model selection as a form of network compression, may be used to supplement existing deep learning methods for speech denoising.
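The exclusive routing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear gate, the linear mask estimators, the feature size, and all names (`gate`, `run_expert`, `sparse_ensemble`) are hypothetical stand-ins for trained modules, and the abstract's recurrent networks and sparsity parameter are not modeled. The sketch only shows that, under hard gating, a single specialist is evaluated per input.

```python
import numpy as np

rng = np.random.default_rng(0)

N_FEATURES = 8  # hypothetical feature size per frame
N_EXPERTS = 2   # e.g. two specialists, split by speaker gender or by degradation level

# Hypothetical stand-ins for trained modules: one linear gate and one
# linear mask estimator per specialist (random weights, for illustration only).
gate_weights = rng.normal(size=(N_FEATURES, N_EXPERTS))
expert_weights = [rng.normal(size=(N_FEATURES, N_FEATURES)) for _ in range(N_EXPERTS)]

calls = [0] * N_EXPERTS  # counts how often each specialist is evaluated


def gate(noisy):
    """Return the index of the single specialist chosen for this utterance."""
    logits = noisy.mean(axis=0) @ gate_weights  # pool over frames, then score
    return int(np.argmax(logits))


def run_expert(k, noisy):
    """Apply specialist k: estimate a mask in [0, 1] and scale the input."""
    calls[k] += 1
    mask = 1.0 / (1.0 + np.exp(-(noisy @ expert_weights[k])))
    return mask * noisy


def sparse_ensemble(noisy):
    """Hard (exclusive) routing: only the selected specialist is evaluated."""
    k = gate(noisy)
    return k, run_expert(k, noisy)


noisy = rng.normal(size=(100, N_FEATURES))  # 100 frames of dummy features
chosen, denoised = sparse_ensemble(noisy)
```

Because only one of the specialists runs per input, the inference cost in this sketch is that of the gate plus a single small module, which is the sense in which model selection can act as a form of network compression.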
