Investigation of Cost Function for Supervised Monaural Speech Separation

Yun Liu; Hui Zhang; Xueliang Zhang; Yuhang Cao

INTERSPEECH 2019

Abstract

Speech separation aims to improve the quality of noisy speech. Deep-learning-based speech separation methods usually use the mean square error (MSE) as the cost function, which measures the distance between the model output and the training target. However, the MSE does not match the evaluation metrics perfectly: optimizing it does not directly improve commonly used metrics such as short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), signal-to-noise ratio (SNR), and source-to-distortion ratio (SDR). In this study, we examine alternative cost function candidates based on divergences, e.g., the Kullback-Leibler and Itakura-Saito divergences. A conjecture about the correlation between cost functions and evaluation metrics is proposed and examined to explain why these cost functions behave differently. On the basis of this conjecture, the optimal cost function candidate is selected. The experimental results validate our conjecture.
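As a rough illustration of the three cost functions named in the abstract, the sketch below uses their standard textbook definitions on non-negative spectral values (for example, magnitude or power spectrogram bins). The abstract does not give the exact formulation used in the paper, so the generalized form of the Kullback-Leibler divergence, the averaging over bins, the `EPS` floor, and all function names are assumptions made for this example.

```python
import math

EPS = 1e-8  # assumed floor that keeps log() and division well defined


def _floor(values, eps=EPS):
    """Clamp non-negative spectral values away from zero."""
    return [max(float(v), eps) for v in values]


def mse(target, estimate):
    """Mean square error between two equal-length sequences."""
    return sum((t - e) ** 2 for t, e in zip(target, estimate)) / len(target)


def kl_divergence(target, estimate, eps=EPS):
    """Generalized Kullback-Leibler divergence, averaged over bins."""
    t, e = _floor(target, eps), _floor(estimate, eps)
    return sum(a * math.log(a / b) - a + b for a, b in zip(t, e)) / len(t)


def is_divergence(target, estimate, eps=EPS):
    """Itakura-Saito divergence, averaged over bins."""
    t, e = _floor(target, eps), _floor(estimate, eps)
    return sum(a / b - math.log(a / b) - 1.0 for a, b in zip(t, e)) / len(t)
```

Under these definitions the three costs respond differently to signal level: multiplying both inputs by a constant c scales the MSE by c squared and the generalized Kullback-Leibler divergence by c, while the Itakura-Saito divergence is unchanged. This is a general property of the definitions, not a result taken from the paper.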

Topics

Machine Learning > Optimization & Theory > Loss Functions
Machine Learning > Optimization & Theory > Optimization

Keywords

speech separation, Kullback-Leibler divergence, Itakura-Saito divergence, cost function, mean square error, perceptual evaluation of speech quality
