2017 INTERSPEECH INTERSPEECH 2017

A Comparison of Perceptually Motivated Loss Functions for Binary Mask Estimation in Speech Separation

Abstract

This work proposes and compares perceptually motivated loss functions for deep learning based binary mask estimation for speech separation. Previous loss functions have focused on maximising classification accuracy of mask estimation but we now propose loss functions that aim to maximise the hit minus false-alarm (HIT-FA) rate which is known to correlate more closely to speech intelligibility. The baseline loss function is binary cross-entropy (CE), a standard loss function used in binary mask estimation, which maximises classification accuracy. We propose first a loss function that maximises the HIT-FA rate instead of classification accuracy. We then propose a second loss function that is a hybrid between CE and HIT-FA, providing a balance between classification accuracy and HIT-FA rate. Evaluations of the perceptually motivated loss functions with the GRID database show improvements to HIT-FA rate and ESTOI across babble and factory noises. Further tests then explore application of the perceptually motivated loss functions to a larger vocabulary dataset.

πŸŒ‰ Interdisciplinary Bridge β€” Deep Learning and Machine Learning
πŸ“ˆ Trend Setter β€” Loss Functions
🧭 Keyword Pioneer β€” perceptually motivated loss
🐣 Hot Topic Early Bird β€” speech separation
🐝 Cross-Pollinator β€” Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio