Tight Integration of Spatial and Spectral Features for BSS with Deep Clustering Embeddings

Lukas Drude; Reinhold Haeb-Umbach

INTERSPEECH 2017

Abstract

Recent advances in discriminatively trained mask estimation networks, which extract a single source with the help of beamforming techniques, demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution, we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation-maximization (EM) algorithm which, at test time, jointly estimates a model for deep clustering embeddings and complex-valued spatial observations in the short-time Fourier transform (STFT) domain. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone, in terms of gains in signal-to-distortion ratio (SDR) and in the perceptually motivated PESQ metric. ASR results on a reverberated dataset further show that these gains translate to reduced word error rates (WERs) even in reverberant environments.
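
The joint EM idea described in the abstract can be illustrated with a toy sketch. The abstract does not name the distributions, so everything below is an assumption made for illustration: embeddings are modeled by a von Mises-Fisher-style mixture and spatial observations by a complex Watson-style mixture, both with fixed, shared concentrations (so the normalizers cancel in the posteriors), and all time-frequency bins share one spatial model, whereas a practical system would typically use frequency-dependent spatial parameters. The function name `joint_em`, its parameters, and the farthest-point initialization are hypothetical and not taken from the paper.

```python
import numpy as np


def joint_em(emb, obs, n_src, n_iter=20, kappa_emb=10.0, kappa_obs=10.0):
    """Toy joint EM over two feature streams sharing one class posterior.

    emb: (N, E) real, unit-norm embeddings, one per time-frequency bin.
    obs: (N, D) complex, unit-norm spatial observation vectors.
    Returns gamma: (N, n_src) class posteriors (soft masks).
    Assumed model (illustrative only): fixed-concentration von Mises-Fisher
    for emb and fixed-concentration complex Watson for obs.
    """
    # Farthest-point initialization on the embeddings (illustrative choice).
    idx = [0]
    for _ in range(n_src - 1):
        sim = (emb @ emb[idx].T).max(axis=1)
        idx.append(int(sim.argmin()))
    log_p = kappa_emb * emb @ emb[idx].T
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: mixture weights, embedding means, spatial mode vectors.
        pi = gamma.mean(axis=0)                              # (K,)
        mu = gamma.T @ emb                                   # (K, E)
        mu /= np.linalg.norm(mu, axis=1, keepdims=True) + 1e-12
        # Posterior-weighted spatial covariance per class: (K, D, D).
        cov = np.einsum('nk,nd,ne->kde', gamma, obs, obs.conj())
        _, vecs = np.linalg.eigh(cov)                        # ascending eigenvalues
        w = vecs[:, :, -1]                                   # (K, D) principal eigenvectors

        # E-step: both streams contribute to the same class posterior.
        log_p = (
            np.log(pi + 1e-12)
            + kappa_emb * emb @ mu.T
            + kappa_obs * np.abs(obs @ w.conj().T) ** 2
        )
        log_p -= log_p.max(axis=1, keepdims=True)            # numerical stability
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
    return gamma
```

In this sketch the returned posteriors play the role of source masks; in a pipeline like the one the abstract outlines, such masks would then feed a beamforming step. The key design point is that the embedding term and the spatial term are summed in the log domain, so a bin is assigned to a source only when both feature streams agree on average.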

Topics

Machine Learning > Core Methods > Embedding Learning
Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Techniques > Pretraining
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Analysis > Speech Enhancement

Keywords

blind source separation; source separation; speech recognition; expectation maximization; deep clustering; perceptual evaluation of speech quality; deep clustering embedding; mask estimation network; beamforming technique; signal to distortion ratio
