Singing Voice Extraction with Attention-Based Spectrograms Fusion

Hao Shi; Longbiao Wang; Sheng Li; Chenchen Ding; Meng Ge; Nan Li; Jianwu Dang; Hiroshi Seki

INTERSPEECH 2020

Abstract

We propose a novel attention-based spectrograms fusion system with minimum difference mask (MDM) estimation for singing voice extraction. Whereas previous work used a fully connected neural network, our system takes advantage of the multi-head attention mechanism. Specifically, we 1) try a variety of methods for embedding multiple spectrograms as the input to the attention mechanism, which can provide multi-scale correlation information between adjacent frames in the spectrograms; 2) add a regularization term to the loss function to obtain better spectrogram continuity; and 3) use the phase of the linear fusion waveform to reconstruct the final waveform, which can reduce the impact of the inconsistent spectrogram. Experiments on the MIR-1K dataset show that our system consistently improves the scores for perceptual evaluation of speech quality, signal-to-distortion ratio, signal-to-interference ratio, and signal-to-artifact ratio.
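The abstract does not define the minimum difference mask, so the following is only a minimal sketch under a stated assumption: that the MDM marks, for each time-frequency bin, which of two candidate magnitude spectrograms lies closer to the clean target, and that fusion then picks that candidate per bin. The function names `minimum_difference_mask` and `fuse` and the random toy data are hypothetical and not taken from the paper. In the proposed system the mask would be estimated by the attention network rather than computed from the target as it is here.

```python
import numpy as np

def minimum_difference_mask(est_a, est_b, target):
    """Assumed MDM: 1 where est_a is at least as close to the target as est_b, else 0."""
    return (np.abs(est_a - target) <= np.abs(est_b - target)).astype(est_a.dtype)

def fuse(est_a, est_b, mask):
    """Pick est_a where mask is 1 and est_b where mask is 0, bin by bin."""
    return mask * est_a + (1.0 - mask) * est_b

# Toy magnitude spectrograms (frequency bins x frames); values are illustrative only.
rng = np.random.default_rng(0)
target = rng.random((4, 6))
est_a = target + 0.1 * rng.standard_normal(target.shape)
est_b = target + 0.1 * rng.standard_normal(target.shape)

mask = minimum_difference_mask(est_a, est_b, target)
fused = fuse(est_a, est_b, mask)
```

With an oracle mask like this, the fused spectrogram is never farther from the target in any bin than the better of the two candidates, which is the upper bound a learned mask estimator would try to approach.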

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Models > Generative Models
Speech & Audio > Synthesis > Speech Enhancement
Deep Learning > Techniques > Transfer Learning

Keywords

source separation; attention mechanism; deep neural network; multi-head attention; mask estimation; singing voice; singing voice extraction; minimum difference mask

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020