Speech-to-Singing Conversion Based on Boundary Equilibrium GAN

Da-Yi Wu; Yi-Hsuan Yang

INTERSPEECH 2020

Abstract

This paper investigates the use of generative adversarial network (GAN)-based models for converting a speech signal into a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing speech-to-singing conversion as a style transfer problem. Specifically, given a speech input and the F0 contour of the target singing output, the proposed model generates the spectrogram of a singing signal with a progressive-growing encoder/decoder architecture. Moreover, the model uses a boundary equilibrium GAN loss term so that it can learn from both paired and unpaired data. The spectrogram is finally converted into a waveform with a separate GAN-based vocoder. Our quantitative and qualitative analyses show that the proposed model generates singing voices with much higher naturalness than an existing non-adversarially trained baseline.
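
As a rough illustration of the boundary equilibrium GAN loss term mentioned in the abstract, the sketch below follows the general BEGAN formulation (Berthelot et al., 2017), in which the discriminator is an autoencoder and a coefficient k balances the two players. It is not necessarily the exact loss used in this paper. The function name `began_step`, the use of scalar reconstruction errors as inputs, and the default values of `gamma` and `lambda_k` are assumptions made for illustration.

```python
def began_step(recon_real, recon_fake, k, gamma=0.5, lambda_k=0.001):
    """One step of generic BEGAN loss bookkeeping (hypothetical helper).

    recon_real: reconstruction error of the autoencoder discriminator on real data, L(x)
    recon_fake: reconstruction error on generated data, L(G(z))
    k:          current balance coefficient k_t, kept in [0, 1]
    gamma:      target ratio L(G(z)) / L(x) (assumed default)
    lambda_k:   step size for updating k (assumed default)
    """
    # Discriminator: reconstruct real data well, generated data poorly.
    d_loss = recon_real - k * recon_fake
    # Generator: make generated data easy for the discriminator to reconstruct.
    g_loss = recon_fake
    # Proportional control keeps the two reconstruction errors near the ratio gamma.
    balance = gamma * recon_real - recon_fake
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)
    # Global convergence measure proposed with BEGAN.
    convergence = recon_real + abs(balance)
    return d_loss, g_loss, k_next, convergence


# Example call with made-up scalar losses.
d_loss, g_loss, k_next, convergence = began_step(1.0, 0.4, k=0.0, lambda_k=0.1)
```

In a real training loop the two reconstruction errors would come from the discriminator's forward pass on a batch, and `k_next` would be carried over to the following step.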

Topics

Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Application Areas > Domain Adaptation
Deep Learning > Models > Generative Models
Computer Vision > Generation > Image Translation
Speech & Audio > Synthesis > Speech Enhancement
Deep Learning > Learning Types > Adversarial Learning

Keywords

style transfer, voice conversion, speech synthesis, generative adversarial network, spectrogram generation, speech-to-singing conversion, neural vocoder, boundary equilibrium GAN
