Non-Parallel Voice Conversion Using Weighted Generative Adversarial Networks

Dipjyoti Paul; Yannis Pantazis; Yannis Stylianou

INTERSPEECH 2019

Abstract

In this paper, we propose a novel way to train Generative Adversarial Networks (GANs) for non-parallel, many-to-many voice conversion. The goal of voice conversion (VC) is to transform speech from a source speaker so that it sounds like a target speaker without changing the phonetic content. Based on ideas from game theory, we propose multiplying the gradient of the Generator by suitable weights. The weights are calculated so that they increase the influence of fake samples that fool the Discriminator, resulting in a stronger Generator. Motivated by a recently presented GAN-based approach for VC, StarGAN-VC, we propose a variation of StarGAN, referred to as Weighted StarGAN (WeStarGAN). The experiments are conducted on the standard CMU ARCTIC database. The WeStarGAN-VC approach achieves significantly better relative performance and is clearly preferred over the recently proposed StarGAN-VC method in terms of subjective speech quality and speaker similarity, with 75% and 65% preference scores, respectively.
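The weighting idea in the abstract can be illustrated with a small sketch. The abstract does not give the weight formula, so the softmax-style rule below, the temperature `eta`, and the function names `fooling_weights` and `weighted_generator_loss` are all assumptions made for illustration, not the authors' exact method. The sketch only shows the general mechanism: fake samples that receive a higher Discriminator score get a larger share of the Generator loss, and therefore of its gradient.

```python
import math

def fooling_weights(d_fake, eta=1.0):
    """Normalized weights over a batch of Discriminator scores on fake samples.

    d_fake: list of D(G(z_i)) values in (0, 1]; a higher value means the
            fake sample fooled the Discriminator more.
    eta:    hypothetical temperature (an assumption, not from the paper);
            eta = 0 gives uniform weights, i.e. the unweighted case.
    """
    m = max(d_fake)  # subtract the max for numerical stability
    exps = [math.exp(eta * (d - m)) for d in d_fake]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_generator_loss(d_fake, eta=1.0, eps=1e-12):
    """Weighted non-saturating Generator loss: -sum_i w_i * log D(G(z_i)).

    The weights are treated as constants, so in a framework with automatic
    differentiation each sample's gradient would be scaled by its weight.
    """
    w = fooling_weights(d_fake, eta)
    return -sum(wi * math.log(di + eps) for wi, di in zip(w, d_fake))

if __name__ == "__main__":
    scores = [0.1, 0.5, 0.9]  # made-up Discriminator outputs on three fakes
    print(fooling_weights(scores, eta=2.0))
    print(weighted_generator_loss(scores, eta=2.0))
```

With `eta = 0` the weights are uniform and the loss reduces to the usual batch average, which makes it easy to compare the weighted and unweighted cases on the same scores.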

Topics

Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Application Areas > Data Augmentation

Keywords

voice conversion, generative adversarial network, speaker similarity, non-parallel training, many-to-many transformation

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019