Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks

Minchuan Chen; Weijian Hou; Jun Ma; Shaojun Wang; Jing Xiao

INTERSPEECH 2020

Abstract

Recent studies have shown remarkable success in voice conversion (VC) based on generative adversarial networks (GANs) without parallel data. In this paper, building on conditional generative adversarial networks (CGANs), we propose a self- and semi-supervised method, combined with mixup and data augmentation, that allows non-parallel many-to-many voice conversion with less labeled data. In this method, the CGAN discriminator learns not only to distinguish real from fake samples, but also to classify attribute domains. We augment the discriminator with an auxiliary task to improve representation learning and introduce a training task to predict labels for the unlabeled samples. The proposed approach reduces the need for labeled data in voice conversion and enables a single generative network to implement many-to-many mapping between different voice domains. Experimental results show that the proposed method achieves comparable voice quality and speaker similarity with only 10% of the labeled data.
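The abstract names mixup as one ingredient but does not say how it is applied. As a minimal sketch of the standard mixup formulation only (not the authors' implementation), the snippet below blends two feature vectors and their one-hot domain labels with a weight drawn from a Beta distribution. The function name `mixup`, the `alpha` value, and the toy inputs are all assumptions made for illustration.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels (standard mixup).

    alpha is a hypothetical hyperparameter; the abstract gives no value.
    """
    rng = rng or random.Random()
    # Mixing weight lam ~ Beta(alpha, alpha), always within [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the features and of the labels, same weight.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Toy stand-ins for acoustic features and two voice-domain labels.
x_mix, y_mix, lam = mixup(
    [1.0, 2.0, 3.0], [1.0, 0.0],
    [3.0, 2.0, 1.0], [0.0, 1.0],
    rng=random.Random(0),
)
```

Because the labels are mixed with the same weight as the features, the blended label remains a valid distribution over domains, which is what lets a classifying discriminator train on the interpolated samples.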

Authors

Minchuan Chen, Weijian Hou, Jun Ma, Shaojun Wang, Jing Xiao

Topics

Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Semi-Supervised Learning
Deep Learning > Models > Generative Models

Keywords

semi-supervised learning, data augmentation, voice conversion, conditional generative adversarial network, non-parallel data, mixup data augmentation, non-parallel voice conversion
