INTERSPEECH 2020

Non-Parallel Voice Conversion with Fewer Labeled Data by Conditional Generative Adversarial Networks

Abstract

Recent studies have shown remarkable success in voice conversion (VC) based on generative adversarial networks (GANs) without parallel data. In this paper, building on conditional generative adversarial networks (CGANs), we propose a self- and semi-supervised method, combined with mixup and data augmentation, that enables non-parallel many-to-many voice conversion with fewer labeled data. In this method, the discriminator of the CGAN learns not only to distinguish real from fake samples but also to classify attribute domains. We augment the discriminator with an auxiliary task to improve representation learning and introduce a training task that predicts labels for the unlabeled samples. The proposed approach reduces the need for labeled data in voice conversion and enables a single generative network to implement many-to-many mapping between different voice domains. Experimental results show that the proposed method achieves comparable voice quality and speaker similarity with only 10% of the labeled data.
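The abstract names mixup and label prediction for unlabeled samples without giving details. As a rough illustration only, the generic form of these two ideas can be sketched as follows. This is not the authors' implementation: the function names, the Beta parameter `alpha`, and the confidence `threshold` are all hypothetical, and the thresholded argmax rule is one common pseudo-labeling choice that the paper may not use.

```python
import random


def one_hot(label, num_classes):
    """Encode an integer domain (speaker) label as a one-hot list."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]


def sample_lambda(alpha=0.2, rng=random):
    """Draw the mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


def mixup(x_a, y_a, x_b, y_b, lam):
    """Standard mixup: convex combination of two feature vectors and their labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


def pseudo_label(class_probs, threshold=0.9):
    """Assumed rule: accept the classifier's top class for an unlabeled
    sample only if its probability reaches the threshold, else return None."""
    best = max(range(len(class_probs)), key=lambda k: class_probs[k])
    return best if class_probs[best] >= threshold else None


if __name__ == "__main__":
    # Two toy "acoustic feature" vectors from two speaker domains.
    x, y = mixup([1.0, 3.0], one_hot(0, 2), [3.0, 1.0], one_hot(1, 2), lam=0.25)
    print(x, y)
    print(pseudo_label([0.05, 0.95]), pseudo_label([0.6, 0.4]))
```

In a setup like the one the abstract describes, the mixed pairs and the accepted pseudo-labels would feed the discriminator's domain-classification branch alongside the small labeled set.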
