Improved Zero-Shot Voice Conversion Using Explicit Conditioning Signals

Shahan Nercessian

INTERSPEECH 2020

Abstract

In this paper, we propose a zero-shot voice conversion algorithm that adds a number of conditioning signals to explicitly transfer prosody, linguistic content, and dynamics to conversion results. We show that the proposed approach improves overall conversion quality and generalization to out-of-domain samples relative to a baseline implementation of AutoVC, as the inclusion of conditioning signals can help reduce the burden on the model’s encoder to implicitly learn all of the different aspects involved in speech production. An ablation analysis illustrates the effectiveness of the proposed method.

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Speech & Audio > Synthesis > Text-to-Speech

Keywords

zero-shot learning, voice conversion, out-of-domain generalization, speaker identity, speaker adaptation, zero-shot voice conversion, prosody transfer, linguistic content, conditioning signal
