INTERSPEECH 2022

Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion

Abstract

This paper tackles the challenge of "live" one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming manner while retaining high intelligibility and naturalness. We propose a VC model that combines unsupervised and supervised learning and is trained with a two-stage strategy. Specifically, we first employ an unsupervised disentanglement framework that separates speech representations of different granularity using a mutual information constraint and vector quantization. We then augment linguistic content modeling with a supervised ASR acoustic encoder. To perform live conversion, we build the model from streamable neural networks and run it in streaming mode with sliding windows. Experimental results demonstrate that the proposed method achieves speech naturalness, intelligibility, and speaker similarity comparable to offline VC solutions, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.
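
The sliding-window streaming mode mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `stream_convert`, the parameters `window` and `hop`, and the stand-in `convert_window` callable are all hypothetical, and the abstract does not state the window or hop sizes used. The sketch assumes the model maps a window of input frames to the same number of output frames, so that only the newest `hop` outputs need to be emitted at each step.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def stream_convert(
    frames: Iterable[float],
    convert_window: Callable[[Sequence[float]], Sequence[float]],
    window: int = 8,
    hop: int = 4,
) -> Iterator[float]:
    """Convert a frame stream hop by hop, reusing past frames as context.

    All names and default sizes here are illustrative assumptions.
    `convert_window` must return one output frame per input frame.
    """
    if hop < 1 or hop > window:
        raise ValueError("require 1 <= hop <= window")

    buffer: List[float] = []  # the most recent `window` input frames
    pending = 0               # frames received since the last emission

    for frame in frames:
        buffer.append(frame)
        if len(buffer) > window:
            buffer.pop(0)     # slide the window forward by one frame
        pending += 1
        if pending == hop:
            converted = convert_window(buffer)
            # Emit only the outputs for the frames not yet emitted.
            yield from converted[-hop:]
            pending = 0

    if pending:
        # Flush the final partial hop at end of stream.
        converted = convert_window(buffer)
        yield from converted[-pending:]


# Usage with a toy "model" that doubles each frame value.
toy_model = lambda w: [2 * x for x in w]
result = list(stream_convert(range(10), toy_model, window=8, hop=4))
```

In this sketch the latency is bounded by `hop` frames plus the model's compute time per window, while the older frames in the buffer serve only as left context. A real streamable network would typically cache internal states instead of recomputing the whole window at every step.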
