Streamable Speech Representation Disentanglement and Multi-Level Prosody Modeling for Live One-Shot Voice Conversion

Haoquan Yang; Liqun Deng; Yu Ting Yeung; Nianzu Zheng; Yong Xu

INTERSPEECH 2022

Abstract

This paper tackles the challenge of "live" one-shot voice conversion (VC), which performs conversion across arbitrary speakers in a streaming manner while retaining high intelligibility and naturalness. We propose a VC model based on hybrid unsupervised and supervised learning, trained with a two-stage strategy. Specifically, we first employ an unsupervised disentanglement framework to separate speech representations of different granularity using a mutual information constraint and vector quantization. We then augment linguistic content modeling with a supervised ASR acoustic encoder. To perform live conversion, we design the model with streamable neural networks and run it in streaming mode with sliding windows. Experimental results demonstrate that the proposed method achieves performance comparable to offline VC solutions in speech naturalness, intelligibility, and speaker similarity, with sufficient efficiency for practical real-time applications. Audio samples are available online for demonstration.
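As a rough illustration of what running a model "in streaming mode with sliding windows" can look like, the sketch below feeds a per-frame conversion function one chunk of new frames at a time, together with a few frames of left context, and keeps only the outputs for the new frames. This is not the authors' implementation: the names `sliding_windows`, `stream_convert`, and `convert_fn`, the chunk and context sizes, and the assumption that the converter returns one output per input frame are all hypothetical, since the abstract does not specify them.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def sliding_windows(
    frames: Sequence[float], chunk_size: int, left_context: int
) -> Iterator[Tuple[List[float], int]]:
    """Yield (window, n_new) pairs.

    Each window holds up to `left_context` already-seen frames followed by
    the next chunk of new frames; `n_new` is the number of new frames.
    """
    if chunk_size <= 0 or left_context < 0:
        raise ValueError("chunk_size must be > 0 and left_context >= 0")
    for start in range(0, len(frames), chunk_size):
        end = min(start + chunk_size, len(frames))
        ctx_start = max(0, start - left_context)
        yield list(frames[ctx_start:end]), end - start


def stream_convert(
    frames: Sequence[float],
    convert_fn: Callable[[List[float]], List[float]],
    chunk_size: int = 4,
    left_context: int = 2,
) -> List[float]:
    """Run `convert_fn` window by window and stitch the new-frame outputs.

    Assumption: `convert_fn` returns exactly one output per input frame.
    """
    output: List[float] = []
    for window, n_new in sliding_windows(frames, chunk_size, left_context):
        converted = convert_fn(window)
        # Drop outputs for the context frames; they were emitted earlier.
        output.extend(converted[-n_new:])
    return output
```

In this sketch the left context only gives the converter some history at chunk boundaries; latency is governed by `chunk_size`, because a chunk must be fully received before it is processed.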

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Machine Learning > Core Methods > Representation Learning
Speech & Audio > Synthesis > Speech Enhancement
Machine Learning > Learning Types > Representation Learning
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

vector quantization, representation disentanglement, prosody modeling, one-shot voice conversion, streamable neural network, speech disentanglement, streamable speech
