INTERSPEECH 2024

Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion

Abstract

Self-supervised learning (SSL) models such as HuBERT and WavLM serve as effective content encoders for non-parallel voice conversion (VC), but their large size and offline design make streaming use challenging. We therefore derive a novel lightweight streaming VC system by applying knowledge distillation (KD) from an SSL model. A promising SSL model and its vector quantizer serve as the teacher content encoder. The student content encoder is trained to predict the teacher's discrete content units, which keeps it consistent with the teacher within the KD framework. To stabilize the prosody of the converted speech, a prosody predictor that uses content and speaker information is employed. A HiFi-GAN-like decoder generates waveforms from speaker, content, and prosody inputs. The student VC system leverages the SSL model's robust content encoding without relying on the SSL model at inference time, which enables streaming operation. Evaluations on any-to-any VC tasks show that our approach achieves naturalness comparable to modern offline VC systems and to the SSL-based teacher while remaining streamable.
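The distillation target described in the abstract can be illustrated with a minimal sketch. It assumes (the abstract does not say this) that the teacher's vector quantizer is a nearest-neighbor codebook lookup and that the student is trained with a frame-wise cross-entropy loss on the resulting unit indices. The names `quantize`, `cross_entropy`, `distillation_loss`, and the toy codebook are hypothetical and are not taken from the paper.

```python
import math

def quantize(frame, codebook):
    """Return the index of the nearest codebook vector (squared Euclidean distance).

    Assumption: stands in for the teacher's vector quantizer.
    """
    dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
    return dists.index(min(dists))

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def distillation_loss(teacher_frames, student_logits, codebook):
    """Average frame-wise loss of student logits against teacher discrete units."""
    units = [quantize(frame, codebook) for frame in teacher_frames]
    losses = [cross_entropy(l, u) for l, u in zip(student_logits, units)]
    return units, sum(losses) / len(losses)

# Toy data: 3 frames of 2-dimensional teacher features, a 3-entry codebook,
# and student logits over the 3 units for each frame (all values hypothetical).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
teacher_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]]
student_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]

units, loss = distillation_loss(teacher_frames, student_logits, codebook)
print(units, loss)
```

In this sketch only the student's logits are needed once training is done, which mirrors the abstract's point that the SSL teacher is not required at inference time.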
