Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion
Abstract
Self-supervised learning (SSL) models such as HuBERT and WavLM serve as effective content encoders for non-parallel voice conversion (VC), but their large size and offline design make streaming use challenging. We therefore propose a novel lightweight streaming VC method based on knowledge distillation (KD) from an SSL model. An SSL model and its vector quantizer serve as the teacher content encoder. The student content encoder is trained to predict the teacher's discrete content units, ensuring consistency within the KD framework. To stabilize the prosody of the converted speech, a prosody predictor conditioned on content and speaker information is employed. A HiFi-GAN-like decoder generates waveforms from speaker, content, and prosody inputs. The student VC model leverages the SSL model's robust content encoding without requiring the SSL model at inference time, which enables streaming operation. Evaluations on any-to-any VC tasks show that our approach achieves naturalness comparable to modern offline VC methods and to the SSL-based teacher while remaining streamable.
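The distillation of discrete content described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the quantizer or the loss, so nearest-neighbor codebook assignment and a frame-level cross-entropy objective are assumed here, and all names, shapes, and random stand-in data are hypothetical.

```python
import numpy as np

def quantize(features, codebook):
    """Assign each frame to its nearest codebook entry (assumed teacher quantizer)."""
    # features: (T, D), codebook: (K, D) -> unit indices of shape (T,)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def distillation_loss(student_logits, teacher_units):
    """Mean cross-entropy between student predictions and teacher unit labels."""
    # student_logits: (T, K), teacher_units: (T,)
    shifted = student_logits - student_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(teacher_units)), teacher_units]
    return float(-picked.mean())

rng = np.random.default_rng(0)
T, D, K = 50, 16, 8  # frames, feature dim, codebook size (hypothetical values)
teacher_features = rng.normal(size=(T, D))  # stand-in for SSL features
codebook = rng.normal(size=(K, D))          # stand-in for the teacher's codebook
units = quantize(teacher_features, codebook)

# An untrained student (random logits) versus one that matches the teacher.
random_logits = rng.normal(size=(T, K))
perfect_logits = 20.0 * np.eye(K)[units]
loss_random = distillation_loss(random_logits, units)
loss_perfect = distillation_loss(perfect_logits, units)
```

In this sketch the loss falls toward zero as the student's predictions agree with the teacher's units. Because only the student runs at inference time, a real student encoder would additionally need to be causal or limited-lookahead to stream, which this example does not model.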