Knowledge Distillation from Self-Supervised Representation Learning Model with Discrete Speech Units for Any-to-Any Streaming Voice Conversion

Hiroki Kanagawa; Yusuke Ijima

INTERSPEECH 2024

Abstract

Self-supervised learning (SSL) models such as HuBERT and WavLM serve as effective content encoders for non-parallel voice conversion (VC), but their large size and offline-oriented design make streaming use challenging. We therefore propose a novel lightweight streaming VC method that uses knowledge distillation (KD) from the SSL model. A promising SSL model and its vector quantizer are used as the teacher content encoder. The student content encoder predicts the discrete content produced by the teacher, ensuring consistency within the KD framework. To stabilize the prosody of the converted speech, a prosody predictor that uses content and speaker information is employed. A HiFi-GAN-like decoder generates waveforms from speaker, content, and prosody inputs. Our student VC leverages the SSL model's robust content encoding without relying on the SSL model at inference time, which enables streaming operation. Evaluations on any-to-any VC tasks show that our approach achieves naturalness comparable to modern offline VC systems and to the SSL-based teacher while remaining streamable.
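The distillation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state the loss function, the quantizer type, or any dimensions, so everything here is an assumption: the teacher's vector quantizer is modeled as a nearest-neighbor codebook lookup, the student is trained with frame-level cross-entropy against the resulting unit indices, and the names `quantize` and `distillation_loss` are hypothetical.

```python
import math
import random


def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.

    Assumption: the teacher's vector quantizer behaves like a
    nearest-neighbor lookup under squared Euclidean distance.
    """
    units = []
    for frame in frames:
        dists = [sum((x - c) ** 2 for x, c in zip(frame, code)) for code in codebook]
        units.append(dists.index(min(dists)))
    return units


def distillation_loss(student_logits, teacher_units):
    """Mean cross-entropy of per-frame student logits against teacher units.

    Assumption: the student predicts a categorical distribution over the
    teacher's discrete units for every frame.
    """
    total = 0.0
    for logits, unit in zip(student_logits, teacher_units):
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
        total += log_norm - logits[unit]
    return total / len(teacher_units)


# Toy data standing in for teacher SSL features and student outputs.
random.seed(0)
NUM_FRAMES, FEATURE_DIM, NUM_UNITS = 20, 4, 8  # illustrative sizes only
codebook = [[random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_UNITS)]
teacher_features = [[random.gauss(0, 1) for _ in range(FEATURE_DIM)] for _ in range(NUM_FRAMES)]

teacher_units = quantize(teacher_features, codebook)
student_logits = [[random.gauss(0, 1) for _ in range(NUM_UNITS)] for _ in range(NUM_FRAMES)]
loss = distillation_loss(student_logits, teacher_units)
```

In this sketch only the student runs at inference time. The teacher and its codebook are needed only to produce training targets, which matches the abstract's claim that the SSL model is not required for inference.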

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Application Areas > Knowledge Distillation
Deep Learning > Architectures > Transformers

Keywords

knowledge distillation, voice conversion, neural codec, discrete speech unit, self-supervised representation learning
