INTERSPEECH 2023

LightClone: Speaker-guided Parallel Subnet Selection for Few-shot Voice Cloning

Abstract

A large-scale few-shot voice cloning service faces three main challenges: storing models for a huge number of users, fast model training, and real-time synthesis. All three depend directly on model size. A few-shot voice cloning model is usually much larger than a common TTS model trained on a single-speaker corpus, since its source model needs more parameters to hold the characteristics of various speakers. This also suggests that a high-quality TTS model for a single voice could be much smaller. To reduce the model size of voice cloning, this paper proposes speaker-guided parallel subnet selection (SG-PSS). In the adaptation phase, only one subnet is selected from the parallel subnets of the source model for the target speaker, which makes adaptation training and inference much faster. Experimental results show that the proposed approach achieves a 4x model compression ratio and a 3x inference speedup, with even slightly better voice quality and speaker similarity than the baseline.
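
The selection idea can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how subnets are built or scored, so the linear subnets, the gate driven by a speaker embedding, the argmax rule, and all names and sizes below (`NUM_SUBNETS`, `HIDDEN`, `SPK_DIM`, `select_subnet`) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to make the example concrete.
NUM_SUBNETS = 4
HIDDEN = 8
SPK_DIM = 16

# Toy "source model": parallel subnets (single linear layers here)
# plus a gate that scores each subnet from a speaker embedding.
# Both the gate and the linear subnets are assumptions, not the paper's design.
subnets = [rng.standard_normal((HIDDEN, HIDDEN)) for _ in range(NUM_SUBNETS)]
gate = rng.standard_normal((SPK_DIM, NUM_SUBNETS))


def select_subnet(speaker_embedding):
    """Return the index of the subnet with the highest gate score."""
    scores = speaker_embedding @ gate
    return int(np.argmax(scores))


def cloned_forward(x, weights):
    """Forward pass of the compact per-speaker model (one subnet only)."""
    return x @ weights


# Adaptation for one target speaker: keep a single subnet, drop the rest.
speaker_embedding = rng.standard_normal(SPK_DIM)
k = select_subnet(speaker_embedding)
selected = subnets[k].copy()

source_params = sum(w.size for w in subnets) + gate.size
cloned_params = selected.size
output = cloned_forward(rng.standard_normal((2, HIDDEN)), selected)
```

In this toy setup the per-speaker model stores one subnet instead of all parallel ones, which is the source of the storage and speed savings the abstract describes; the ratio obtained here depends only on the made-up sizes and says nothing about the reported 4x figure.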
