Lightweight Zero-shot Text-to-Speech with Mixture of Adapters

Kenichi Fujita; Takanori Ashihara; Marc Delcroix; Yusuke Ijima

INTERSPEECH 2024

Abstract

Recent advances in zero-shot text-to-speech (TTS) methods based on large-scale models have demonstrated high fidelity in reproducing speaker characteristics. However, these models are too large for practical daily use. We propose a lightweight zero-shot TTS method using a mixture of adapters (MoA). Our proposed method incorporates MoA modules into the decoder and the variance adapter of a non-autoregressive TTS model. These modules enhance the ability to adapt to a wide variety of speakers in a zero-shot manner by selecting appropriate adapters associated with speaker characteristics on the basis of speaker embeddings. Our method achieves high-quality speech synthesis with minimal additional parameters. Through objective and subjective evaluations, we confirmed that our method achieves better performance than the baseline with fewer than 40% of the parameters and 1.9 times faster inference.
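To make the adapter-selection idea concrete, the following is a minimal sketch of one way a mixture-of-adapters layer could be conditioned on a speaker embedding. The abstract does not specify the gating function or the adapter architecture, so everything here is an assumption for illustration: the class name `MixtureOfAdapters`, the soft (softmax) gate, the bottleneck adapters with ReLU, the residual connection, and all dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfAdapters:
    """Hypothetical MoA layer: a gate driven by the speaker embedding
    mixes the outputs of several small bottleneck adapters."""

    def __init__(self, hidden_dim, speaker_dim, num_adapters, bottleneck_dim, seed=0):
        rng = random.Random(seed)
        # Assumption: a single linear gate maps the speaker embedding to adapter scores.
        self.gate = random_matrix(num_adapters, speaker_dim, rng)
        # Assumption: each adapter is a down-projection, ReLU, and up-projection.
        self.down = [random_matrix(bottleneck_dim, hidden_dim, rng) for _ in range(num_adapters)]
        self.up = [random_matrix(hidden_dim, bottleneck_dim, rng) for _ in range(num_adapters)]

    def gate_weights(self, speaker_embedding):
        """Return one mixing weight per adapter; the weights sum to 1."""
        return softmax(matvec(self.gate, speaker_embedding))

    def forward(self, hidden, speaker_embedding):
        """Add the gate-weighted adapter outputs to the hidden vector (residual)."""
        weights = self.gate_weights(speaker_embedding)
        output = list(hidden)
        for weight, down, up in zip(weights, self.down, self.up):
            bottleneck = [max(0.0, v) for v in matvec(down, hidden)]
            delta = matvec(up, bottleneck)
            output = [o + weight * d for o, d in zip(output, delta)]
        return output


if __name__ == "__main__":
    moa = MixtureOfAdapters(hidden_dim=8, speaker_dim=4, num_adapters=3, bottleneck_dim=2)
    speaker = [0.1, -0.2, 0.3, 0.4]
    print(moa.gate_weights(speaker))
    print(moa.forward([1.0] * 8, speaker))
```

In this sketch the only per-speaker computation is the gate, and each adapter adds roughly `2 * hidden_dim * bottleneck_dim` weights, which illustrates how such modules can keep the parameter overhead small. A hard (top-k) gate would be an equally plausible reading of "selecting appropriate adapters".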

Topics

Machine Learning > Learning Types > Zero-Shot Learning

Machine Learning > Application Areas > Knowledge Distillation

Speech & Audio > Synthesis > Text-to-Speech

Machine Learning > Learning Paradigms > Transfer Learning

Deep Learning > Learning Types > Transfer Learning

Keywords

model compression, zero-shot learning, speaker embedding, parameter-efficient fine-tuning, speaker adaptation, mixture of adapters

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024