Sample-Efficient Diffusion for Text-To-Speech Synthesis

Justin Lovelace; Soham Ray; Kwangyoun Kim; Kilian Q. Weinberger; Felix Wu

INTERSPEECH 2024

Abstract

This work introduces Sample-Efficient Speech Diffusion (SESD), an algorithm for effective speech synthesis in modest data regimes through latent diffusion. It is based on a novel diffusion architecture, which we call the U-Audio Transformer (U-AT), that scales efficiently to long sequences and operates in the latent space of a pre-trained audio autoencoder. Conditioned on character-aware language model representations, SESD achieves impressive results despite training on less than 1k hours of speech, far less than current state-of-the-art systems. In fact, it synthesizes more intelligible speech than the state-of-the-art auto-regressive model, VALL-E, while using less than 2% of the training data. Our implementation is available at https://github.com/justinlovelace/SESD.

Topics

Speech & Audio > Synthesis > Text-to-Speech

Keywords

diffusion model; latent diffusion; auto-regressive model; text-to-speech synthesis
