Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation

Nameer Hirschkind; Xiao Yu; Mahesh Kumar Nandwana; Joseph Liu; Eloi DuBois; Dao Le; Nicolas Thiebaut; Colin Sinclair; Kyle Spence; Charles Shang; Zoe Abrams; Morgan McGuire

INTERSPEECH 2024

Abstract

We introduce DiffuseST, a low-latency, direct speech-to-speech translation system that preserves the input speaker's voice in a zero-shot manner while translating from multiple source languages into English. We experiment with the synthesizer component of the architecture, comparing a Tacotron-based synthesizer to a novel diffusion-based synthesizer. The diffusion-based synthesizer improves the MOS and PESQ audio quality metrics by 23% each and speaker similarity by 5%, while maintaining comparable BLEU scores. Despite having more than double the parameter count, the diffusion synthesizer has lower latency than the Tacotron-based synthesizer, allowing the entire model to run more than 5 times faster than real-time.

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Application Areas > Domain Adaptation

Keywords

zero-shot learning, multilingual translation, diffusion model, speech-to-speech translation, voice cloning
