EZ-VC: Easy Zero-shot Any-to-Any Voice Conversion

Advait Joglekar; Divyanshu Singh; Rooshil Rohit Bhatia; Srinivasan Umesh

EMNLP 2025

Abstract

Voice conversion research has recently focused on improving the zero-shot capabilities of existing methods. Despite remarkable advances, current architectures still struggle in zero-shot cross-lingual settings and often fail to generalize to speakers of unseen languages and accents. In this paper, we adopt a simple yet effective approach that combines discrete speech representations from self-supervised models with a non-autoregressive, Diffusion-Transformer-based conditional flow matching speech decoder. We show that this architecture allows us to train a voice conversion model in a purely textless, self-supervised fashion, without requiring multiple encoders to disentangle speech features. Our model also excels in zero-shot cross-lingual settings, even for unseen languages. We provide our code, model checkpoint, and demo samples at https://github.com/ez-vc/ez-vc
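The abstract names two generic building blocks: discrete units derived from self-supervised features, and a conditional flow matching decoder. The toy sketch below illustrates both ideas in plain Python and is not the authors' implementation. It assumes nearest-centroid quantization for the discrete units and a straight-line probability path for flow matching, neither of which the abstract specifies. All function names (`quantize_to_units`, `cfm_training_pair`, `euler_sample`) are hypothetical.

```python
import random


def quantize_to_units(frames, centroids):
    """Map each feature frame to the index of its nearest centroid.

    Assumption: discrete units come from nearest-centroid lookup
    (k-means style); the abstract does not say which quantizer is used.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sq_dist(frame, centroids[k]))
        for frame in frames
    ]


def cfm_training_pair(x1, t, rng):
    """Build one flow matching training example for a target frame x1.

    Assumption: straight-line path x_t = (1 - t) * x0 + t * x1 with
    Gaussian noise x0, so the regression target is the constant
    velocity x1 - x0. A decoder would be trained to predict this
    velocity from (x_t, t, conditioning).
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x0, x_t, v_target


def euler_sample(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    units = quantize_to_units([[0.1, 0.0], [0.9, 1.1]], [[0.0, 0.0], [1.0, 1.0]])
    print("units:", units)

    target = [0.5, -0.25, 2.0]
    x0, x_t, v = cfm_training_pair(target, t=0.3, rng=rng)
    # With the exact (oracle) velocity, sampling recovers the target frame.
    recovered = euler_sample(lambda x, t: v, x0, steps=10)
    print("recovered:", recovered)
```

In a real system the oracle velocity in the last step would be replaced by a trained network conditioned on the unit sequence and a reference speaker prompt; that conditioning is omitted here.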

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Models > Diffusion Models
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning

Keywords

zero-shot learning; self-supervised learning; diffusion model; cross-lingual voice conversion
