Speech Synthesis

164 directly classified papers

Papers

Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment ACL 2025

InSerter: Speech Instruction Following with Unsupervised Interleaved Pre-training ACL 2025

Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model AAAI 2025

ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control ACL 2025

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation ACL 2025

SimulS2S-LLM: Unlocking Simultaneous Inference of Speech LLMs for Speech-to-Speech Translation ACL 2025

TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching AAAI 2025

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech AAAI 2025

Takin-VC: Expressive Zero-Shot Voice Conversion via Adaptive Hybrid Content Encoding and Enhanced Timbre Modeling ACL 2025

In-the-wild Audio Spatialization with Flexible Text-guided Localization ACL 2025

FlashAudio: Rectified Flow for Fast and High-Fidelity Text-to-Audio Generation ACL 2025

PointTalk: Audio-Driven Dynamic Lip Point Cloud for 3D Gaussian-based Talking Head Synthesis AAAI 2025

ELLA-V: Stable Neural Codec Language Modeling with Alignment-Guided Sequence Reordering AAAI 2025

Rhythm Controllable and Efficient Zero-Shot Voice Conversion via Shortcut Flow Matching ACL 2025

VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models ICCV 2025

DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech AAAI 2025

EmoReg: Directional Latent Vector Modeling for Emotional Intensity Regularization in Diffusion-based Voice Conversion AAAI 2025

CSSinger: End-to-End Chunkwise Streaming Singing Voice Synthesis System Based on Conditional Variational Autoencoder AAAI 2025

Cauchy Diffusion: A Heavy-tailed Denoising Diffusion Probabilistic Model for Speech Synthesis AAAI 2025

FaceSpeak: Expressive and High-Quality Speech Synthesis from Human Portraits of Different Styles AAAI 2025

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation AAAI 2025

ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis AAAI 2025

Drop the Beat! Freestyler for Accompaniment Conditioned Rapping Voice Generation AAAI 2025

StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching AAAI 2025

DNASpeech: A Contextualized and Situated Text-to-Speech Dataset with Dialogues, Narratives and Actions ACL 2025