Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Dake Guo; Xinfa Zhu; Liumeng Xue; Yongmao Zhang; Wenjie Tian; Lei Xie

INTERSPEECH 2024

Abstract

Recent advances in text-to-speech have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manual labels or reference speech. To address this, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space that covers diverse styles via contrastive learning under the supervision of the speech style space. Meanwhile, we adopt a context encoder to incorporate cross-sentence information and the style embedding obtained from text. Finally, we integrate the context encoder into two typical TTS models: a VITS-based TTS and a language model-based TTS. Experimental results show that the proposed approach effectively captures diverse styles and coherent prosody, and thus improves naturalness and expressiveness in audiobook speech synthesis.
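The abstract does not state the exact form of the contrastive objective, so the following is only a minimal sketch of one common choice: a symmetric InfoNCE loss that pulls each sentence's text-derived style embedding toward the speech-derived style embedding of the same utterance and pushes it away from the others in the batch. The function names, the temperature value, and the use of cosine similarity are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_style_loss(text_emb, speech_emb, temperature=0.1):
    """Symmetric InfoNCE between text and speech style embeddings (assumed form).

    Row i of text_emb and row i of speech_emb are treated as a positive pair;
    every other row in the batch serves as a negative.
    """
    t = l2_normalize(text_emb)
    s = l2_normalize(speech_emb)
    logits = t @ s.T / temperature              # (batch, batch) similarity matrix
    idx = np.arange(len(t))
    loss_t2s = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> speech
    loss_s2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # speech -> text
    return 0.5 * (loss_t2s + loss_s2t)


# Hypothetical toy data: 8 utterances with 16-dimensional style embeddings.
rng = np.random.default_rng(0)
speech_styles = rng.normal(size=(8, 16))
text_styles = speech_styles + 0.01 * rng.normal(size=(8, 16))

aligned_loss = contrastive_style_loss(text_styles, speech_styles)
mismatched_loss = contrastive_style_loss(text_styles[::-1], speech_styles)
```

In this toy setup the correctly paired batch yields a lower loss than the batch with reversed pairing, which is the behaviour a text-aware style space supervised by a speech style space would rely on.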

Topics

Speech & Audio > Synthesis > Text-to-Speech

Keywords

contrastive learning, expressive speech

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024