INTERSPEECH 2024

Text-aware and Context-aware Expressive Audiobook Speech Synthesis

Abstract

Recent advances in text-to-speech (TTS) have significantly improved the expressiveness of synthetic speech. However, a major challenge remains in generating speech that captures the diverse styles exhibited by professional narrators in audiobooks without relying on manual labels or reference speech. To address this, we propose a text-aware and context-aware (TACA) style modeling approach for expressive audiobook speech synthesis. We first establish a text-aware style space that covers diverse styles via contrastive learning, supervised by the speech-style space. Meanwhile, we adopt a context encoder to incorporate cross-sentence information together with the style embedding obtained from text. Finally, we integrate the context encoder into two typical TTS models: a VITS-based TTS model and a language model-based TTS model. Experimental results show that the proposed approach effectively captures diverse styles and coherent prosody, and thus improves naturalness and expressiveness in audiobook speech synthesis.
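The contrastive step mentioned above can be illustrated with a minimal sketch. The abstract does not state which contrastive objective is used, so the InfoNCE-style loss below is an assumption, as are the function names, the temperature value, and the toy embeddings. Each text-derived style embedding is pulled toward the speech-derived style embedding of the same utterance and pushed away from those of other utterances in the batch.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity_matrix(text_embs, speech_embs):
    """Entry [i][j] is the cosine similarity of text i and speech j."""
    text_n = [l2_normalize(v) for v in text_embs]
    speech_n = [l2_normalize(v) for v in speech_embs]
    return [[sum(a * b for a, b in zip(t, s)) for s in speech_n] for t in text_n]


def info_nce_loss(text_embs, speech_embs, temperature=0.1):
    """Hypothetical text-to-speech InfoNCE loss (assumed, not from the paper).

    Row i of text_embs and row i of speech_embs are treated as a positive
    pair; every other speech embedding in the batch is a negative.
    """
    sims = cosine_similarity_matrix(text_embs, speech_embs)
    losses = []
    for i, row in enumerate(sims):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over the row.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Cross-entropy with the matching index as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two utterances with 2-dimensional style embeddings.
text_style = [[1.0, 0.0], [0.0, 1.0]]
speech_style_aligned = [[1.0, 0.0], [0.0, 1.0]]
speech_style_swapped = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = info_nce_loss(text_style, speech_style_aligned)
swapped_loss = info_nce_loss(text_style, speech_style_swapped)
```

In this toy setup the loss is small when text and speech style embeddings agree and large when they are mismatched, which is the behavior a text-aware style space supervised by a speech-style space would rely on. A real system would compute the embeddings with trained encoders and backpropagate through the loss.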
