ViT-TTS: Visual Text-to-Speech with Scalable Diffusion Transformer

Huadai Liu; Rongjie Huang; Xuan Lin; Wenqiang Xu; Maozong Zheng; Hong Chen; Jinzheng He; Zhou Zhao

EMNLP 2023

Abstract

Text-to-speech (TTS) has undergone remarkable improvements in performance, particularly with the advent of Denoising Diffusion Probabilistic Models (DDPMs). However, the perceived quality of audio depends not solely on its content, pitch, rhythm, and energy, but also on the physical environment. In this work, we propose ViT-TTS, the first visual TTS model with scalable diffusion transformers. ViT-TTS complements the phoneme sequence with visual information to generate audio with high perceived quality, opening up new avenues for practical applications of AR and VR that allow a more immersive and realistic audio experience. To mitigate the data scarcity in learning visual acoustic information, we 1) introduce a self-supervised learning framework to enhance both the visual-text encoder and the denoiser decoder; and 2) leverage a diffusion transformer that scales in parameters and capacity to learn visual scene information. Experimental results demonstrate that ViT-TTS achieves new state-of-the-art results, outperforming cascaded systems and other baselines regardless of the visibility of the scene. With low-resource data (1h, 2h, 5h), ViT-TTS achieves results comparable to those of rich-resource baselines.
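For readers unfamiliar with DDPMs, the sketch below illustrates the generic forward (noising) process that such models learn to reverse. It is not taken from ViT-TTS: the linear schedule, its endpoint values, the step count, and all function names are assumptions chosen for illustration, and the abstract does not specify the schedule or parameterization the authors use. In a visual TTS setting, a denoiser would be trained to undo this corruption on acoustic features while conditioned on text and visual information.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I).

    Returns the noised vector and the Gaussian noise that produced it.
    """
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise


# Toy usage: a 4-dimensional stand-in for one acoustic feature frame.
rng = random.Random(0)
betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
x0 = [0.5, -1.0, 0.25, 2.0]
xt, noise = q_sample(x0, 500, alpha_bars, rng)
```

As the step index grows, `alpha_bars[t]` shrinks toward zero, so `xt` carries less of `x0` and more noise; a denoising network is trained on pairs like `(xt, noise)` to recover the clean signal.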

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Learning Types > Self-Supervised Learning; Deep Learning > Models > Diffusion Models; Speech & Audio > Synthesis > Text-to-Speech; Deep Learning > Learning Types > Self-Supervised Learning

Keywords

self-supervised learning; speech synthesis; diffusion transformer; denoising diffusion probabilistic model; visual text-to-speech; visual acoustic modeling; visual acoustic information
