Comparing normalizing flows and diffusion models for prosody and acoustic modelling in text-to-speech

Guangyan Zhang; Thomas Merritt; Sam Ribeiro; Biel Tura-Vecino; Kayoko Yanagisawa; Kamil Pokora; Abdelhamid Ezzerg; Sebastian Cygert; Ammar Abbas; Piotr Biliński; Roberto Barra-Chicote; Daniel Korzekwa; Jaime Lorenzo-Trueba

INTERSPEECH 2023

Abstract

Neural text-to-speech systems are often optimized with L1/L2 losses, which make strong assumptions about the distribution of the target data. To relax those assumptions, Normalizing Flows and Diffusion Probabilistic Models were recently proposed as alternatives. In this paper, we compare traditional L1/L2-based approaches to diffusion and flow-based approaches for the tasks of prosody and mel-spectrogram prediction in text-to-speech synthesis. We use a prosody model to generate log-f0 and duration features, which are used to condition an acoustic model that generates mel-spectrograms. Experimental results demonstrate that the flow-based model achieves the best performance for spectrogram prediction, improving over equivalent diffusion and L1 models. Meanwhile, both diffusion and flow-based prosody predictors yield significant improvements over a typical L2-trained prosody model.
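The distributional assumptions behind L1/L2 losses can be made concrete with a small sketch. It is an illustration only and is not taken from the paper: it assumes the standard reading in which an L2 loss matches the negative log-likelihood of a Gaussian with fixed variance and an L1 loss matches that of a Laplace distribution with fixed scale. All function names and numbers below are hypothetical.

```python
import math

def l2_loss(target, pred):
    """Sum of squared errors."""
    return sum((t - p) ** 2 for t, p in zip(target, pred))

def l1_loss(target, pred):
    """Sum of absolute errors."""
    return sum(abs(t - p) for t, p in zip(target, pred))

def gaussian_nll(target, pred, sigma=1.0):
    """Negative log-likelihood under N(pred, sigma^2), sigma fixed (assumption)."""
    n = len(target)
    const = 0.5 * n * math.log(2 * math.pi * sigma ** 2)
    return const + l2_loss(target, pred) / (2 * sigma ** 2)

def laplace_nll(target, pred, scale=1.0):
    """Negative log-likelihood under Laplace(pred, scale), scale fixed (assumption)."""
    n = len(target)
    return n * math.log(2 * scale) + l1_loss(target, pred) / scale

# Hypothetical targets and two candidate predictions.
y = [1.0, 2.0, 4.0]
pred_a = [1.5, 2.0, 3.0]
pred_b = [0.0, 2.5, 4.5]

# The constants cancel, so ranking predictions by L2 (or L1) loss is the
# same as ranking them by Gaussian (or Laplace) likelihood.
gauss_gap = gaussian_nll(y, pred_a) - gaussian_nll(y, pred_b)
l2_gap = 0.5 * (l2_loss(y, pred_a) - l2_loss(y, pred_b))
laplace_gap = laplace_nll(y, pred_a) - laplace_nll(y, pred_b)
l1_gap = l1_loss(y, pred_a) - l1_loss(y, pred_b)
```

Under this reading, a model trained with L1 or L2 is fitted as if every target came from a single-peaked distribution of fixed shape around its prediction. Flow and diffusion models instead learn the shape of the target distribution.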

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Models > Diffusion Models
Deep Learning > Models > Generative Models
Speech & Audio > Synthesis > Text-to-Speech

Keywords

acoustic modeling; diffusion model; normalizing flow; prosody modeling; mel-spectrogram prediction
