Visualization and Interpretation of Latent Spaces for Controlling Expressive Speech Synthesis Through Audio Analysis

Noé Tits; Fengna Wang; Kevin El Haddad; Vincent Pagel; Thierry Dutoit

INTERSPEECH 2019

Abstract

The field of text-to-speech has improved greatly in recent years thanks to deep learning techniques, and producing realistic speech is now possible. As a consequence, research on controlling expressiveness, which allows speech to be generated in different styles or manners, has attracted increasing attention lately. Systems able to control style have been developed and show impressive results. However, the control parameters often consist of latent variables and remain complex to interpret. In this paper, we analyze and compare different latent spaces and obtain an interpretation of their influence on expressive speech. This will make it possible to build controllable speech synthesis systems with understandable behaviour.

Topics

Artificial Intelligence > Core AI > Interpretability
Deep Learning > Models > Generative Models
Speech & Audio > Synthesis > Text-to-Speech

Keywords

speech synthesis, model interpretability, audio analysis, latent space, expressive speech synthesis, expressiveness control
