A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling

Xiaoyu Bie; Laurent Girin; Simon Leglaive; Thomas Hueber; Xavier Alameda-Pineda

INTERSPEECH 2021

Abstract

The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers have presented different extensions of the VAE to process sequential data. These extensions model not only the latent space but also the temporal dependencies within a sequence of data vectors and corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In this paper, we present the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.

Topics

Deep Learning > Architectures > Autoencoders
Deep Learning > Models > Generative Models
Deep Learning > Models > Variational Inference

Keywords

sequence modeling, recurrent neural network, latent space, speech modeling, speech spectrogram, dynamical variational autoencoder
