SVTS: Scalable Video-to-Speech Synthesis

Rodrigo Schoburg Carrillo de Mira; Alexandros Haliassos; Stavros Petridis; Bjorn W. Schuller; Maja Pantic

INTERSPEECH 2022

Abstract

Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Generation > Video Generation
Natural Language Processing > Generation > Text Generation
Speech & Audio > Synthesis > Text-to-Speech
Deep Learning > Learning Types > Self-Supervised Learning

Keywords

self-supervised learning; neural vocoder; lip reading; video-to-speech synthesis; mel-frequency spectrogram; spectrogram prediction
