TTS Skins: Speaker Conversion via ASR

Adam Polyak; Lior Wolf; Yaniv Taigman

INTERSPEECH 2020

Abstract

We present a fully convolutional wav-to-wav network for converting between speakers’ voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate multi-voice TTS in those voices, by converting the voice of a TTS robot.
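The data flow described in the abstract can be sketched as follows: a frozen encoder, a per-speaker conditioning vector, and a decoder that generates each sample from the previous one. This is a toy illustration only, not the authors' network. The paper's model is fully convolutional and trained, while everything here (the names `encode`, `decode`, `convert`, `make_speaker_table`, the frame size, the scalar "embeddings", and the fixed weights) is a hypothetical placeholder chosen to keep the example self-contained.

```python
import math
import random

FRAME = 4  # samples per encoder frame (toy value, an assumption)


def encode(wav):
    """Stand-in for the frozen, pre-trained ASR encoder: one feature per frame.

    The feature here is just the frame's mean absolute amplitude; a real ASR
    encoder would produce speaker-independent content features instead.
    """
    frames = [wav[i:i + FRAME] for i in range(0, len(wav) - FRAME + 1, FRAME)]
    return [sum(abs(x) for x in f) / FRAME for f in frames]


def make_speaker_table(num_speakers, seed=0):
    """Hypothetical speaker embeddings (scalars here for brevity)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(num_speakers)]


def decode(features, speaker_emb, w_prev=0.5, w_feat=1.0, w_spk=0.3):
    """Toy autoregressive decoder with fixed, untrained weights.

    Each output sample depends on the previously generated sample, the
    upsampled encoder feature, and the target speaker embedding.
    """
    out, prev = [], 0.0
    for feat in features:
        for _ in range(FRAME):  # upsample: FRAME output samples per feature
            prev = math.tanh(w_prev * prev + w_feat * feat + w_spk * speaker_emb)
            out.append(prev)
    return out


def convert(wav, target_speaker, table):
    """Encode the source audio, then decode it conditioned on another speaker."""
    return decode(encode(wav), table[target_speaker])


source = [math.sin(0.3 * n) for n in range(16)]
table = make_speaker_table(num_speakers=2)
as_speaker_0 = convert(source, 0, table)
as_speaker_1 = convert(source, 1, table)
```

The same encoded features drive both outputs; only the speaker conditioning changes. That is the sense in which a single TTS voice could be re-rendered in several target voices.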

Topics

Deep Learning > Architectures > Autoencoders
Deep Learning > Architectures > Neural Networks
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Synthesis > Text-to-Speech
Speech & Audio > Analysis > Speaker Verification

Keywords

automatic speech recognition, autoregressive model, autoregressive decoding, encoder-decoder architecture, multi-speaker synthesis, speaker conversion, waveform generation
