Face2Speech: Towards Multi-Speaker Text-to-Speech Synthesis Using an Embedding Vector Predicted from a Face Image

Shunsuke Goto; Kotaro Onishi; Yuki Saito; Kentaro Tachibana; Koichiro Mori

INTERSPEECH 2020

Abstract

People can often imagine a speaker's voice characteristics from the speaker's appearance, especially the face. In this paper, we propose Face2Speech, which generates speech whose characteristics are predicted from a face image. The framework consists of three separately trained modules: a speech encoder, a multi-speaker text-to-speech (TTS) model, and a face encoder. The speech encoder outputs an embedding vector that distinguishes a speaker from other speakers. The multi-speaker TTS model synthesizes speech conditioned on this embedding vector, and the face encoder outputs the embedding vector of a speaker from that speaker's face image. Experimental results of matching and naturalness tests demonstrate that synthetic speech generated with the face-derived embedding vector is comparable to speech generated with the speech-derived embedding vector.
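The data flow described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not specify network architectures, feature types, embedding size, or training objectives, so every name here (`make_linear_encoder`, `synthesize`, `EMBED_DIM`, the feature dimensions) is hypothetical, the encoders are untrained random linear projections, and the TTS step is a placeholder that only records its conditioning. The sketch shows one point only: at synthesis time the face-derived embedding takes the place of the speech-derived embedding as the speaker condition.

```python
import math
import random

EMBED_DIM = 8  # hypothetical embedding size, not taken from the paper


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def make_linear_encoder(in_dim, out_dim, seed):
    """Return an untrained stand-in encoder: a fixed random linear projection."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(features):
        assert len(features) == in_dim
        projected = [sum(w * x for w, x in zip(row, features)) for row in weights]
        return l2_normalize(projected)

    return encode


def cosine_similarity(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def synthesize(text, speaker_embedding):
    """Placeholder for the multi-speaker TTS: records what it is conditioned on."""
    return {"text": text, "speaker_embedding": list(speaker_embedding)}


# Stand-ins for two of the three separately trained modules (nothing is trained here).
speech_encoder = make_linear_encoder(in_dim=16, out_dim=EMBED_DIM, seed=0)
face_encoder = make_linear_encoder(in_dim=32, out_dim=EMBED_DIM, seed=1)

# Dummy inputs standing in for speech features and face-image features.
rng = random.Random(42)
speech_features = [rng.random() for _ in range(16)]
face_features = [rng.random() for _ in range(32)]

speech_embedding = speech_encoder(speech_features)
face_embedding = face_encoder(face_features)

# At synthesis time, the face-derived embedding is used as the speaker condition.
utterance = synthesize("Hello.", face_embedding)

# One simple way to compare the two embeddings of the same speaker.
similarity = cosine_similarity(face_embedding, speech_embedding)
```

Because both stand-in encoders map into the same embedding space, the TTS placeholder accepts either embedding without modification, which is the property the framework relies on.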

Topics

Computer Vision > Analysis > Face Recognition
Speech & Audio > Synthesis > Text-to-Speech
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

face recognition, speaker embedding, encoder-decoder architecture, text-to-speech synthesis, multi-speaker synthesis, face encoder, voice characteristic

Related papers

Memory Controlled Sequential Self Attention for Sound Recognition 2020

Dual Attention in Time and Frequency Domain for Voice Activity Detection 2020

Automatic Prediction of Speech Intelligibility Based on X-Vectors in the Context of Head and Neck Cancer 2020

A Noise Robust Technique for Detecting Vowels in Speech Signals 2020

Joint Detection of Sentence Stress and Phrase Boundary for Prosody 2020