Direct Speech-to-Speech Translation with a Sequence-to-Sequence Model

Ye Jia; Ron J. Weiss; Fadi Biadsy; Wolfgang Macherey; Melvin Johnson; Zhifeng Chen; Yonghui Wu

INTERSPEECH 2019

Abstract

We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation. The network is trained end-to-end, learning to map speech spectrograms into target spectrograms in another language, corresponding to the translated content (in a different canonical voice). We further demonstrate the ability to synthesize translated speech using the voice of the source speaker. We conduct experiments on two Spanish-to-English speech translation datasets, and find that the proposed model slightly underperforms a baseline cascade of a direct speech-to-text translation model and a text-to-speech synthesis model, demonstrating the feasibility of the approach on this very challenging task.

Topics

Natural Language Processing > Applications > Machine Translation

Speech & Audio > Synthesis > Text-to-Speech

Keywords

attention mechanism, voice conversion, speech synthesis, speech translation

Related papers

Using Real-Time Visual Biofeedback for Second Language Instruction 2019

VAE-Based Regularization for Deep Speaker Embedding 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition 2019

Attention-Enhanced Connectionist Temporal Classification for Discrete Speech Emotion Recognition 2019

Attentive to Individual: A Multimodal Emotion Recognition Network with Personalized Attention Profile 2019