Textless Speech-to-Speech Translation With Limited Parallel Data

Anuj Diwan; Anirudh Srinivasan; David Harwath; Eunsol Choi

EMNLP 2024

Abstract

Existing speech-to-speech translation (S2ST) models fall into two camps: they either leverage text as an intermediate step or require hundreds of hours of parallel speech data. Both approaches are incompatible with textless languages or language pairs with limited parallel data. We present PFB, a framework for training textless S2ST models that require just dozens of hours of parallel speech data. We first pretrain a model on large-scale monolingual speech data, then finetune it with a small amount of parallel speech data (20-60 hours), and lastly train with an unsupervised backtranslation objective. We train and evaluate our models for English-to-German, German-to-English, and Marathi-to-English translation on three different domains (European Parliament, Common Voice, and All India Radio) with single-speaker synthesized speech. Evaluated using the ASR-BLEU metric, our models achieve reasonable performance on all three domains, with some being within 1-2 points of our higher-resourced topline.
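The unsupervised backtranslation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes utterances are represented as lists of discrete unit IDs, and the names `make_backtranslation_pairs` and `toy_translate` are hypothetical. The idea shown is only how synthetic training pairs are built from monolingual speech: the current model translates a real utterance into the other language, and the model is then trained to map that synthetic translation back to the real utterance.

```python
from typing import Callable, List, Tuple

# Assumption: an utterance is a list of discrete speech-unit IDs.
Units = List[int]
Translate = Callable[[Units, str, str], Units]


def make_backtranslation_pairs(
    translate: Translate,
    utterances: List[Units],
    lang: str,
    other_lang: str,
) -> List[Tuple[Units, Units]]:
    """Build synthetic (input, target) pairs from monolingual utterances.

    Each real utterance x in `lang` is translated into `other_lang` by the
    current model. The resulting pair asks the model to map the synthetic
    translation back to the real utterance x.
    """
    pairs = []
    for x in utterances:
        # Synthetic translation; a real system would run this without gradients.
        y_hat = translate(x, lang, other_lang)
        # Training example: y_hat (other_lang) -> x (lang).
        pairs.append((y_hat, list(x)))
    return pairs


def toy_translate(units: Units, src: str, tgt: str) -> Units:
    # Hypothetical stand-in for a real model: shift unit IDs by a fixed offset.
    offset = 100 if (src, tgt) == ("en", "de") else -100
    return [u + offset for u in units]


english = [[1, 2, 3], [4, 5]]
pairs = make_backtranslation_pairs(toy_translate, english, "en", "de")
print(pairs)
```

In a real system, `translate` would be the finetuned S2ST model, and the pairs would be regenerated as the model improves, in both translation directions.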

Topics

Machine Learning > Learning Types > Unsupervised Learning
Natural Language Processing > Applications > Machine Translation
Speech & Audio > Recognition > Speech Recognition
Speech & Audio > Synthesis > Speech Enhancement
Natural Language Processing > Generation > Machine Translation
Deep Learning > Techniques > Self-Supervised Learning
Deep Learning > Learning Types > Transfer Learning
Artificial Intelligence > Core AI > Speech Processing
Speech & Audio > Recognition > Speech Translation

Keywords

unsupervised learning, self-supervised learning, speech synthesis, low-resource language, speech-to-speech translation, parallel data, textless translation, monolingual speech
