ZeroST: Zero-Shot Speech Translation

Sameer Khurana; Chiori Hori; Antoine Laurent; Gordon Wichern; Jonathan Le Roux

INTERSPEECH 2024

Abstract

Our work introduces the Zero-Shot Speech Translation (ZeroST) framework, which leverages the complementary strengths of pre-trained multilingual speech and text foundation models. Inspired by recent advances in multimodal foundation models, ZeroST uses a Query Transformer (Q-Former) to connect a speech foundation model, such as Whisper or Massively Multilingual Speech (MMS), with a text translation model such as No-Language-Left-Behind (NLLB). Our proposed learning framework enables the model to perform speech-to-text translation in a zero-shot manner, bypassing the need for explicit supervision from expensive-to-collect speech-text translation pairs during training. Our extensive experiments, notably on the Europarl-ST benchmark, demonstrate that ZeroST achieves results comparable to those of a strong cascaded translation system and significantly outperforms baseline models. This promising approach paves the way for future research in zero-shot speech translation.
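The abstract does not spell out the Q-Former's internals, so the following is only a minimal sketch of the general idea behind query-based bridging: a small, fixed set of query vectors cross-attends over a variable-length sequence of speech encoder frames and returns a fixed-length sequence that a text model could consume. Everything here is an illustrative assumption rather than the authors' implementation: the function names, the sizes, and the random vectors standing in for learned queries and for Whisper or MMS outputs. Learned projections, multiple heads, and feed-forward layers are left out.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, frames):
    """Single-head cross-attention: each query attends over all speech frames.

    queries: list of num_queries vectors (learned parameters in a real model)
    frames:  list of num_frames vectors (speech encoder outputs)
    Returns num_queries vectors, whatever the number of frames.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [
            sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in frames
        ]
        weights = softmax(scores)
        # Weighted average of the frames (frames serve as keys and values here).
        mixed = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
        outputs.append(mixed)
    return outputs


def random_vectors(rng, count, dim):
    """Random stand-ins for learned queries or encoder outputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM, NUM_QUERIES = 8, 4  # illustrative sizes, not values from the paper
queries = random_vectors(rng, NUM_QUERIES, DIM)

short_utterance = random_vectors(rng, 50, DIM)   # stand-in for 50 encoder frames
long_utterance = random_vectors(rng, 300, DIM)   # stand-in for 300 encoder frames

short_out = cross_attend(queries, short_utterance)
long_out = cross_attend(queries, long_utterance)
print(len(short_out), len(long_out))  # prints "4 4"
```

Both utterances come out as four vectors, which is the property that makes this kind of module a convenient bridge: the downstream text model sees a short, fixed-length input regardless of how long the audio is.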

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Natural Language Processing > Generation > Machine Translation
Machine Learning > Learning Paradigms > Zero-Shot Learning
Deep Learning > Models > Foundation Models

Keywords

zero-shot learning, foundation model, pre-trained model, zero-shot translation, multilingual speech, speech translation, query transformer
