Large-Scale Self- and Semi-Supervised Learning for Speech Translation

Changhan Wang; Anne Wu; Juan Pino; Alexei Baevski; Michael Auli; Alexis Conneau

INTERSPEECH 2021

Abstract

In this paper, we improve speech translation (ST) by leveraging large quantities of unlabeled speech and text data in different and complementary ways. We explore both pretraining and self-training using the large Libri-Light speech audio corpus, and language modeling with CommonCrawl. Our experiments improve over the previous state of the art by 2.8 BLEU on average across all four considered CoVoST 2 language pairs via a simple recipe that combines wav2vec 2.0 pretraining, a single iteration of self-training, and decoding with a language model. Unlike existing work, our approach uses no supervision other than ST data. Code and models are publicly released.
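The recipe in the abstract can be illustrated with a minimal, library-free sketch. It is not the authors' released code: `self_train_once`, `pick_with_lm`, the `train` callable, and the interpolation `weight` are hypothetical names introduced here. It also assumes the language model is combined with the ST model by a weighted sum of log-probabilities, which is one common way to decode with an LM and may differ from the paper's exact setup.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, str]        # (utterance id, target-language text)
Model = Callable[[str], str]     # maps an utterance id to a translation


def self_train_once(
    train: Callable[[Sequence[Example]], Model],
    labeled: Sequence[Example],
    unlabeled: Sequence[str],
) -> Model:
    """One iteration of self-training (pseudo-labeling).

    `train` is a placeholder for any training routine, for example
    fine-tuning a pretrained speech encoder on ST data.
    """
    teacher = train(labeled)                          # 1. train on labeled ST data
    pseudo = [(utt, teacher(utt)) for utt in unlabeled]  # 2. label unlabeled speech
    return train(list(labeled) + pseudo)              # 3. retrain on the union


def pick_with_lm(
    hypotheses: List[Tuple[str, float]],
    lm_logprob: Callable[[str], float],
    weight: float,
) -> str:
    """Select the hypothesis maximizing ST log-prob + weight * LM log-prob."""
    best_text, _ = max(
        hypotheses, key=lambda h: h[1] + weight * lm_logprob(h[0])
    )
    return best_text
```

In this sketch the teacher's outputs on unlabeled audio are simply treated as additional training pairs for a second training run, and the language model only influences which candidate is chosen at decoding time.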

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Semi-Supervised Learning
Natural Language Processing > Applications > Machine Translation
Speech & Audio > Recognition > Speech Recognition
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Semi-Supervised Learning

Keywords

semi-supervised learning, self-supervised learning, speech recognition, language modeling, speech translation, wav2vec 2.0
