Listen, Understand and Translate: Triple Supervision Decouples End-to-end Speech-to-text Translation

Qianqian Dong; Rong Ye; Mingxuan Wang; Hao Zhou; Shuang Xu; Bo Xu; Lei Li

AAAI 2021

Abstract

An end-to-end speech-to-text translation (ST) system takes audio in a source language and outputs text in a target language. Existing methods are limited by the amount of parallel data available. Can we build a system that fully utilizes the signals in a parallel ST corpus? We are inspired by the human understanding system, which is composed of auditory perception and cognitive processing. In this paper, we propose Listen-Understand-Translate (LUT), a unified framework with triple supervision signals that decouples the end-to-end speech-to-text translation task. LUT guides the acoustic encoder to extract as much information as possible from the auditory input. In addition, LUT uses a pre-trained BERT model to push the upper encoder to produce as much semantic information as possible, without extra data. We perform experiments on a diverse set of speech translation benchmarks, including Librispeech English-French, IWSLT English-German and TED English-Chinese. Our results demonstrate that LUT achieves state-of-the-art performance, outperforming previous methods. The code is available at https://github.com/dqqcasia/st.

Topics

Natural Language Processing > Applications > Machine Translation

Keywords

neural machine translation; end-to-end learning; speech-to-text translation; acoustic encoding
