INTERSPEECH 2023

Knowledge Distillation on Joint Task End-to-End Speech Translation

Abstract

An End-to-End Speech Translation (E2E-ST) model takes input audio in one language and directly produces output text in another language. The model must learn both speech-to-text modality conversion and translation, and learning this joint task effectively demands a large architecture. Yet, to the best of our knowledge, we are the first to optimize the compression of E2E-ST models. In this work, we explore knowledge distillation for a cross-modality, joint-task E2E-ST system along three dimensions: 1) the student architecture and weight initialization scheme, 2) the importance of the loss terms associated with different tasks and data modalities, and 3) a knowledge distillation training scheme customized for the multi-task, multi-module model. Compared with the full-size model, the encoder and decoder of our compressed model are 50% smaller, while the model retains 90% of the performance on the speech translation task and more than 95% on the machine translation task on the MuST-C en→de test set.
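
As an illustration of the second dimension, the following is a minimal sketch of how a distillation objective with per-task weights could be written. It is not the loss used in this work: the abstract does not specify the loss terms, so the standard soft-target formulation (cross-entropy on the reference token plus a temperature-scaled KL term against the teacher) is assumed here, and the names `distillation_loss`, `joint_task_loss`, `alpha`, `temperature`, `w_st` and `w_mt` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the reference token under the student."""
    return -math.log(softmax(logits)[target_index])


def distillation_loss(student_logits, teacher_logits, target_index,
                      alpha=0.5, temperature=2.0):
    """Assumed soft-target objective: (1 - alpha) * CE + alpha * T^2 * KL."""
    hard = cross_entropy(student_logits, target_index)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = (temperature ** 2) * kl_divergence(p_teacher, p_student)
    return (1.0 - alpha) * hard + alpha * soft


def joint_task_loss(st_loss, mt_loss, w_st=1.0, w_mt=1.0):
    """Hypothetical weighting of the speech (ST) and text (MT) loss terms."""
    return w_st * st_loss + w_mt * mt_loss


# Toy single-token example over a three-word vocabulary.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
st_term = distillation_loss(student, teacher, target_index=0)
mt_term = distillation_loss(student, teacher, target_index=0, alpha=0.3)
total = joint_task_loss(st_term, mt_term, w_st=1.0, w_mt=0.5)
```

Varying `w_st` and `w_mt` in such a setup is one way to probe how much each task's loss term contributes to the student; the actual terms and weights studied are those reported in the paper body.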
