CCSRD: Content-Centric Speech Representation Disentanglement Learning for End-to-End Speech Translation

Xiaohu Zhao; Haoran Sun; Yikun Lei; Shaolin Zhu; Deyi Xiong

EMNLP 2023

Abstract

Deep neural networks have demonstrated their capacity to extract features from speech inputs. However, these features may include non-linguistic speech factors, such as timbre and speaker identity, that are not directly related to translation. In this paper, we propose CCSRD, a content-centric speech representation disentanglement learning framework for speech translation, which decomposes speech representations into content representations and non-linguistic representations via representation disentanglement learning. CCSRD consists of a content encoder that encodes linguistic content information from the speech input, a non-content encoder that models non-linguistic speech features, and a disentanglement module that learns disentangled representations with a cyclic reconstructor, a feature reconstructor, and a speaker classifier trained jointly via multi-task learning. Experiments on the MuST-C benchmark dataset demonstrate that CCSRD achieves an average improvement of +0.9 BLEU over the baseline in two settings across five translation directions, outperforming state-of-the-art end-to-end speech translation models and cascaded models.
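The abstract states that the cyclic reconstructor, feature reconstructor, and speaker classifier are trained together with the translation model in a multi-task fashion, but it does not give the loss functions or their weights. The following is a minimal sketch of one common way to combine such objectives, a weighted sum, and is not taken from the paper. The names `LossWeights`, `mean_squared_error`, `cross_entropy`, and `multitask_loss` are hypothetical, the default weights of 1.0 are placeholders, and the mean-squared-error and cross-entropy helpers are stand-ins for whatever reconstruction and classification losses the authors actually use.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical weights; the abstract does not state any values."""
    translation: float = 1.0
    cyclic_reconstruction: float = 1.0
    feature_reconstruction: float = 1.0
    speaker_classification: float = 1.0


def mean_squared_error(predicted, target):
    """Stand-in reconstruction loss between two equal-length vectors."""
    if len(predicted) != len(target):
        raise ValueError("vectors must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def cross_entropy(logits, label):
    """Stand-in classification loss: negative log-softmax of the true class."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[label]


def multitask_loss(translation, cyclic, feature, speaker, weights=LossWeights()):
    """Weighted sum of the four task losses (an assumed combination rule)."""
    return (weights.translation * translation
            + weights.cyclic_reconstruction * cyclic
            + weights.feature_reconstruction * feature
            + weights.speaker_classification * speaker)


if __name__ == "__main__":
    # Toy scalars and vectors, chosen only to exercise the functions.
    feature = mean_squared_error([0.5, 1.0], [0.0, 1.0])
    speaker = cross_entropy([0.0, 0.0], 0)
    print(multitask_loss(translation=2.0, cyclic=0.5,
                         feature=feature, speaker=speaker))
```

In a real system each scalar would come from a differentiable model component, and the weights would be tuned as hyperparameters; the sketch only illustrates how separate task losses can be folded into a single training objective.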

Authors

Xiaohu Zhao, Haoran Sun, Yikun Lei, Shaolin Zhu, Deyi Xiong

Topics

Deep Learning > Architectures > Transformers
Speech & Audio > Synthesis
Speech & Audio > Synthesis > Speech Enhancement
Machine Learning > Learning Types > Multi-Task Learning
Deep Learning > Learning Types > Multi-Task Learning
Speech & Audio > Recognition > Speech Translation

Keywords

multi-task learning, representation disentanglement, speech translation, end-to-end model, speech representation, end-to-end speech, speaker disentanglement, end-to-end speech translation, content representation, content encoder, non-linguistic feature
