CTC-based Non-autoregressive Textless Speech-to-Speech Translation

Qingkai Fang; Zhengrui Ma; Yan Zhou; Min Zhang; Yang Feng

ACL 2024

Abstract

Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often suffers from slow decoding because speech sequences are long. Recently, some research has turned to non-autoregressive (NAR) models to speed up decoding, yet their translation quality typically lags significantly behind that of autoregressive (AR) models. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model while delivering up to a 26.81× decoding speedup.
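As a rough illustration of why CTC-based decoding can run in a single parallel pass, the following is a minimal sketch of generic greedy CTC decoding: pick the best label for every frame independently, merge consecutive repeats, and drop blanks. It is not the authors' implementation. The function name `ctc_greedy_decode`, the blank index `0`, and the toy scores are assumptions made for this example, and the integer labels stand in for whatever discrete targets a model predicts.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank symbol (illustrative choice)


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]], blank: int = BLANK
) -> List[int]:
    """Collapse per-frame scores into a label sequence with greedy CTC decoding."""
    # Each frame is decided independently of the others, so in a real model
    # this argmax can be computed for all frames at once.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # Merge runs of identical labels, then remove blanks.
    return [label for label, _ in groupby(best_path) if label != blank]


# Toy scores for 6 frames over 3 labels (label 0 is the blank).
toy_scores = [
    [0.10, 0.80, 0.10],  # best label: 1
    [0.20, 0.70, 0.10],  # best label: 1 (repeat, merged with previous frame)
    [0.90, 0.05, 0.05],  # best label: 0 (blank, separates the two 1s)
    [0.10, 0.60, 0.30],  # best label: 1
    [0.10, 0.20, 0.70],  # best label: 2
    [0.10, 0.10, 0.80],  # best label: 2 (repeat, merged)
]

decoded = ctc_greedy_decode(toy_scores)
print(decoded)  # [1, 1, 2]
```

The blank between the second and fourth frames is what lets the output contain two consecutive 1s. Without it, the repeated labels would be merged into one.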

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Application Areas > Knowledge Distillation
Natural Language Processing > Applications > Machine Translation
Speech & Audio > Synthesis > Text-to-Speech
Deep Learning > Models > Transformers
Speech & Audio > Synthesis > Speech Synthesis
Artificial Intelligence > Core AI > Speech Processing

Keywords

knowledge distillation; connectionist temporal classification; speech-to-speech translation; non-autoregressive translation; non-autoregressive model; decoding speedup; textless translation; CTC decoding; glancing training
