End-to-End Multilingual Automatic Dubbing via Duration-based Translation with Large Language Models

Hyun-Sik Won; DongJin Jeong; Hyunkyu Choi; Jinwon Kim

EMNLP 2025

Abstract

Automatic dubbing (AD) aims to replace the original speech in a video with translated speech that maintains precise temporal alignment (isochrony). Achieving natural synchronization between dubbed speech and visual content remains challenging because speech durations vary across languages. To address this, we propose an end-to-end AD framework that leverages large language models (LLMs) to integrate translation and timing control. At the core of our framework lies Duration-based Translation (DT), a method that dynamically predicts the optimal phoneme count from the source speech duration and iteratively adjusts the translation length to match it. Our experiments on English, Spanish, and Korean language pairs show that our approach substantially improves speech overlap, achieving up to 24% relative gains over translations without explicit length constraints, while maintaining competitive translation quality as measured by COMET scores. Furthermore, our framework does not require language-specific tuning, which makes it practical for multilingual dubbing scenarios.
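To make the DT idea concrete, the following is a minimal sketch of a predict-then-adjust loop, not the authors' implementation. The abstract does not say how the phoneme count is predicted or how the LLM is prompted, so everything here is an assumption made for illustration: the linear `phonemes_per_second` rate, the `tolerance` and `max_iters` parameters, the `translate(source, previous, hint)` callback signature, and the toy stand-ins for the translator and the phoneme counter.

```python
from typing import Callable, Optional

# Hypothetical callback: (source_text, previous_translation, hint) -> translation.
# In a real system this would wrap an LLM call; the hint is "shorter" or "longer".
Translator = Callable[[str, Optional[str], Optional[str]], str]


def target_phoneme_count(duration_s: float, phonemes_per_second: float = 12.0) -> int:
    """Assumed linear mapping from source speech duration to a phoneme budget."""
    return max(1, round(duration_s * phonemes_per_second))


def duration_based_translate(
    source_text: str,
    duration_s: float,
    translate: Translator,
    count_phonemes: Callable[[str], int],
    phonemes_per_second: float = 12.0,
    tolerance: float = 0.1,
    max_iters: int = 5,
) -> str:
    """Re-translate until the phoneme count is within tolerance of the target."""
    target = target_phoneme_count(duration_s, phonemes_per_second)
    current = translate(source_text, None, None)  # unconstrained first pass
    best, best_err = current, abs(count_phonemes(current) - target)
    for _ in range(max_iters):
        if best_err <= tolerance * target:
            break  # close enough to the duration budget
        hint = "shorter" if count_phonemes(current) > target else "longer"
        current = translate(source_text, current, hint)
        err = abs(count_phonemes(current) - target)
        if err < best_err:
            best, best_err = current, err  # keep the closest candidate seen
    return best


def toy_count_phonemes(text: str) -> int:
    """Toy proxy: one phoneme per letter (a real system would use a phonemizer)."""
    return sum(ch.isalpha() for ch in text)


def toy_translate(source_text: str, previous: Optional[str], hint: Optional[str]) -> str:
    """Toy stand-in for an LLM: echoes the source, then trims or pads by one word."""
    if previous is None:
        return source_text
    words = previous.split()
    if hint == "shorter" and len(words) > 1:
        return " ".join(words[:-1])
    if hint == "longer":
        return previous + " indeed"
    return previous


if __name__ == "__main__":
    out = duration_based_translate(
        "the quick brown fox jumps over the lazy dog",
        duration_s=2.0,
        translate=toy_translate,
        count_phonemes=toy_count_phonemes,
    )
    print(out, toy_count_phonemes(out))
```

The loop keeps the closest candidate seen so far rather than the last one, so a re-translation that overshoots in the other direction does not make the result worse. Whether the actual framework does this is not stated in the abstract.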

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Applications > Machine Translation
Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Synthesis > Text-to-Speech
Speech & Audio > Synthesis > Speech Enhancement
Natural Language Processing > Generation > Machine Translation
Deep Learning > Models > Large Language Models

Keywords

machine translation, multilingual nlp, speech synthesis, multilingual processing, temporal alignment, speech translation, automatic dubbing, duration prediction, large language model, duration control, speech duration
