VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models

Kim Sung-Bin; Jeongsoo Choi; Puyuan Peng; Joon Son Chung; Tae-Hyun Oh; David Harwath

ICCV 2025

Abstract

We present VoiceCraft-Dub, a novel approach for automated video dubbing that synthesizes high-quality speech from text and facial cues. This task has broad applications in filmmaking, multimedia creation, and assisting voice-impaired individuals. Building on the success of Neural Codec Language Models (NCLMs) for speech synthesis, our method extends their capabilities by incorporating video features, ensuring that synthesized speech is time-synchronized and expressively aligned with facial movements while preserving natural prosody. To inject visual cues, we design adapters to align facial features with the NCLM token space and introduce audio-visual fusion layers to merge audio-visual information within the NCLM framework. Additionally, we curate CelebV-Dub, a new dataset of expressive, real-world videos specifically designed for automated video dubbing. Extensive experiments show that our model achieves high-quality, intelligible, and natural speech synthesis with accurate lip synchronization, outperforming existing methods in human perception and performing favorably in objective evaluations. We also adapt VoiceCraft-Dub for the video-to-speech task, demonstrating its versatility for various applications.
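
The adapter-and-fusion idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the nearest-neighbour resampling step, the concatenate-then-project fusion, and all dimensions and weights below are assumptions chosen for illustration, since the abstract does not specify any of them.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Apply y = W x + b, where each row of `weight` is one output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def adapt_visual(frames: List[Vector], weight: Matrix, bias: Vector) -> List[Vector]:
    """Hypothetical adapter: project each face feature into the model dimension."""
    return [linear(frame, weight, bias) for frame in frames]


def resample_to_length(seq: List[Vector], target_len: int) -> List[Vector]:
    """Nearest-neighbour resampling so visual steps line up with audio token steps
    (an assumed alignment strategy, not taken from the paper)."""
    n = len(seq)
    return [seq[min(n - 1, (t * n) // target_len)] for t in range(target_len)]


def fuse(audio: List[Vector], visual: List[Vector],
         weight: Matrix, bias: Vector) -> List[Vector]:
    """Hypothetical fusion layer: concatenate per-step audio and visual vectors,
    then project back to the model dimension."""
    if len(audio) != len(visual):
        raise ValueError("audio and visual sequences must have equal length")
    return [linear(a + v, weight, bias) for a, v in zip(audio, visual)]


# Toy data: 2 video frames with 3-dim face features, 4 audio steps, model dim 2.
face_features = [[1, 2, 3], [4, 5, 6]]
audio_embeddings = [[0.5, 0.5] for _ in range(4)]

adapter_w = [[1, 0, 0], [0, 1, 0]]        # 3-dim face feature -> 2-dim model space
adapter_b = [0, 0]
fusion_w = [[1, 0, 1, 0], [0, 1, 0, 1]]   # sums the audio and visual halves
fusion_b = [0, 0]

adapted = adapt_visual(face_features, adapter_w, adapter_b)
aligned = resample_to_length(adapted, len(audio_embeddings))
fused = fuse(audio_embeddings, aligned, fusion_w, fusion_b)
```

In a real system the projections would be learned and the fused sequence would feed the language model's layers; the sketch only shows how per-frame visual features can be brought to the same dimension and length as the audio token sequence before the two are merged.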

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Speech & Audio > Synthesis > Speech Synthesis

Keywords

speech synthesis, audio-visual fusion, lip synchronization, video dubbing, neural codec
