Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Juncheng Wang; Chao Xu; Cheng Yu; Lei Shang; Zhe Hu; Shujun WANG; Liefeng Bo

CVPR 2025

Abstract

Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Taking the view that the key is to extract essential signals from video that can precisely control mature text-to-audio diffusion models, this paper shows how to balance the completeness and complexity of the mel-spectrogram representation through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals and represent each in either quantized or continuous form, so that they can be effectively predicted from video by a purpose-built video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, together with a textual inversion design, to control the audio generation process. Our proposed Mel-QCD method achieves state-of-the-art performance across eight metrics covering quality, synchronization, and semantic consistency.
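To make the decompose-then-recompose idea concrete, here is a minimal sketch in plain Python. The abstract does not name the three signals, so this example assumes them for illustration only: a continuous per-frame mean, a continuous per-frame standard deviation, and a quantized code for the normalized spectral shape. The function names `decompose` and `recompose` and the parameters `levels` and `clip` are hypothetical and are not taken from the paper.

```python
import math

def decompose(mel, levels=8, clip=2.0):
    """Split a mel-spectrogram into three signals (an assumed decomposition).

    mel is a list of frames; each frame is a list of mel-bin values.
    Returns per-frame means (continuous), per-frame standard deviations
    (continuous), and integer codes for the normalized shape (quantized).
    """
    step = 2 * clip / levels  # width of one quantization bin
    means, stds, codes = [], [], []
    for frame in mel:
        mu = sum(frame) / len(frame)
        sigma = math.sqrt(sum((x - mu) ** 2 for x in frame) / len(frame))
        sigma = max(sigma, 1e-8)  # avoid division by zero on flat frames
        frame_codes = []
        for x in frame:
            # Normalize, clip to [-clip, clip], then bucket uniformly.
            z = max(-clip, min(clip, (x - mu) / sigma))
            frame_codes.append(min(levels - 1, int((z + clip) / step)))
        means.append(mu)
        stds.append(sigma)
        codes.append(frame_codes)
    return means, stds, codes

def recompose(means, stds, codes, levels=8, clip=2.0):
    """Rebuild an approximate mel-spectrogram from the three signals."""
    step = 2 * clip / levels
    return [
        # Map each code to its bin center, then undo the normalization.
        [mu + sigma * (-clip + (c + 0.5) * step) for c in frame_codes]
        for mu, sigma, frame_codes in zip(means, stds, codes)
    ]

mel = [[0.0, 1.0, 2.0, 3.0], [5.0, 5.0, 5.0, 5.0]]
means, stds, codes = decompose(mel)
approx = recompose(means, stds, codes)
```

In this sketch, `levels` controls the trade-off the abstract describes: more levels make the recomposed spectrogram more complete, while fewer levels give a simpler target to predict. For normalized values that fall inside the clipping range, the per-bin reconstruction error is at most half a quantization step scaled by the frame's standard deviation.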

Topics

Deep Learning > Models > Diffusion Models; Computer Vision > Generation > Video Generation; Speech & Audio > Synthesis; Speech & Audio > Synthesis > Speech Synthesis

Keywords

video generation; diffusion model; audio synthesis; mel spectrogram; video-to-audio generation; video-to-all predictor
