VMAs: Video-to-Music Generation via Semantic Alignment in Web Music Videos

Yan-Bo Lin; Yu Tian; Linjie Yang; Gedas Bertasius; Heng Wang

WACV 2025

Abstract

We present a framework for learning to generate background music from video inputs. Unlike existing works that rely on symbolic musical annotations, which are limited in quantity and diversity, our method leverages large-scale web videos accompanied by background music. This enables our model to learn to generate realistic and diverse music. To accomplish this goal, we develop a generative video-music Transformer with a novel semantic video-music alignment scheme. Our model uses a joint autoregressive and contrastive learning objective, which encourages the generation of music aligned with high-level video content. We also introduce a novel video-beat alignment scheme to match the generated music beats with the low-level motions in the video. Lastly, to capture the fine-grained visual cues in a video that are needed for realistic background music generation, we introduce a new temporal video encoder architecture, allowing us to efficiently process videos consisting of many densely sampled frames. We train our framework on our newly curated DISCO-MV dataset, consisting of 2.2M video-music samples, which is orders of magnitude larger than any prior dataset used for video music generation. Our method outperforms existing approaches on the DISCO-MV and MusicCaps datasets according to various music generation evaluation metrics, including human evaluation. Results are available at https://genjib.github.io/project_page/VMAs/index.html
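The joint autoregressive and contrastive objective mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names, the symmetric InfoNCE form of the contrastive term, the temperature of 0.07, and the scalar `weight` that combines the two terms are hypothetical and are not taken from the paper.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def autoregressive_loss(logits, targets):
    """Mean next-token cross-entropy.
    logits: T rows of V scores over a (hypothetical) music-token vocabulary.
    targets: T ground-truth token indices."""
    return -sum(log_softmax(l)[t] for l, t in zip(logits, targets)) / len(targets)

def normalize(vec):
    """Scale a vector to unit L2 norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def contrastive_loss(video_emb, music_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired clip-level embeddings.
    Row i of video_emb is assumed to be paired with row i of music_emb."""
    v = [normalize(x) for x in video_emb]
    m = [normalize(x) for x in music_emb]
    # Cosine similarities scaled by the (assumed) temperature.
    sim = [[sum(a * b for a, b in zip(vi, mj)) / temperature for mj in m] for vi in v]
    n = len(sim)
    # Video-to-music direction: each row should peak on its diagonal entry.
    v2m = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Music-to-video direction: same check over the columns.
    cols = [[sim[i][j] for i in range(n)] for j in range(n)]
    m2v = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (v2m + m2v)

def joint_loss(logits, targets, video_emb, music_emb, weight=1.0):
    """Token-level generation loss plus a weighted alignment loss.
    The additive combination and `weight` are illustrative assumptions."""
    return autoregressive_loss(logits, targets) + weight * contrastive_loss(video_emb, music_emb)
```

In this sketch the autoregressive term rewards predicting the next music token, while the contrastive term pulls each video embedding toward the embedding of its own soundtrack and away from the other soundtracks in the batch, which is one common way to encourage high-level semantic alignment.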



Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Contrastive Learning
Computer Vision > Generation > Video Generation
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

contrastive learning, video generation, multimodal learning, semantic alignment, music generation, video encoder, video-to-music generation, video-music transformer, joint autoregressive
