AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Moayed Haji-Ali; Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Alper Canberk; Kwot Sin Lee; Vicente Ordonez; Sergey Tulyakov

2025 ICCV ICCV 2025

AV-Link: Temporally-Aligned Diffusion Features for Cross-Modal Audio-Video Generation

Abstract

We propose AV-Link, a unified framework for Video-to-Audio (A2V) and Audio-to-Video (A2V) generation that leverages the activations of frozen video and audio diffusion models for temporally-aligned cross-modal conditioning. The key to our framework is a Fusion Block that facilitates bidirectional information exchange between video and audio diffusion models through temporally-aligned self-attention operations. Unlike prior work that uses dedicated models for A2V and V2A tasks and relies on pretrained feature extractors, AV-Link achieves both tasks in a single self-contained framework, directly leveraging features obtained by the complementary modality (i.e. video features to generate audio, or audio features to generate video). Extensive evaluations demonstrate that AV-Link achieves substantial improvements in audio-video synchronization, outperforming more expensive baselines such as MovieGen V2A model.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Moayed Haji-Ali , Willi Menapace , Aliaksandr Siarohin , Ivan Skorokhodov , Alper Canberk , Kwot Sin Lee , Vicente Ordonez , Sergey Tulyakov

Topics

Deep Learning > Models > Diffusion Models Computer Vision > Generation > Video Generation Computer Vision > Processing > Video Processing Computer Vision > Core AI > Multimodal Learning

Keywords

feature extraction video generation multimodal learning temporal alignment diffusion model audio generation cross-modal generation

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025