DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction

Junwen Xiong; Peng Zhang; Tao You; Chuanyue Li; Wei Huang; Yufei Zha

CVPR 2024

Abstract

Audio-visual saliency prediction can draw support from complementary modalities, but further performance gains are still hindered by customized architectures and task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks, owing to their inherent generalization ability. Following this motivation, this work proposes a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal), which formulates the prediction problem as a conditional generative task of the saliency map, using the input audio and video as conditions. Based on the spatio-temporal audio-visual features, an extra network, Saliency-UNet, is designed to perform multi-modal attention modulation, progressively recovering the ground-truth saliency map from a noisy map. Extensive experiments demonstrate that the proposed DiffSal achieves excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics.
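
The conditional generative formulation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the cosine noise schedule, the deterministic DDIM-style update, the [-1, 1] scaling of the saliency map, and all names (cosine_alpha_bar, add_noise, toy_denoiser, sample) are assumptions made for illustration. In particular, toy_denoiser is a hypothetical stand-in for a learned network such as Saliency-UNet and simply returns a precomputed conditioning map.

```python
import numpy as np

def cosine_alpha_bar(num_steps):
    """Cumulative signal level alpha_bar[t] for t = 0..num_steps (cosine schedule, assumed)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    a = np.cos((t + 0.008) / 1.008 * np.pi / 2) ** 2
    return a / a[0]  # normalise so that alpha_bar[0] == 1 (no noise at t = 0)

def add_noise(x0, t, alpha_bar, rng):
    """Forward process: corrupt a clean saliency map x0 (scaled to [-1, 1]) to noise level t."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def toy_denoiser(x_t, t, cond):
    """Hypothetical placeholder for a trained network: ignores x_t and t and
    returns the conditioning map as its estimate of the clean saliency map."""
    return cond

def sample(denoiser, cond, alpha_bar, steps, rng):
    """Reverse process: start from Gaussian noise and refine it step by step,
    using a deterministic DDIM-style update guided by the conditioning."""
    x = rng.standard_normal(cond.shape)
    ts = np.linspace(len(alpha_bar) - 1, 0, steps + 1).round().astype(int)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        x0_hat = np.clip(denoiser(x, t, cond), -1.0, 1.0)  # predicted clean map
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        x = np.sqrt(alpha_bar[t_prev]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_prev]) * eps_hat
    return x

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)
cond = np.linspace(-1.0, 1.0, 16).reshape(4, 4)  # stand-in for audio-visual conditioning
noisy = add_noise(cond, 500, alpha_bar, rng)      # a training-time noisy target
pred = sample(toy_denoiser, cond, alpha_bar, steps=10, rng=rng)
```

In a realistic setup the denoiser would be a trained network that receives the noisy map, the timestep, and features extracted from the audio and video streams; the sampling loop itself would stay structurally the same.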

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > Scene Understanding
Computer Vision > Processing > Video Understanding
Computer Vision > Core AI > Multimodal Learning
Machine Learning > Learning Types > Multimodal Learning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

image generation, video generation, video prediction, multimodal learning, audio-visual learning, conditional generation, generative model, diffusion model, multimodal attention, multi-modal attention, saliency prediction, generative task
