Panoramic Video Salient Object Detection with Ambisonic Audio Guidance

Xiang Li; Haoyuan Cao; Shijie Zhao; Junlin Li; Li Zhang; Bhiksha Raj

AAAI 2023

Abstract

Video salient object detection (VSOD), a fundamental computer vision problem, has been studied extensively over the last decade. However, all existing works address the VSOD problem in 2D scenarios. With the rapid development of VR devices, panoramic videos have become a promising alternative to 2D videos for providing an immersive experience of the real world. In this paper, we tackle the video salient object detection problem for panoramic videos together with their corresponding ambisonic audio. We propose a multimodal fusion module equipped with two pseudo-siamese audio-visual context fusion (ACF) blocks to conduct audio-visual interaction effectively. The ACF block, equipped with spherical positional encoding, performs the fusion in a 3D context, capturing the spatial correspondence between pixels in the equirectangular frames and sound sources in the ambisonic audio. Experimental results verify the effectiveness of the proposed components and demonstrate that our method achieves state-of-the-art performance on the ASOD60K dataset.
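The abstract does not spell out how the spherical positional encoding is computed. As an illustration of the general idea only, the following minimal sketch maps each pixel of an equirectangular frame to a point on the unit sphere and then applies a sinusoidal encoding to its 3D coordinates. The function names, the axis convention, and the `num_bands` parameter are assumptions made for this sketch and are not taken from the paper.

```python
import math


def equirect_to_unit_sphere(row, col, height, width):
    """Map the centre of pixel (row, col) to a point on the unit sphere.

    Assumed convention (hypothetical, not from the paper): longitude runs
    from -pi at the left edge to +pi at the right edge, and latitude runs
    from +pi/2 at the top to -pi/2 at the bottom.
    """
    lon = ((col + 0.5) / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - (row + 0.5) / height) * math.pi
    x = math.cos(lat) * math.cos(lon)
    y = math.cos(lat) * math.sin(lon)
    z = math.sin(lat)
    return x, y, z


def spherical_positional_encoding(row, col, height, width, num_bands=4):
    """Sinusoidal encoding of a pixel's 3D position on the sphere.

    Returns 6 * num_bands values: a sine and a cosine for each of the
    three coordinates at each frequency band.
    """
    xyz = equirect_to_unit_sphere(row, col, height, width)
    features = []
    for band in range(num_bands):
        freq = (2.0 ** band) * math.pi
        for coord in xyz:
            features.append(math.sin(freq * coord))
            features.append(math.cos(freq * coord))
    return features


if __name__ == "__main__":
    # Pixels at the left and right edges of a frame are neighbours on the
    # sphere, which a plain 2D row/column encoding would not reflect.
    left = equirect_to_unit_sphere(1, 0, 4, 8)
    right = equirect_to_unit_sphere(1, 7, 4, 8)
    print(left, right)
```

Encoding 3D positions rather than 2D image coordinates means that pixels on opposite edges of the equirectangular frame, which are adjacent on the sphere, receive similar codes. A direction derived from the ambisonic audio could in principle be encoded in the same coordinate system.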

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Object Detection
Computer Vision > Processing > Video Processing
Computer Vision > Processing > Video Understanding
Computer Vision > Processing > Image Processing
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

multimodal learning; audio-visual learning; video understanding; audio-visual fusion; salient object detection; multimodal fusion; panoramic video; spherical positional encoding
