Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation

Shaofei Huang; Han Li; Yuqing Wang; Hongji Zhu; Jiao Dai; Jizhong Han; Wenge Rong; Si Liu

IJCAI 2023

Abstract

Audio visual segmentation (AVS) aims to segment the sounding objects in each frame of a given video. Distinguishing sounding objects from silent ones requires both audio-visual semantic correspondence and temporal interaction. The previous method applies multi-frame cross-modal attention to conduct pixel-level interactions between audio features and the visual features of multiple frames simultaneously, which is both redundant and implicit. In this paper, we propose an Audio-Queried Transformer architecture, AQFormer, in which we define a set of object queries conditioned on audio information and associate each of them with particular sounding objects. Explicit object-level semantic correspondence between the audio and visual modalities is established by gathering object information from visual features with the predefined audio queries. In addition, an Audio-Bridged Temporal Interaction module is proposed to exchange sounding-object-relevant information among multiple frames, using audio features as the bridge. Extensive experiments on two AVS benchmarks show that our method achieves state-of-the-art performance, notably gains of 7.1% M_J and 7.6% M_F on the MS3 setting.
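
The object-level idea in the abstract (audio-conditioned queries that gather object information from visual features) can be illustrated with a minimal single-frame sketch. This is not the AQFormer implementation. It assumes that conditioning is a plain element-wise addition of an audio vector to each query, that gathering is one step of scaled dot-product cross-attention, and that a mask is obtained by thresholding the dot product between each object embedding and each pixel feature. All function names and dimensions here are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_conditioned_queries(base_queries, audio_feat):
    # Assumption: condition each query by adding the audio feature to it.
    return [[q + a for q, a in zip(query, audio_feat)] for query in base_queries]


def cross_attend(queries, visual_feats):
    """Each query gathers a weighted average of the pixel features."""
    dim = len(visual_feats[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in queries:
        weights = softmax([dot(query, v) * scale for v in visual_feats])
        outputs.append(
            [sum(w * v[k] for w, v in zip(weights, visual_feats)) for k in range(dim)]
        )
    return outputs


def predict_masks(object_embeds, visual_feats):
    # A positive dot product is the same as sigmoid(logit) > 0.5.
    return [[1 if dot(e, v) > 0 else 0 for v in visual_feats] for e in object_embeds]


# Toy data: 6 "pixels", 2 queries, 4-dimensional features (all hypothetical).
random.seed(0)
dim, num_pixels, num_queries = 4, 6, 2
visual = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_pixels)]
audio = [random.uniform(-1, 1) for _ in range(dim)]
base = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(num_queries)]

queries = audio_conditioned_queries(base, audio)
embeds = cross_attend(queries, visual)
masks = predict_masks(embeds, visual)
```

In this sketch each query yields one binary mask over the pixels, so the correspondence between audio and objects is carried by a few query vectors instead of a pixel-by-pixel interaction. The temporal module described in the abstract is not modeled here.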

Authors

Shaofei Huang, Han Li, Yuqing Wang, Hongji Zhu, Jiao Dai, Jizhong Han, Wenge Rong, Si Liu

Topics

Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Scene Understanding
Computer Vision > Processing > Semantic Segmentation

Keywords

cross-modal attention, sounding object, audio visual segmentation, pixel-level interaction
