Learning Representations from Audio-Visual Spatial Alignment

Pedro Morgado; Yi Li; Nuno Nvasconcelos

2020 NIPS NeurIPS 2020

Learning Representations from Audio-Visual Spatial Alignment

Abstract

We introduce a novel self-supervised pretext task for learning representations from audio-visual content. Prior work on audio-visual representation learning leverages correspondences at the video level. Approaches based on audio-visual correspondence (AVC) predict whether audio and video clips originate from the same or different video instances. Audio-visual temporal synchronization (AVTS) further discriminates negative pairs originated from the same video instance but at different moments in time. While these approaches learn high-quality representations for downstream tasks such as action recognition, they completely disregard the spatial cues of audio and visual signals naturally occurring in the real world. To learn from these spatial cues, we tasked a network to perform contrastive audio-visual spatial alignment of 360\degree video and spatial audio. The ability to perform spatial alignment is enhanced by reasoning over the full spatial content of the 360\degree video using a transformer architecture to combine representations from multiple viewpoints. The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks, including audio-visual correspondence, spatial alignment, action recognition and video semantic segmentation. Dataset and code are available at https://github.com/pedro-morgado/AVSpatialAlignment.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🧭 Keyword Pioneer — audio-visual representation learning

🐣 Hot Topic Early Bird — audio-visual learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Pedro Morgado , Yi Li , Nuno Nvasconcelos

Topics

Machine Learning > Learning Types > Contrastive Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Architectures > Transformers Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Learning Types > Contrastive Learning Deep Learning > Learning Types > Multi-Modal Learning

Keywords

transformer architecture representation learning contrastive learning self-supervised learning audio-visual learning spatial alignment audio-visual representation learning

Download PDF

Related papers

Higher-Order Spectral Clustering of Directed Graphs 2020

Self-Supervised MultiModal Versatile Networks 2020

Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates 2020

Causal Intervention for Weakly-Supervised Semantic Segmentation 2020

Taming Discrete Integration via the Boon of Dimensionality 2020