Audiovisual Masked Autoencoders

Mariana-Iuliana Georgescu; Eduardo Fonseca; Radu Tudor Ionescu; Mario Lucic; Cordelia Schmid; Anurag Arnab

ICCV 2023

Abstract

Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state of the art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.

Topics

Machine Learning > Learning Types > Self-Supervised Learning

Deep Learning > Architectures > Autoencoders

Deep Learning > Techniques > Pretraining

Keywords

representation learning, transfer learning, self-supervised learning, multimodal learning, masked autoencoder, audiovisual learning
