Distilling Audio-Visual Knowledge by Compositional Contrastive Learning

Yanbei Chen; Yongqin Xian; A. Sophia Koepke; Ying Shan; Zeynep Akata

CVPR 2021

Abstract

Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to be done faster compared to learning from a single modality. In this work, we propose to transfer knowledge across heterogeneous modalities, even though these data modalities may not be semantically correlated. Rather than directly aligning the representations of different modalities, we compose audio, image, and video representations across modalities to uncover the richer multi-modal knowledge. Our main idea is to learn a compositional embedding that closes the cross-modal semantic gap and captures the task-relevant semantics, which facilitates pulling together representations across modalities by compositional contrastive learning. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets: UCF101, ActivityNet, and VGGSound. Moreover, we demonstrate that our model significantly outperforms a variety of existing knowledge distillation methods in transferring audio-visual knowledge to improve video representation learning.
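
The idea of composing representations across modalities and then pulling them together contrastively can be illustrated with a small sketch. This is not the authors' implementation. The fixed weighted sum in `compose`, the InfoNCE-style loss, the temperature value, and all function names are assumptions chosen for illustration; the paper's compositional embedding is learned and its full objective is not reproduced here.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def compose(video, audio, weight=0.5):
    """Hypothetical composition: video embedding plus a scaled audio residual.

    A stand-in for a learned compositional embedding, not the paper's function.
    """
    return l2_normalize([v + weight * a for v, a in zip(video, audio)])


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE-style loss: anchor i should match positive i against the
    other positives in the batch, which act as negatives."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) / temperature for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch of random unit-norm "video" and "audio" embeddings (assumed data).
random.seed(0)
dim, batch = 8, 4
video = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(batch)]
audio = [l2_normalize([random.gauss(0, 1) for _ in range(dim)]) for _ in range(batch)]

# Compose each video embedding with its paired audio embedding, then measure
# how well each video embedding matches its own composition within the batch.
composed = [compose(v, a) for v, a in zip(video, audio)]
loss = info_nce(video, composed)
```

Minimizing such a loss would pull each video embedding toward the composition built from its own audio-visual pair and push it away from compositions of other samples. In a real system the embeddings would come from trained networks rather than random vectors.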

Topics

Machine Learning > Learning Types > Contrastive Learning
Machine Learning > Application Areas > Knowledge Distillation
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Contrastive Learning
Deep Learning > Learning Types > Knowledge Distillation

Keywords

representation learning, contrastive learning, knowledge distillation, multi-modal learning, audio-visual learning, video representation
