Self-Supervised Learning of Video-Induced Visual Invariances

Michael Tschannen; Josip Djolonga; Marvin Ritter; Aravindh Mahendran; Neil Houlsby; Sylvain Gelly; Mario Lucic

2020 CVPR CVPR 2020

Self-Supervised Learning of Video-Induced Visual Invariances

Abstract

We propose a general framework for self-supervised learning of transferable visual representations based on Video-Induced Visual Invariances (VIVI). We consider the implicit hierarchy present in the videos and make use of (i) frame-level invariances (e.g. stability to color and contrast perturbations), (ii) shot/clip-level invariances (e.g. robustness to changes in object orientation and lighting conditions), and (iii) video-level invariances (semantic relationships of scenes across shots/clips), to define a holistic self-supervised loss. Training models using different variants of the proposed framework on videos from the YouTube-8M (YT8M) data set, we obtain state-of-the-art self-supervised transfer learning results on the 19 diverse downstream tasks of the Visual Task Adaptation Benchmark (VTAB), using only 1000 labels per task. We then show how to co-train our models jointly with labeled images, outperforming an ImageNet-pretrained ResNet-50 by 0.8 points with 10x fewer labeled images, as well as the previous best supervised model by 3.7 points using the full ImageNet data set.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — visual invariance

🐣 Hot Topic Early Bird — video representation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Michael Tschannen , Josip Djolonga , Marvin Ritter , Aravindh Mahendran , Neil Houlsby , Sylvain Gelly , Mario Lucic

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Machine Learning > Learning Types > Self-Supervised Learning Computer Vision > Processing > Video Processing Deep Learning > Learning Types > Self-Supervised Learning Machine Learning > Learning Paradigms > Self-Supervised Learning Computer Vision > Core AI > Computer Vision Deep Learning > Learning Types > Transfer Learning Deep Learning > Techniques > Representation Learning

Keywords

representation learning transfer learning self-supervised learning video understanding video representation transferable representation visual invariance

Download PDF

Related papers

Deep Polarization Cues for Transparent Object Segmentation 2020

HRank: Filter Pruning Using High-Rank Feature Map 2020

Panoptic-Based Image Synthesis 2020

Select, Supplement and Focus for RGB-D Saliency Detection 2020

ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings 2020