Connecting Joint-Embedding Predictive Architecture with Contrastive Self-supervised Learning

Shentong Mo; Shengbang Tong

NeurIPS 2024

Abstract

In recent advancements in unsupervised visual representation learning, the Joint-Embedding Predictive Architecture (JEPA) has emerged as a significant method for extracting visual features from unlabeled imagery through an innovative masking strategy. Despite its success, two primary limitations have been identified: the Exponential Moving Average (EMA) used in the Image-based JEPA (I-JEPA) is ineffective at preventing entire collapse, and the I-JEPA prediction does not accurately learn the mean of patch representations. To address these challenges, this study introduces C-JEPA (Contrastive-JEPA), a framework that integrates I-JEPA with the Variance-Invariance-Covariance Regularization (VICReg) strategy. This integration is designed to learn the variance and covariance needed to prevent entire collapse and to ensure invariance in the mean of augmented views, thereby addressing both limitations. Through empirical and theoretical evaluations, our work demonstrates that C-JEPA significantly enhances the stability and quality of visual representation learning. When pre-trained on the ImageNet-1K dataset, C-JEPA converges faster and reaches better performance in both linear probing and fine-tuning.
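
To make the three regularization terms named in the abstract concrete, here is a minimal sketch of VICReg-style variance, invariance, and covariance terms in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy embeddings, the hinge target `gamma`, and the loss weights are all hypothetical choices, and the abstract does not say which representations in C-JEPA the terms are applied to.

```python
import math

def column_means(z):
    """Mean of each embedding dimension over a batch (list of equal-length rows)."""
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]

def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge on the per-dimension standard deviation.
    Penalizes dimensions whose std falls below gamma (assumed target)."""
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - mu[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d

def covariance_term(z):
    """Sum of squared off-diagonal covariance entries, divided by the dimension."""
    n, d = len(z), len(z[0])
    mu = column_means(z)
    total = 0.0
    for i in range(d):
        for j in range(d):
            if i != j:
                cov = sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in z) / (n - 1)
                total += cov ** 2
    return total / d

def invariance_term(za, zb):
    """Mean squared Euclidean distance between paired embeddings of two views."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(ra, rb)) for ra, rb in zip(za, zb)
    ) / len(za)

def vicreg_style_loss(za, zb, inv_w=25.0, var_w=25.0, cov_w=1.0):
    """Weighted sum of the three terms; the weights are illustrative assumptions."""
    return (
        inv_w * invariance_term(za, zb)
        + var_w * (variance_term(za) + variance_term(zb))
        + cov_w * (covariance_term(za) + covariance_term(zb))
    )

# Toy batches: a fully collapsed one, a well-spread one, and a correlated one.
collapsed = [[1.0, 2.0] for _ in range(4)]
spread = [[2.0, 0.0], [-2.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
correlated = [[1.0, 1.0], [-1.0, -1.0], [2.0, 2.0], [-2.0, -2.0]]
```

On these toy batches, the collapsed batch incurs a variance penalty while the spread batch does not, and only the correlated batch incurs a covariance penalty. This is the sense in which variance and covariance terms discourage collapse, while the invariance term pulls the embeddings of paired views together.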

Topics

Machine Learning > Learning Types > Contrastive Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Architectures > Neural Networks
Deep Learning > Techniques > Contrastive Learning
Deep Learning > Learning Types > Self-Supervised Learning
Computer Vision > Core AI > Computer Vision

Keywords

unsupervised learning, contrastive learning, self-supervised learning, visual representation, visual representation learning, joint-embedding predictive architecture, variance invariance, joint embedding predictive architecture, variance invariance covariance regularization, image-based joint-embedding predictive architecture, image-based joint-embedding
