WACV 2025

Local Masked Reconstruction for Efficient Self-Supervised Learning on High-Resolution Images

Abstract

Self-supervised learning for computer vision has progressed tremendously and improved many downstream vision tasks, such as image classification, semantic segmentation, and object detection. Among these, generative self-supervised vision learning approaches such as MAE and BEiT show promising performance. However, their global reconstruction mechanism is computationally demanding, especially for high-resolution images, and the computational cost increases substantially when scaled to a large dataset. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that reconstructs image patches from small neighboring regions. The strategy can be easily integrated into any generative self-supervised learning technique and improves the trade-off between efficiency and accuracy compared to reconstruction over the entire image. LoMaR is 2.5x faster than MAE and 5.0x faster than BEiT on 384x384 ImageNet pretraining, and surpasses them by 0.2% and 0.8% in accuracy, respectively. It is 2.1x faster than MAE on iNaturalist pretraining and gains 0.2% in accuracy. On MS COCO, LoMaR outperforms MAE by 0.5 APbox on object detection and 0.5 APmask on instance segmentation. It also outperforms MAE by 0.2% on semantic segmentation. Our code and pretrained models are available at: https://github.com/junchen14/LoMaR.
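
To make the idea of reconstructing from a small neighboring region concrete, the following is a minimal sketch, not the released implementation. It samples one square window on the patch grid, masks a fraction of the patches inside it, and compares the number of pairwise attention interactions for the window against the full image. The 16-pixel patch size, the 7x7 window, the 0.8 mask ratio, and the helper names `sample_local_window` and `attention_pairs` are all illustrative assumptions that the abstract does not state.

```python
import random


def sample_local_window(grid_size, window_size, mask_ratio, rng):
    """Pick a random square window on the patch grid and mask part of it.

    Returns (visible, masked): lists of flat patch indices into the full grid.
    Hypothetical helper for illustration only.
    """
    # Top-left corner chosen so the window fits inside the grid (randint is inclusive).
    top = rng.randint(0, grid_size - window_size)
    left = rng.randint(0, grid_size - window_size)
    window_idx = [
        r * grid_size + c
        for r in range(top, top + window_size)
        for c in range(left, left + window_size)
    ]
    # Mask a fixed fraction of the window; the rest stays visible to the encoder.
    num_masked = round(mask_ratio * len(window_idx))
    shuffled = window_idx[:]
    rng.shuffle(shuffled)
    return shuffled[num_masked:], shuffled[:num_masked]


def attention_pairs(num_tokens):
    """Pairwise interactions in one self-attention layer (quadratic in tokens)."""
    return num_tokens * num_tokens


# Assumed setup: 384x384 input, 16x16 patches, 7x7 window, 80% masking.
grid_size = 384 // 16
rng = random.Random(0)
visible, masked = sample_local_window(grid_size, window_size=7, mask_ratio=0.8, rng=rng)

global_cost = attention_pairs(grid_size * grid_size)  # all patches of the image
local_cost = attention_pairs(7 * 7)                   # one local window
print(len(visible), len(masked), global_cost, local_cost)
```

Under these assumptions, a single window involves far fewer pairwise interactions than the whole patch grid. The count is only a rough indication: an actual pretraining setup may sample several windows per image and process different token subsets in the encoder and decoder, so this ratio should not be read as the reported speedup.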
