Local Masked Reconstruction for Efficient Self-Supervised Learning on High-Resolution Images

Jun Chen; Faizan Farooq Khan; Ming Hu; Ammar Sherif; Zongyuan Ge; Boyang Li; Mohamed Elhoseiny

WACV 2025

Abstract

Self-supervised learning for computer vision has progressed tremendously and improved many downstream vision tasks, such as image classification, semantic segmentation, and object detection. Among these approaches, generative self-supervised methods such as MAE and BEiT show promising performance. However, their global reconstruction mechanism is computationally demanding, especially for high-resolution images, and the cost grows further when pretraining is scaled to large datasets. To address this issue, we propose local masked reconstruction (LoMaR), a simple yet effective approach that reconstructs image patches from small neighboring regions. The strategy can be easily integrated into any generative self-supervised learning technique, and it improves the trade-off between efficiency and accuracy compared to reconstruction over the entire image. LoMaR is 2.5x faster than MAE and 5.0x faster than BEiT on 384x384 ImageNet pretraining, and surpasses them by 0.2% and 0.8% in accuracy, respectively. It is 2.1x faster than MAE on iNaturalist pretraining and gains 0.2% in accuracy. On MS COCO, LoMaR outperforms MAE by 0.5 APbox on object detection and 0.5 APmask on instance segmentation. It also outperforms MAE by 0.2% on semantic segmentation. Our code and pretrained models are available at: https://github.com/junchen14/LoMaR.
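The idea of reconstructing patches from a small neighboring region, rather than from the whole image, can be illustrated with a minimal sketch. The abstract gives only the 384x384 input size; the patch size (16), window size (7x7 patches), and mask ratio (0.8) below are assumptions chosen for illustration, and the helper names `sample_local_window` and `mask_window` are hypothetical and not taken from the released code.

```python
import random


def sample_local_window(grid_size, window_size, rng):
    """Pick a random square window on the patch grid; return flat patch indices."""
    top = rng.randrange(grid_size - window_size + 1)
    left = rng.randrange(grid_size - window_size + 1)
    return [
        (top + r) * grid_size + (left + c)
        for r in range(window_size)
        for c in range(window_size)
    ]


def mask_window(window_indices, mask_ratio, rng):
    """Split the window's patches into (visible, masked) index lists."""
    num_masked = round(len(window_indices) * mask_ratio)
    masked = set(rng.sample(window_indices, num_masked))
    visible = [i for i in window_indices if i not in masked]
    return visible, sorted(masked)


# Illustrative settings: only image_size comes from the abstract.
image_size = 384   # from the abstract (384x384 pretraining)
patch_size = 16    # assumption
window_size = 7    # assumption: window side length, in patches
mask_ratio = 0.8   # assumption

grid_size = image_size // patch_size          # patches per image side
rng = random.Random(0)
window = sample_local_window(grid_size, window_size, rng)
visible, masked = mask_window(window, mask_ratio, rng)

# Sequence lengths a transformer would see in each setting.
global_tokens = grid_size ** 2                # every patch in the image
local_tokens = window_size ** 2               # only patches inside the window

print("patches in full image:", global_tokens)
print("patches in one window:", local_tokens)
print("visible / masked in window:", len(visible), "/", len(masked))
```

Under these assumed settings, a model working on one window sees 49 patch tokens instead of 576, and the masked patches are predicted only from visible patches inside the same window. The token counts are meant to show why restricting reconstruction to a neighborhood shrinks the self-attention input; they are not a derivation of the speed-ups reported in the abstract, which also depend on how many windows are sampled per image and on the baseline's own encoder design.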

Authors

Jun Chen, Faizan Farooq Khan, Ming Hu, Ammar Sherif, Zongyuan Ge, Boyang Li, Mohamed Elhoseiny

Topics

Machine Learning > Learning Types > Self-Supervised Learning

Machine Learning > Application Areas > Efficient Computing

Deep Learning > Architectures > Transformers

Deep Learning > Techniques > Pretraining

Deep Learning > Techniques > Self-Supervised Learning

Keywords

image classification, semantic segmentation, vision transformer, object detection, self-supervised learning, masked autoencoder, efficient learning, masked reconstruction, image patch
