FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding

Dong Jing; Xiaolong He; Yutian Luo; Nanyi Fei; Guoxing Yang; Wei Wei; Huiwen Zhao; Zhiwu Lu

NeurIPS 2024

Abstract

Contrastive Language-Image Pre-training (CLIP) achieves impressive performance on tasks like image classification and image-text retrieval by learning on large-scale image-text datasets. However, CLIP struggles with dense prediction tasks due to its poor grasp of fine-grained details. Although existing works address this issue, they achieve limited improvements and usually sacrifice the important visual-semantic consistency. To overcome these limitations, we propose FineCLIP, which keeps the global contrastive learning to preserve visual-semantic consistency and further enhances fine-grained understanding through two innovations: 1) a real-time self-distillation scheme that facilitates the transfer of representation capability from global to local features; 2) a semantically-rich regional contrastive learning paradigm with generated region-text pairs, boosting local representation capabilities with abundant fine-grained knowledge. Both cooperate to fully leverage diverse semantics and multi-grained complementary information. To validate the superiority of FineCLIP and the rationality of each design, we conduct extensive experiments on challenging dense prediction and image-level tasks. All the observations demonstrate the effectiveness of FineCLIP.
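The abstract names three training signals: global contrastive learning, self-distillation from global to local features, and regional contrastive learning on generated region-text pairs. The following is a minimal sketch of how such signals might be combined, not the authors' implementation. The symmetric InfoNCE form is the standard CLIP-style loss. The cosine-based distillation term, the use of crop embeddings as the teacher, the loss weights, the temperature, and all function and parameter names are assumptions made for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """CLIP-style symmetric InfoNCE: row i of `a` is the positive for row i of `b`."""
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_a_to_b + loss_b_to_a)


def self_distillation_loss(region_emb, crop_emb):
    """Assumed form: mean (1 - cosine similarity) between local region features
    (student) and global embeddings of the matching image crops (teacher)."""
    s, t = l2_normalize(region_emb), l2_normalize(crop_emb)
    return (1.0 - (s * t).sum(axis=-1)).mean()


def fineclip_style_loss(image_emb, text_emb, region_emb, crop_emb,
                        region_text_emb, w_distill=1.0, w_region=1.0):
    """Hypothetical combination of the three signals; the weights are assumptions."""
    loss_global = symmetric_contrastive_loss(image_emb, text_emb)
    loss_distill = self_distillation_loss(region_emb, crop_emb)
    loss_region = symmetric_contrastive_loss(region_emb, region_text_emb)
    return loss_global + w_distill * loss_distill + w_region * loss_region


# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
images, texts = rng.normal(size=(4, 16)), rng.normal(size=(4, 16))
regions, crops = rng.normal(size=(6, 16)), rng.normal(size=(6, 16))
region_texts = rng.normal(size=(6, 16))
total = fineclip_style_loss(images, texts, regions, crops, region_texts)
```

In this sketch the global term is left untouched by the regional terms, which mirrors the abstract's point that global contrastive learning is kept to preserve visual-semantic consistency while the two additional terms target local features.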

Topics

Machine Learning > Learning Types > Contrastive Learning
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Scene Understanding
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Techniques > Contrastive Learning
Deep Learning > Techniques > Self-Supervised Learning
Deep Learning > Techniques > Knowledge Distillation
Deep Learning > Models > Vision-Language Models

Keywords

contrastive learning, multimodal learning, vision-language model, image-text retrieval, dense prediction, fine-grained understanding, contrastive language-image pretraining, regional contrastive learning, visual-semantic consistency, regional contrast
