DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Sophia Sirko-Galouchenko; Spyros Gidaris; Antonin Vobecky; Andrei Bursuc; Nicolas Thome

ICCV 2025

Abstract

We introduce DIP, a novel unsupervised post-training method designed to enhance dense representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches using complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder. DIP is simple, unsupervised, and computationally efficient, requiring under 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations.
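The abstract does not spell out how an in-context dense prediction is computed. One common formulation in the in-context scene understanding literature propagates labels from annotated support patches to query patches through feature similarity. The sketch below illustrates that idea only; it is not taken from the paper, and the function names, the toy 2-D features, and the temperature value are all hypothetical choices made for illustration.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (zero vectors are left as-is)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def in_context_predict(query_feats, support_feats, support_labels,
                       num_classes, temperature=0.1):
    """Predict class probabilities for each query patch.

    Assumption (not from the paper): each query patch attends to all
    support patches with softmax-normalized cosine similarity, and the
    support labels are averaged with those attention weights.
    """
    support = [l2_normalize(s) for s in support_feats]
    predictions = []
    for query in query_feats:
        q = l2_normalize(query)
        # Cosine similarity to every support patch, sharpened by temperature.
        sims = [sum(a * b for a, b in zip(q, s)) / temperature for s in support]
        # Numerically stable softmax over the support patches.
        max_sim = max(sims)
        weights = [math.exp(x - max_sim) for x in sims]
        total = sum(weights)
        # Accumulate attention mass per class label.
        probs = [0.0] * num_classes
        for weight, label in zip(weights, support_labels):
            probs[label] += weight / total
        predictions.append(probs)
    return predictions


# Toy "task": four labeled support patches, two unlabeled query patches.
support_feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
query_feats = [[0.8, 0.2], [0.2, 0.7]]

probs = in_context_predict(query_feats, support_feats, support_labels,
                           num_classes=2)
pred_labels = [max(range(2), key=p.__getitem__) for p in probs]
print(pred_labels)
```

In a setup like this, the quality of the prediction depends entirely on how well the patch features separate semantic regions, which is the property a dense post-training method aims to improve. In an unsupervised setting, the support labels would be pseudo-labels rather than ground-truth annotations.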

Authors

Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Unsupervised Learning
Computer Vision > Analysis > Scene Understanding
Machine Learning > Learning Paradigms > Meta-Learning
Deep Learning > Learning Types > Self-Supervised Learning
Computer Vision > Core AI > Computer Vision

Keywords

unsupervised learning, representation learning, in-context learning, visual representation, vision encoder, dense representation
