Open Ad-hoc Categorization with Contextualized Feature Learning

Zilin Wang; Sangwoo Mo; Stella X. Yu; Sima Behpour; Liu Ren

CVPR 2025

Abstract

Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories, such as things to sell at a garage sale, are created dynamically to achieve specific tasks. We study open ad-hoc categorization, where the goal is to infer novel concepts and categorize images based on a given context, a small set of labeled exemplars, and some unlabeled data. We have two key insights: 1) recognizing ad-hoc categories relies on the same perceptual processes as common categories; 2) novel concepts can be discovered semantically by expanding contextual cues or visually by clustering similar patterns. We propose OAK, a simple model that introduces a single learnable context token into CLIP, trained with CLIP's objective of aligning visual and textual features and GCD's objective of clustering similar images. On the Stanford and Clevr-4 datasets, OAK consistently achieves state-of-the-art accuracy and concept discovery across multiple categorizations, including 87.4% novel accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK generates interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling accurate and flexible categorization.
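
The training recipe described in the abstract (one learnable context token, a text-alignment objective, and a clustering objective) can be illustrated with a rough sketch. This is not the authors' implementation: every function name, the temperature value, and the loss weight are hypothetical, and the two-view contrastive term is only a generic stand-in for GCD's clustering objective, whose exact form is not given here. Features are plain Python lists so the example runs without any dependencies.

```python
import math

def normalize(v):
    # Scale a vector to unit length (guarding against a zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine_logits(queries, keys, temperature):
    # Temperature-scaled cosine similarity between every query and every key.
    qs = [normalize(q) for q in queries]
    ks = [normalize(k) for k in keys]
    return [[sum(a * b for a, b in zip(q, k)) / temperature for k in ks]
            for q in qs]

def cross_entropy(logits, targets):
    # Mean softmax cross-entropy, using the log-sum-exp trick for stability.
    total = 0.0
    for row, t in zip(logits, targets):
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[t]
    return total / len(targets)

def prepend_context_token(patch_tokens, context_token):
    # Assumption: the single shared context token is placed in front of each
    # image's token sequence before it enters the (frozen) vision encoder.
    return [[list(context_token)] + [list(t) for t in seq]
            for seq in patch_tokens]

def text_alignment_loss(image_feats, text_feats, labels, temperature=0.07):
    # CLIP-style term: labeled images should match their class-name text feature.
    return cross_entropy(cosine_logits(image_feats, text_feats, temperature),
                         labels)

def two_view_loss(view_a, view_b, temperature=0.07):
    # Stand-in clustering term: two augmented views of the same unlabeled
    # image should be more similar to each other than to other images.
    targets = list(range(len(view_a)))
    return cross_entropy(cosine_logits(view_a, view_b, temperature), targets)

def combined_loss(image_feats, text_feats, labels, view_a, view_b, weight=1.0):
    # Hypothetical weighting; the source does not state how the terms are balanced.
    return (text_alignment_loss(image_feats, text_feats, labels)
            + weight * two_view_loss(view_a, view_b))
```

In a real setup only the context token would receive gradients from `combined_loss`, which is what lets the same backbone be steered toward different contexts such as Action, Mood, or Location.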

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Unsupervised Learning
Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Object Detection
Machine Learning > Learning Types > Few-Shot Learning
Deep Learning > Learning Types > Contrastive Learning
Computer Vision > Analysis > Image Classification
Deep Learning > Models > Vision-Language Models

Keywords

image classification, contrastive learning, zero-shot learning, image clustering, vision-language model, open vocabulary, concept discovery, ad-hoc categorization, contextual feature learning, novel concept discovery, contextualized feature learning
