Visual Textualization for Image Prompted Object Detection

Yongjian Wu; Yang Zhou; Jiya Saiyin; Bingzheng Wei; Yan Xu

ICCV 2025

Abstract

We propose VisTex-OVLM, a novel image prompted object detection method that introduces visual textualization, a process that projects a few visual exemplars into the text feature space. This enhances the ability of Object-level Vision-Language Models (OVLMs) to detect rare categories that are difficult to describe textually and nearly absent from their pre-training data, while preserving their pre-trained object-text alignment. Specifically, VisTex-OVLM leverages multi-scale textualizing blocks and a multi-stage fusion strategy to integrate visual information from visual exemplars, generating textualized visual tokens that effectively guide OVLMs alongside text prompts. Unlike previous methods, our method keeps the original OVLM architecture, preserving its generalization capabilities while improving performance in few-shot settings. VisTex-OVLM demonstrates superior performance across open-set datasets that have minimal overlap with the OVLM's pre-training data and achieves state-of-the-art results on the few-shot benchmarks PASCAL VOC and MSCOCO. The code will be released at VisTex-OVLM.
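The idea of visual textualization can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the internals of the textualizing blocks or the fusion strategy, so the sketch assumes each block is global average pooling followed by a linear projection, and that scales are fused by simple averaging. All names (`textualize`, `build_prompt`), shapes, and random weights are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def textualize(feature_map, weight, bias):
    """Assumed textualizing block: pool a (C, H, W) map, project to text dim D."""
    pooled = feature_map.mean(axis=(1, 2))  # (C,)
    return pooled @ weight + bias           # (D,)


def build_prompt(exemplars, projections, text_tokens):
    """Append one textualized visual token per exemplar to the text tokens.

    exemplars:   list over shots; each item is a list over scales of
                 (C_s, H_s, W_s) feature maps from some visual encoder.
    projections: list over scales of (weight, bias) pairs.
    text_tokens: (T, D) text prompt embeddings.
    """
    visual_tokens = []
    for per_scale_maps in exemplars:
        per_scale_tokens = [
            textualize(fmap, w, b)
            for fmap, (w, b) in zip(per_scale_maps, projections)
        ]
        # Assumed fusion: average the per-scale tokens into one token.
        visual_tokens.append(np.mean(per_scale_tokens, axis=0))
    visual_tokens = np.stack(visual_tokens)  # (shots, D)
    # The text tokens are left untouched; visual tokens sit alongside them.
    return np.concatenate([text_tokens, visual_tokens], axis=0)


# Hypothetical sizes: 3 exemplars (shots), 2 scales, text dimension 16.
text_dim = 16
scale_channels = [4, 6]
scale_sizes = [(8, 8), (4, 4)]
projections = [
    (rng.normal(size=(c, text_dim)), np.zeros(text_dim))
    for c in scale_channels
]
exemplars = [
    [rng.normal(size=(c, h, w)) for c, (h, w) in zip(scale_channels, scale_sizes)]
    for _ in range(3)
]
text_tokens = rng.normal(size=(5, text_dim))

prompt = build_prompt(exemplars, projections, text_tokens)
print(prompt.shape)  # (8, 16): 5 text tokens + 3 textualized visual tokens
```

In this sketch the text tokens pass through unchanged and the detector's architecture is not modified, which mirrors the abstract's claim that the pre-trained object-text alignment is preserved; only the projection weights would be learned.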

Authors

Yongjian Wu, Yang Zhou, Jiya Saiyin, Bingzheng Wei, Yan Xu

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Computer Vision > Analysis > Object Detection
Machine Learning > Learning Paradigms > Few-Shot Learning
Deep Learning > Models > Large Language Models

Keywords

few-shot learning, object detection, open-vocabulary detection, vision-language model, few-shot object detection, open-set detection, visual textualization, image prompted detection, object-level vision-language model
