VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Seunggu Kang; WonJun Moon; Euiyeon Kim; Jae-Pil Heo

2024 AAAI AAAI 2024

VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting

Abstract

Abstract Zero-Shot Object Counting~(ZSOC) aims to count referred instances of arbitrary classes in a query image without human-annotated exemplars. To deal with ZSOC, preceding studies proposed a two-stage pipeline: discovering exemplars and counting. However, there remains a challenge of vulnerability to error propagation of the sequentially designed two-stage process. In this work, we propose an one-stage baseline, Visual-Language Baseline (VLBase), exploring the implicit association of the semantic-patch embeddings of CLIP. Subsequently, we extend the VLBase to Visual-language Counter (VLCounter) by incorporating three modules devised to tailor VLBase for object counting. First, we introduce Semantic-conditioned Prompt Tuning (SPT) within the image encoder to acquire target-highlighted representations. Second, Learnable Affine Transformation (LAT) is employed to translate the semantic-patch similarity map to be appropriate for the counting task. Lastly, we transfer the layer-wisely encoded features to the decoder through Segment-aware Skip Connection (SaSC) to keep the generalization capability for unseen classes. Through extensive experiments on FSC147, CARPK, and PUCPR+, we demonstrate the benefits of our end-to-end framework, VLCounter. Code is available at https://github.com/seunggu0305/VLCounter

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — semantic-patch embedding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Seunggu Kang , WonJun Moon , Euiyeon Kim , Jae-Pil Heo

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Zero-Shot Learning Deep Learning > Techniques > Pretraining Computer Vision > Analysis > Object Detection Deep Learning > Learning Types > Zero-Shot Learning Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning object counting semantic embedding vision-language model visual-language model clip model prompt tuning affine transformation zero-shot object counting semantic-patch embedding

Download PDF

Related papers

Goal Alignment: Re-analyzing Value Alignment Problems Using Human-Aware AI 2024

Meta-Inverse Reinforcement Learning for Mean Field Games via Probabilistic Context Variables 2024

Suppressing Uncertainty in Gaze Estimation 2024

Mask-Homo: Pseudo Plane Mask-Guided Unsupervised Multi-Homography Estimation 2024

Heterogeneous Test-Time Training for Multi-Modal Person Re-identification 2024