T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting

Yifei Qian; Zhongliang Guo; Bowen Deng; Chun Tong Lei; Shuai Zhao; Chun Pong Lau; Xiaopeng Hong; Michael P. Pound

CVPR 2025

Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/cha15yq/T2ICount
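
The abstract mentions cross-attention maps extracted from the denoising U-Net but does not spell out how such a map is formed. As a rough illustration only, the sketch below computes a generic scaled dot-product cross-attention map between image-location queries and text-token keys. It is not the authors' implementation: the function name, the toy vectors, and the 2x2 layout are all hypothetical, and a real U-Net would apply learned projections and multiple heads first.

```python
import math


def cross_attention_map(queries, keys):
    """Generic softmax(Q K^T / sqrt(d)); one row per image location.

    queries: list of image-location feature vectors (hypothetical toy data)
    keys:    list of text-token feature vectors (hypothetical toy data)
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for q in queries:
        # Similarity of this image location to every text token.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        # Numerically stable softmax over the text tokens.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        attn.append([e / total for e in exps])
    return attn


# Toy example (assumption): 4 image locations, 2 text tokens, feature size 2.
queries = [[4.0, 0.0], [0.0, 4.0], [4.0, 0.0], [0.0, 4.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]

attn = cross_attention_map(queries, keys)

# Spatial map for text token 0: how strongly each location attends to it.
token_map = [row[0] for row in attn]
```

In this toy setup, locations whose features align with token 0 receive most of that token's attention mass, which is the kind of spatial signal a region-level loss could draw on.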

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > Object Detection
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Zero-Shot Learning
Deep Learning > Learning Types > Multimodal Learning

Keywords

cross-modal learning; diffusion model; text-to-image model; vision-language model; semantic correction; cross-modal understanding; text-image alignment; zero-shot counting; zero-shot object counting; hierarchical semantic correction
