Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection

Moyuru Yamada; Nimish Dharamshi; Ayushi Kohli; Prasad Kasu; Ainulla Khan; Manu Ghulyani

WACV 2025

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions as <human, action, object> triplets. Recent advances in pre-trained vision-language models (VLMs) have improved zero-shot HOI detection, enabling the identification of unseen triplets. However, existing methods use the VLM only as an additional encoder for interaction prediction, not for human/object detection. This limitation hinders their ability to detect unseen objects. Furthermore, the additional encoder increases both model size and computational cost. This paper proposes a novel HOI detection framework, ECI-HOI, which unleashes the potential of the pre-trained VLM for zero-shot HOI detection by leveraging it for both sub-tasks. We first employ CLIP as a single image encoder, reducing redundancy in the network architecture. In addition, we propose an instance selector and an HO pair decoder to effectively harmonize human/object detection and interaction prediction in a zero-shot manner. We evaluate our model under various settings on HICO-DET and on our two new test sets: an out-of-distribution image test set and a novel object test set. Our model outperforms state-of-the-art models while reducing model size by more than 50%, notably achieving a +10.01 mAP improvement under the unseen object setting on HICO-DET. The results on the proposed datasets highlight the zero-shot performance of our model in more challenging settings.
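To make the zero-shot idea concrete, the following is a minimal sketch of generic CLIP-style zero-shot classification applied to interaction labels: a feature for one human-object pair is compared against text embeddings of candidate triplet prompts by cosine similarity. It is not the ECI-HOI architecture, whose details the abstract does not give. The function names, the prompt wording, the temperature value, and the three-dimensional toy vectors are all assumptions for illustration; in practice the embeddings would come from a pre-trained VLM.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so that dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def zero_shot_scores(pair_embedding, text_embeddings, temperature=0.07):
    """Return a softmax distribution over prompts for one human-object pair.

    pair_embedding: hypothetical visual feature for one human-object pair.
    text_embeddings: dict mapping a prompt string to its (hypothetical) text feature.
    temperature: assumed scaling factor applied to the cosine similarities.
    """
    query = l2_normalize(pair_embedding)
    logits = {}
    for prompt, embedding in text_embeddings.items():
        text = l2_normalize(embedding)
        cosine = sum(a * b for a, b in zip(query, text))
        logits[prompt] = cosine / temperature

    # Numerically stable softmax over the scaled similarities.
    max_logit = max(logits.values())
    exps = {p: math.exp(v - max_logit) for p, v in logits.items()}
    total = sum(exps.values())
    return {p: v / total for p, v in exps.items()}


# Toy stand-ins for VLM outputs (illustrative values only).
TEXT_EMBEDDINGS = {
    "a person riding a bicycle": [0.9, 0.1, 0.0],
    "a person holding a cup": [0.0, 0.2, 0.9],
    "a person feeding a zebra": [0.1, 0.9, 0.1],
}
PAIR_EMBEDDING = [0.2, 0.8, 0.1]

scores = zero_shot_scores(PAIR_EMBEDDING, TEXT_EMBEDDINGS)
best = max(scores, key=scores.get)
print(best)
```

Because the label set is defined only by text prompts, a triplet that never appeared during training can be scored simply by adding its prompt to the dictionary, which is the property that zero-shot HOI detection relies on.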

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Artificial Intelligence > Learning Paradigms > Transfer Learning; Computer Vision > Analysis > Object Detection; Artificial Intelligence > Learning Paradigms > Zero-Shot Learning; Machine Learning > Learning Types > Multi-Modal Learning; Deep Learning > Learning Types > Zero-Shot Learning

Keywords

zero-shot learning; object detection; instance selection; vision-language model; CLIP model; human-object interaction detection
