WACV 2025

Unleashing Potentials of Vision-Language Models for Zero-Shot HOI Detection

Abstract

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions as <human, action, object> triplets. Recent advances in pre-trained vision-language models (VLMs) have improved zero-shot HOI detection, enabling the identification of unseen triplets. However, existing methods use the VLM only as an additional encoder for interaction prediction, not for human/object detection. This limitation hinders their ability to detect unseen objects. Furthermore, the additional encoder increases both model size and computational cost. This paper proposes a novel HOI detection framework, ECI-HOI, which unleashes the potential of the pre-trained VLM for zero-shot HOI detection by leveraging it for both sub-tasks. We first employ CLIP as a single image encoder, reducing redundancy in the network architecture. In addition, we propose an instance selector and an HO pair decoder to effectively harmonize human/object detection and interaction prediction in a zero-shot manner. We evaluate our model under various settings on HICO-DET and on our two new test sets: an out-of-distribution image test set and a novel object test set. Our model outperforms state-of-the-art models while reducing model size by more than 50%, notably achieving a +10.01 mAP improvement under the unseen object setting on HICO-DET. The results on the proposed datasets highlight the zero-shot performance of our model in more challenging settings.
