Visual Relation Diffusion for Human-Object Interaction Detection

Ping Cao; Yepeng Tang; Chunjie Zhang; Xiaolong Zheng; Chao Liang; Yunchao Wei; Yao Zhao

ICCV 2025

Abstract

Human-object interaction (HOI) detection relies on fine-grained visual understanding to distinguish complex relationships between humans and objects. While recent generative diffusion models have demonstrated remarkable capability in learning detailed visual concepts through pixel-level generation, their potential for interaction-level relationship modeling remains largely unexplored. To bridge this gap, we propose a Visual Relation Diffusion model (VRDiff), which introduces dense visual relation conditions to guide interaction understanding. Specifically, we encode interaction-aware condition representations that capture both spatial responsiveness and contextual semantics of human-object pairs, conditioning the diffusion process purely on visual features rather than text-based inputs. Furthermore, we refine these relation representations through generative feedback from the diffusion model, enhancing HOI detection without requiring image synthesis. Extensive experiments on the HICO-DET benchmark demonstrate that VRDiff achieves competitive results under both standard and zero-shot HOI detection settings.

Topics

Machine Learning > Learning Types > Zero-Shot Learning
Deep Learning > Models > Diffusion Models
Computer Vision > Analysis > Object Detection
Computer Vision > Core AI > Computer Vision

Keywords

object detection; human-object interaction; diffusion model; human-object interaction detection; zero-shot detection; visual relationship; visual relation diffusion; generative feedback
