WACV 2025

Optimizing Vision-Language Model for Road Crossing Intention Estimation

Abstract

Identifying a pedestrian's intention to cross the road is crucial for autonomous driving, as it alerts the system to stop or slow down. However, determining crossing intention from video is challenging because it requires extracting complex high-level semantics. This paper introduces ClipCross, a novel classification framework optimized to extract high-level semantic features using the vision-language model CLIP for determining crossing intention. Existing CLIP-based methods perform poorly on this task because CLIP's image and text encoders fail to capture the nuanced semantic distinctions between crossing and non-crossing intention images. ClipCross addresses this by optimizing a set of CLIP text embeddings to extract high-level semantic features, which a multi-layer perceptron then uses to distinguish between crossing and non-crossing intentions. ClipCross achieves state-of-the-art performance on the crossing intention estimation benchmark datasets PIE, PSI, and JAAD.
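
The pipeline described in the abstract (learned text embeddings that produce semantic features, followed by a multi-layer perceptron) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation: the abstract does not specify how features are computed, how video frames are handled, or how training proceeds. The sketch assumes the semantic features are cosine similarities between an image embedding and each learned text embedding; the dimensions, the random placeholder standing in for a CLIP image-encoder output, and all function names are hypothetical.

```python
import math
import random

random.seed(0)

EMBED_DIM = 8      # hypothetical stand-in for CLIP's joint embedding size
NUM_PROMPTS = 4    # hypothetical number of learnable text embeddings
HIDDEN = 6         # hypothetical MLP hidden width


def rand_vec(n):
    """Random vector used here only as a placeholder for learned values."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def linear(x, weights, bias):
    """Dense layer; `weights` holds one row per output unit."""
    return [dot(row, x) + b for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


# Assumption: these embeddings would be optimized by gradient descent
# during training; here they are random and fixed.
text_embeddings = [normalize(rand_vec(EMBED_DIM)) for _ in range(NUM_PROMPTS)]

# MLP parameters (also random placeholders for trained weights).
w1 = [rand_vec(NUM_PROMPTS) for _ in range(HIDDEN)]
b1 = [0.0] * HIDDEN
w2 = [rand_vec(HIDDEN) for _ in range(2)]
b2 = [0.0, 0.0]


def semantic_features(image_embedding):
    """Assumed feature: cosine similarity to each learned text embedding."""
    img = normalize(image_embedding)
    return [dot(img, t) for t in text_embeddings]


def classify(image_embedding):
    """Return [p(non-crossing), p(crossing)] from a one-hidden-layer MLP."""
    feats = semantic_features(image_embedding)
    hidden = [max(0.0, h) for h in linear(feats, w1, b1)]  # ReLU
    return softmax(linear(hidden, w2, b2))


# Placeholder for the output of a CLIP image encoder on one frame.
image_embedding = rand_vec(EMBED_DIM)
features = semantic_features(image_embedding)
probs = classify(image_embedding)
print("semantic features:", features)
print("class probabilities:", probs)
```

In this sketch the classifier never sees the raw image embedding, only its similarities to the text embeddings, so the learned embeddings alone determine which semantic cues reach the multi-layer perceptron.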
