CityWalker: Learning Embodied Urban Navigation from Web-Scale Videos

Xinhao Liu; Jintong Li; Yicheng Jiang; Niranjan Sujay; Zhicheng Yang; Juexiao Zhang; John Abanes; Jing Zhang; Chen Feng

CVPR 2025

Abstract

Navigating dynamic urban environments presents significant challenges for embodied agents, requiring advanced spatial reasoning and adherence to common-sense norms. Despite progress, existing visual navigation methods struggle in map-free or off-street settings, limiting the deployment of autonomous agents like last-mile delivery robots. To overcome these obstacles, we propose a scalable, data-driven approach for human-like urban navigation by training agents on thousands of hours of in-the-wild city walking and driving videos sourced from the web. We introduce a simple and scalable data processing pipeline that extracts action supervision from these videos, enabling large-scale imitation learning without costly annotations. Our model learns sophisticated navigation policies to handle diverse challenges and critical scenarios. Experimental results show that training on large-scale, diverse datasets significantly enhances navigation performance, surpassing current methods. This work shows the potential of using abundant online video data to develop robust navigation policies for embodied agents in dynamic urban settings.
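As a rough illustration of what "extracting action supervision" from video could look like, the sketch below turns a sequence of per-frame camera poses into future waypoints expressed in the current frame's local coordinates, which could then serve as imitation-learning targets. This is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes 2D poses `(x, y, yaw)` have already been estimated from the video by some off-the-shelf method (the abstract does not say which), and the function name `waypoints_in_local_frame` and the `horizon` parameter are hypothetical.

```python
import math
from typing import List, Tuple

# Assumed pose format: (x, y, yaw in radians) in an arbitrary world frame.
Pose = Tuple[float, float, float]


def waypoints_in_local_frame(
    poses: List[Pose], t: int, horizon: int
) -> List[Tuple[float, float]]:
    """Return the next `horizon` positions relative to the pose at index `t`.

    Each label is (forward, left): the offset of a future position measured
    along and perpendicular to the heading at time `t`.
    """
    if t < 0 or t + horizon >= len(poses):
        raise ValueError("not enough future poses for the requested horizon")

    x0, y0, yaw0 = poses[t]
    cos_y, sin_y = math.cos(yaw0), math.sin(yaw0)

    labels = []
    for k in range(1, horizon + 1):
        x, y, _ = poses[t + k]
        dx, dy = x - x0, y - y0
        # Rotate the world-frame offset by -yaw0 so that the first
        # coordinate points along the current heading.
        forward = cos_y * dx + sin_y * dy
        left = -sin_y * dx + cos_y * dy
        labels.append((forward, left))
    return labels
```

Calling `waypoints_in_local_frame(poses, t, horizon)` for every valid `t` in a clip would pair each video frame with a short trajectory target. Because monocular pose estimates from web videos generally have unknown scale, a real pipeline would likely need some form of normalization, which this sketch omits.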

Topics

Artificial Intelligence > Core AI > Agent Systems; Reinforcement Learning > Applications > Robotics; Machine Learning > Learning Types > Imitation Learning; Artificial Intelligence > Core AI > Robotics; Deep Learning > Learning Types > Reinforcement Learning

Keywords

imitation learning; visual navigation; embodied agent; spatial reasoning; embodied navigation; urban navigation; agent policy; action supervision
