RoboTron-Nav: A Unified Framework for Embodied Navigation Integrating Perception, Planning, and Prediction

Yufeng Zhong; Chengjian Feng; Feng Yan; Fanfan Liu; Liming Zheng; Lin Ma

ICCV 2025

Abstract

In language-guided visual navigation, agents locate target objects in unseen environments using natural language instructions. For reliable navigation in unfamiliar scenes, agents should possess strong perception, planning, and prediction capabilities. Additionally, when agents revisit previously explored areas during long-term navigation, they may retain irrelevant and redundant historical perceptions, leading to suboptimal results. In this work, we propose RoboTron-Nav, a unified framework that integrates perception, planning, and prediction capabilities through multitask collaboration on navigation and embodied question answering tasks, thereby enhancing navigation performance. Furthermore, RoboTron-Nav employs an adaptive 3D-aware history sampling strategy to utilize historical observations effectively and efficiently. By leveraging a large language model, RoboTron-Nav comprehends diverse commands and complex visual scenes, resulting in appropriate navigation actions. RoboTron-Nav achieves an 81.1% success rate in object goal navigation on the CHORES-$\mathbb{S}$ benchmark, setting a new state of the art.

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Multi-Agent Systems
Artificial Intelligence > Core AI > Planning
Robotics > Capabilities > Navigation
Artificial Intelligence > Core AI > Robotics

Keywords

motion planning; multi-task learning; object goal navigation; visual navigation; embodied navigation; language-guided navigation; perception planning prediction; adaptive 3D-aware history sampling; large language model
