ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Yun Chang; Leonor Fermoselle; Duy Ta; Bernadette Bucher; Luca Carlone; Jiuguang Wang

CVPR 2025

Abstract

While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks (a process called hierarchical task analysis) is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, which generates the task breakdown, with task-driven scene graph construction, which generates a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks and is additionally able to achieve grounding performance comparable to state-of-the-art methods.
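The alternation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the LLM call is replaced by a fixed lookup table, the 3D scene graph by a flat dictionary of object labels, and every name (`TaskNode`, `propose_subtasks`, `select_objects`, `build_hierarchy`) is hypothetical. It only shows the general shape of the loop, in which the task breakdown depends on the scene and the retained scene subset depends on the task breakdown.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    """One node of the task hierarchy, optionally grounded to scene objects."""
    description: str
    grounded_ids: list = field(default_factory=list)
    subtasks: list = field(default_factory=list)


def propose_subtasks(task, labels):
    """Stand-in for the LLM step (assumption): a fixed lookup table."""
    playbook = {
        "make coffee": ["fill kettle", "get mug", "use espresso machine"],
    }
    candidates = playbook.get(task, [])
    # Keep only subtasks that mention a label present in the scene,
    # which makes the breakdown environment-dependent.
    return [s for s in candidates if any(lbl in s for lbl in labels)]


def select_objects(subtasks, scene):
    """Stand-in for task-driven scene construction: keep task-relevant objects."""
    return {oid: lbl for oid, lbl in scene.items()
            if any(lbl in s for s in subtasks)}


def build_hierarchy(task, scene, max_iters=5):
    """Alternate the two steps until neither output changes."""
    relevant = dict(scene)  # start from the full scene
    subtasks = []
    for _ in range(max_iters):
        new_subtasks = propose_subtasks(task, set(relevant.values()))
        new_relevant = select_objects(new_subtasks, scene)
        if new_subtasks == subtasks and new_relevant == relevant:
            break  # fixed point reached
        subtasks, relevant = new_subtasks, new_relevant
    root = TaskNode(task)
    for s in subtasks:
        ids = [oid for oid, lbl in relevant.items() if lbl in s]
        root.subtasks.append(TaskNode(s, grounded_ids=ids))
    return root


if __name__ == "__main__":
    scene = {"o1": "kettle", "o2": "mug", "o3": "sofa"}
    for sub in build_hierarchy("make coffee", scene).subtasks:
        print(sub.description, sub.grounded_ids)
```

In this toy scene the espresso-machine subtask is dropped because no such object exists, and the sofa is dropped because no subtask refers to it.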

Topics

Computer Vision > Analysis > 3D Vision; Computer Vision > Analysis > Scene Understanding; Knowledge & Reasoning > Reasoning > Automated Planning; Robotics > Capabilities > Manipulation; Robotics > Capabilities > Perception; Artificial Intelligence > Core AI > Large Language Models

Keywords

scene understanding; 3D scene understanding; scene graph; language grounding; task planning; task decomposition; 3D scene graph; large language model; hierarchical task analysis
