TANGO: Training-free Embodied AI Agents for Open-world Tasks

Filippo Ziliotto; Tommaso Campari; Luciano Serafini; Lamberto Ballan

CVPR 2025

Abstract

Large Language Models (LLMs) have demonstrated excellent capabilities in composing various modules together to create programs that can perform complex reasoning tasks on images. In this paper, we propose TANGO, an approach that extends the program composition via LLMs already observed for images, aiming to integrate those capabilities into embodied agents capable of observing and acting in the world. Specifically, by employing a simple PointGoal Navigation model combined with a memory-based exploration policy as a foundational primitive for guiding an agent through the world, we show how a single model can address diverse tasks without additional training. We task an LLM with composing the provided primitives to solve a specific task, using only a few in-context examples in the prompt. We evaluate our approach on three key Embodied AI tasks: Open-Set ObjectGoal Navigation, Multi-Modal Lifelong Navigation, and Open Embodied Question Answering, achieving state-of-the-art results without any specific fine-tuning in challenging zero-shot scenarios.
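The program-composition idea described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the primitive names (`explore`, `navigate_to`), the dictionary-based state, and the list-of-steps program format are hypothetical and are not taken from the TANGO implementation. In the paper's setting an LLM would produce the program from a few in-context examples; here the program is hard-coded so the example is self-contained.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical primitives: toy stand-ins, not TANGO's actual modules.
def explore(state: dict, target: str) -> dict:
    """Toy exploration: look the target up in a fake world map and remember it."""
    position = state["world"].get(target)
    if position is not None:
        state["memory"][target] = position
    return state

def navigate_to(state: dict, target: str) -> dict:
    """Toy PointGoal navigation: move the agent to a remembered position."""
    if target not in state["memory"]:
        raise KeyError(f"{target!r} has not been found yet")
    state["agent"] = state["memory"][target]
    return state

# Registry mapping primitive names to callables (assumed design).
PRIMITIVES: Dict[str, Callable[[dict, str], dict]] = {
    "explore": explore,
    "navigate_to": navigate_to,
}

def run_program(program: List[Tuple[str, str]], state: dict) -> dict:
    """Execute a composed program one (primitive, argument) step at a time."""
    for name, arg in program:
        state = PRIMITIVES[name](state, arg)
    return state

# Hard-coded here; an LLM would be prompted to emit a sequence like this.
program = [("explore", "chair"), ("navigate_to", "chair")]
state = {"world": {"chair": (3, 4)}, "memory": {}, "agent": (0, 0)}
final = run_program(program, state)
print(final["agent"])  # prints (3, 4)
```

The point of the sketch is only the separation of concerns: a fixed set of reusable primitives, and a task-specific program that sequences them without any retraining of the primitives themselves.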

Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Planning
Artificial Intelligence > Core AI > Large Language Models
Natural Language Processing > Resources & Methods > Large Language Models
Robotics > Capabilities > Navigation

Keywords

zero-shot learning, in-context learning, visual navigation, embodied AI, agent system, PointGoal navigation, large language model, embodied artificial intelligence, program composition
