2025 EMNLP EMNLP 2025

From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems

Abstract

AbstractFoundation models (FMs) are increasingly applied to bridge language and action in embodied agents, yet the operational characteristics of different integration strategies remain under-explored—especially for complex instruction following and versatile action generation in changing environments. We investigate three paradigms for robotic systems: end-to-end vision-language-action models (VLAs) that implicitly unify perception and planning, and modular pipelines using either vision-language models (VLMs) or multimodal large language models (MLLMs). Two case studies frame the comparison: instruction grounding, which probs fine-grained language understanding and cross-modal disambiguation; and object manipulation, which targets skill transfer via VLA finetuning. Our experiments reveal trade-offs in system scale, generalization and data efficiency. These findings indicate design lessons for language-driven physical agents and point to challenges and opportunities for FM-powered robotics in real-world conditions.

🧭 Keyword Pioneer — embodied robotics
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio