RSS 2021

INVIGORATE: Interactive Visual Grounding and Grasping in Clutter

Abstract

This paper presents INVIGORATE, a robot system that interacts with humans through natural language and grasps a specified object in clutter. The objects may occlude, obstruct, or even stack on top of one another. INVIGORATE embodies several challenges: (i) infer the target object among other occluding objects, from input language expressions and RGB images; (ii) infer object blocking relationships (OBRs) from the images; and (iii) synthesize a multi-step plan to ask questions that disambiguate the target object and to grasp it successfully. We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping. They allow for unrestricted object categories and language expressions, subject to the training datasets. However, errors in visual perception and ambiguity in human languages are inevitable and negatively impact the robot’s performance. To overcome these uncertainties, we build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules. Through approximate POMDP planning, the robot tracks the history of observations and asks disambiguation questions in order to achieve a near-optimal sequence of actions that identify and grasp the target object. INVIGORATE combines the benefits of model-based POMDP planning and data-driven deep learning. Preliminary experiments with INVIGORATE on a Fetch robot show significant benefits of this integrated approach to object grasping in clutter with natural language interactions. A demonstration video is available online: https://youtu.be/zYakh80SGcU.
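
The idea of tracking observations and choosing between asking and grasping can be illustrated with a small sketch. This is not the paper's POMDP model or its approximate planner: it is a greedy threshold policy over a discrete belief, used here only as a stand-in. All names (`update_belief`, `answer_likelihoods`, `choose_action`), the parameters `p_correct` and `grasp_threshold`, and every number below are assumptions made for illustration.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("belief has no probability mass")
    return [w / total for w in weights]


def update_belief(belief, likelihoods):
    """Bayes rule: posterior is proportional to prior times likelihood."""
    return normalize([b * l for b, l in zip(belief, likelihoods)])


def answer_likelihoods(n, asked, answer_yes, p_correct=0.9):
    """Likelihood of a yes/no reply to 'Do you mean object `asked`?'
    under each hypothesis 'object i is the target'.
    p_correct is an assumed probability that the reply is reliable."""
    out = []
    for i in range(n):
        truth_is_yes = (i == asked)
        out.append(p_correct if truth_is_yes == answer_yes else 1.0 - p_correct)
    return out


def choose_action(belief, blockers, grasp_threshold=0.8):
    """Greedy stand-in for planning (an assumption, not the paper's planner).
    blockers[i] lists objects that must be removed before object i."""
    best = max(range(len(belief)), key=lambda i: belief[i])
    if belief[best] < grasp_threshold:
        return ("ask", best)                 # still ambiguous: ask a question
    if blockers.get(best):
        return ("remove", blockers[best][0])  # clear a blocking object first
    return ("grasp", best)


# Three detected objects; hypothetical grounding scores for the expression.
belief = update_belief([1 / 3, 1 / 3, 1 / 3], [0.5, 0.4, 0.1])
first_action = choose_action(belief, {0: [2]})   # object 2 blocks object 0

# The human answers "yes" to a question about object 0.
belief = update_belief(belief, answer_likelihoods(3, 0, answer_yes=True))
second_action = choose_action(belief, {0: [2]})
```

In this toy run the initial belief is too flat to act on, so the policy asks about object 0. After a "yes" reply the belief concentrates on object 0, and the policy removes the blocking object before grasping. A POMDP planner would instead weigh the cost of a question against the risk of a wrong grasp over a multi-step horizon.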
