From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces

Peter Shaw; Mandar Joshi; James Cohan; Jonathan Berant; Panupong Pasupat; Hexiang Hu; Urvashi Khandelwal; Kenton Lee; Kristina N Toutanova

NeurIPS 2023

Abstract

Much of the previous work towards digital agents for graphical user interfaces (GUIs) has relied on text-based representations (derived from HTML or other structured data sources), which are not always readily available. These input representations have often been coupled with custom, task-specific action spaces. This paper focuses on creating agents that interact with the digital world using the same conceptual interface that humans commonly use: pixel-based screenshots as input and a generic action space corresponding to keyboard and mouse actions. Building upon recent progress in pixel-based pretraining, we show, for the first time, that it is possible for such agents to outperform human crowdworkers on the MiniWob++ benchmark of GUI-based instruction following tasks.
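
To make the interface described in the abstract concrete, here is a minimal sketch of an agent loop that observes only pixels and emits generic mouse and keyboard actions. Everything in it is an illustrative assumption rather than the paper's implementation: the `Click` and `KeyPress` types, the textual action format handled by `parse_action`, the `ToyEnv` environment, and the hand-written `find_red_policy` (a stand-in for a learned model) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union

# Hypothetical types for illustration; not the paper's actual API.
Pixel = Tuple[int, int, int]
Screenshot = List[List[Pixel]]  # rows of RGB pixels


@dataclass(frozen=True)
class Click:
    x: int
    y: int


@dataclass(frozen=True)
class KeyPress:
    key: str


Action = Union[Click, KeyPress]


def parse_action(text: str) -> Action:
    """Parse an assumed textual action format: 'click X Y' or 'key NAME'."""
    parts = text.split()
    if len(parts) == 3 and parts[0] == "click":
        return Click(int(parts[1]), int(parts[2]))
    if len(parts) == 2 and parts[0] == "key":
        return KeyPress(parts[1])
    raise ValueError(f"unrecognized action: {text!r}")


class ToyEnv:
    """Hypothetical stand-in environment: reward 1.0 for clicking the red pixel."""

    def __init__(self, width: int = 4, height: int = 4,
                 target: Tuple[int, int] = (2, 1)):
        self.width, self.height, self.target = width, height, target

    def _render(self) -> Screenshot:
        # White screen with a single red target pixel.
        return [[(255, 0, 0) if (x, y) == self.target else (255, 255, 255)
                 for x in range(self.width)]
                for y in range(self.height)]

    def reset(self) -> Screenshot:
        return self._render()

    def step(self, action: Action) -> Tuple[Screenshot, float, bool]:
        hit = isinstance(action, Click) and (action.x, action.y) == self.target
        return self._render(), (1.0 if hit else 0.0), hit


def find_red_policy(screenshot: Screenshot) -> str:
    """Hand-written stand-in for a learned policy: click the first red pixel."""
    for y, row in enumerate(screenshot):
        for x, pixel in enumerate(row):
            if pixel == (255, 0, 0):
                return f"click {x} {y}"
    return "key Enter"


def run_episode(policy: Callable[[Screenshot], str], env: ToyEnv,
                max_steps: int = 10) -> float:
    """Pixels in, generic actions out: the agent never sees HTML or a DOM."""
    screenshot = env.reset()
    for _ in range(max_steps):
        action = parse_action(policy(screenshot))
        screenshot, reward, done = env.step(action)
        if done:
            return reward
    return 0.0
```

Calling `run_episode(find_red_policy, ToyEnv())` runs one episode end to end. The point of the sketch is that the policy's only input is the screenshot and its only output is a task-independent action, so the same loop could in principle drive any GUI task.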

Topics

Artificial Intelligence > Core AI > Agent Systems
Computer Vision > Processing > Video Understanding
Computer Science > Applications > Software Engineering
Artificial Intelligence > Core AI > Multi-Modal Learning
Computer Vision > Domain-Specific > Robotics

Keywords

imitation learning; multi-modal learning; instruction following; digital agent; graphical user interface; pixel-based learning; pixel-based pretraining
