Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?

Yuwei Bao; Keunwoo Yu; Yichi Zhang; Shane Storks; Itamar Bar-Yossef; Alex de la Iglesia; Megan Su; Xiao Zheng; Joyce Chai

EMNLP 2023

Abstract

Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need a sophisticated understanding of both the user and the environment, and must make timely, accurate decisions about when to speak and what to say. To address this challenge, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG), based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study the extent to which these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can achieve fair performance in some cases with no task-specific training, but fast and reliable adaptation remains a significant challenge. Our benchmark and baselines provide a stepping stone for future work on situated task guidance.

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Human-AI Interaction
Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Applications > Question Answering

Keywords

multimodal learning, interactive systems, foundation model, visual understanding, human-AI interaction, task guidance, personalized guidance

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023