What to Say and When to Say it: Live Fitness Coaching as a Testbed for Situated Interaction

Sunny Panchal; Apratim Bhattacharyya; Guillaume Berger; Antoine Mercier; Cornelius Böhm; Florian Dietrichkeit; Reza Pourreza; Xuanlin Li; Pulkit Madan; Mingu Lee; Mark Todorovich; Ingo Bax; Roland Memisevic

NeurIPS 2024

Abstract

Vision-language models have shown impressive progress in recent years. However, existing models are largely limited to turn-based interactions, where each turn must be stepped (i.e., prompted) by the user. Open-ended, asynchronous interactions, where an AI model may proactively deliver timely responses or feedback based on the unfolding situation in real time, remain an open challenge. In this work, we present the QEVD benchmark and dataset, which explore human-AI interaction in the challenging, yet controlled, real-world domain of fitness coaching – a task that intrinsically requires monitoring live user activity and providing immediate feedback. The benchmark requires vision-language models to recognize complex human actions, identify possible mistakes, and provide appropriate feedback in real time. Our experiments reveal the limitations of existing state-of-the-art vision-language models for such asynchronous situated interactions. Motivated by this, we propose a simple end-to-end streaming baseline that can respond asynchronously to human actions with appropriate feedback at the appropriate time.

Topics

Artificial Intelligence > Core AI > Human-AI Interaction

Artificial Intelligence > Core AI > Multimodal Learning

Computer Vision > Core AI > Multimodal Learning

Machine Learning > Learning Types > Multimodal Learning

Deep Learning > Models > Vision-Language Models

Keywords

action recognition, human action recognition, vision-language model, situated interaction, real-time feedback, asynchronous interaction, human-AI interaction, fitness coaching
