ExpertAF: Expert Actionable Feedback from Video

Kumar Ashutosh; Tushar Nagarajan; Georgios Pavlakos; Kris Kitani; Kristen Grauman

CVPR 2025

Abstract

Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback (AF) from video of a person doing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly-supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method reasons across multimodal input combinations to output full-spectrum, actionable coaching (expert commentary, expert video retrieval, and expert pose generation), outperforming strong vision-language models on both established metrics and human preference studies.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Natural Language Processing > Generation > Text Generation

Keywords

weakly-supervised learning, 3D body pose, actionable feedback, multimodal video-language model, expert commentary
