2024 ACL ACL 2024

Diving Deep into the Motion Representation of Video-Text Models

Abstract

AbstractVideos are more informative than images becausethey capture the dynamics of the scene.By representing motion in videos, we can capturedynamic activities. In this work, we introduceGPT-4 generated motion descriptions thatcapture fine-grained motion descriptions of activitiesand apply them to three action datasets.We evaluated several video-text models on thetask of retrieval of motion descriptions. Wefound that they fall far behind human expertperformance on two action datasets, raisingthe question of whether video-text models understandmotion in videos. To address it, weintroduce a method of improving motion understandingin video-text models by utilizingmotion descriptions. This method proves tobe effective on two action datasets for the motiondescription retrieval task. The results drawattention to the need for quality captions involvingfine-grained motion information in existingdatasets and demonstrate the effectiveness ofthe proposed pipeline in understanding finegrainedmotion during video-text retrieval.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision
🧭 Keyword Pioneer — motion description
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio