CVPR 2024

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

Abstract

Text-to-motion tasks have been the focus of recent advancements in the human motion domain. However, the performance of text-to-motion tasks has not reached its potential, primarily due to the lack of motion datasets and the pronounced gap between the text and motion modalities. To mitigate this challenge, we introduce VLMA, a novel Video-Language-Motion Alignment method. This approach leverages human-centric videos as an intermediary modality, effectively bridging the divide between text and motion. By employing contrastive learning, we construct a cohesive embedding space across the three modalities. Furthermore, we incorporate a motion reconstruction branch, ensuring that the resulting motion remains closely aligned with its original trajectory. Experimental evaluations on the HumanML3D and KIT-ML datasets demonstrate the superiority of our method in comparison to existing approaches. In addition, we introduce a novel task termed video-to-motion retrieval, designed to facilitate the seamless extraction of corresponding 3D motions from an RGB video. Supplementary experiments demonstrate that our model is extensible to real-world human-centric videos, offering a valuable complement to the pose estimation task.
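
As an illustration of the contrastive alignment described in the abstract, the following is a minimal NumPy sketch of one common way to align three modalities: a symmetric InfoNCE loss summed over the text-motion, text-video, and video-motion pairs, followed by nearest-neighbour retrieval in the shared space. The abstract does not specify the loss form, the temperature, or the pairing scheme, so all of these are assumptions. The function names (`info_nce`, `tri_modal_loss`, `retrieve`) are hypothetical, and the motion reconstruction branch is omitted.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair,
    # every other row in the batch acts as a negative.
    # The temperature value is an assumption, not taken from the paper.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(text, video, motion, temperature=0.07):
    # Assumed pairing: sum the pairwise losses over all three modality pairs.
    return (info_nce(text, motion, temperature)
            + info_nce(text, video, temperature)
            + info_nce(video, motion, temperature))

def retrieve(query, gallery, k=1):
    # Indices of the k gallery embeddings most similar to each query,
    # e.g. video embeddings as queries against a gallery of motion embeddings.
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :k]
```

In this sketch, video-to-motion retrieval amounts to calling `retrieve(video_embeddings, motion_embeddings, k)` once the encoders (not shown) have mapped both modalities into the shared space.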
