Tri-Modal Motion Retrieval by Learning a Joint Embedding Space

Kangning Yin; Shihao Zou; Yuxuan Ge; Zheng Tian

CVPR 2024

Abstract

Text-to-motion tasks have been the focus of recent advancements in the human motion domain. However, the performance of text-to-motion tasks has not reached its potential, primarily due to the lack of motion datasets and the pronounced gap between the text and motion modalities. To mitigate this challenge, we introduce VLMA, a novel Video-Language-Motion Alignment method. This approach leverages human-centric videos as an intermediary modality, effectively bridging the divide between text and motion. By employing contrastive learning, we construct a cohesive embedding space across the three modalities. Furthermore, we incorporate a motion reconstruction branch, ensuring that the resulting motion remains closely aligned with its original trajectory. Experimental evaluations on the HumanML3D and KIT-ML datasets demonstrate the superiority of our method in comparison to existing approaches. In addition, we introduce a novel task termed video-to-motion retrieval, designed to facilitate the seamless extraction of corresponding 3D motions from an RGB video. Supplementary experiments demonstrate that our model is extensible to real-world human-centric videos, offering a valuable complement to the pose estimation task.
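
As an illustration of the contrastive alignment described in the abstract, the following is a minimal sketch of how a joint embedding space over text, video, and motion could be trained and queried. It is not the authors' implementation: the symmetric InfoNCE objective, the choice of the three modality pairs, their equal weighting, the temperature value, and all function names (`symmetric_info_nce`, `tri_modal_loss`, `retrieve`) are assumptions made for this example. The motion reconstruction branch is not shown.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_info_nce(a, b, temperature=0.07):
    # a, b: (batch, dim) embeddings; row i of a is the positive match of row i of b.
    # The temperature value is an assumed default, not taken from the paper.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a -> b direction
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b -> a direction
    return 0.5 * (loss_ab + loss_ba)

def tri_modal_loss(text, video, motion, temperature=0.07):
    # Assumption: the three pairwise losses are summed with equal weight.
    return (symmetric_info_nce(text, motion, temperature)
            + symmetric_info_nce(video, motion, temperature)
            + symmetric_info_nce(text, video, temperature))

def retrieve(query, gallery, k=1):
    # Return the indices of the k gallery items most similar to each query.
    sims = l2_normalize(query) @ l2_normalize(gallery).T
    return np.argsort(-sims, axis=1)[:, :k]
```

Under these assumptions, text-to-motion and video-to-motion retrieval both reduce to a call such as `retrieve(query_embeddings, motion_embeddings, k=10)`, since all three modalities share one space after training.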

Authors

Kangning Yin, Shihao Zou, Yuxuan Ge, Zheng Tian

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Embedding Learning
Machine Learning > Learning Types > Contrastive Learning
Computer Vision > Processing > Video Understanding
Computer Vision > Analysis > Motion Analysis

Keywords

contrastive learning, video understanding, joint embedding, joint embedding space, motion retrieval, video-language-motion alignment
