CVPR 2025

Motions as Queries: One-Stage Multi-Person Holistic Human Motion Capture

Abstract

Existing methods for capturing multi-person holistic human motion from a monocular video usually integrate a detector, a tracker, and a human pose and shape estimator into a cascaded system. In contrast, we develop a one-stage multi-person holistic human motion capture system that 1) employs only one network, which benefits significantly from end-to-end training on a large-scale dataset; 2) allows the tracking module to improve during training, rather than being limited by a pre-trained tracker; and 3) captures the motions of all individuals in a single shot, rather than tracking and estimating each person sequentially. In this system, each query within a temporal cross-attention module is responsible for the long motion of a specific individual, implicitly aggregating individual-specific information throughout the entire video. To further benefit from end-to-end training, we also construct a synthetic human video dataset with multi-person and whole-body annotations. Extensive experiments across different datasets demonstrate the efficacy and efficiency of both the proposed method and the dataset. The code of our method will be made publicly available.
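
The idea of one query per individual attending over the whole video can be illustrated with a small sketch. This is not the paper's implementation. It assumes a single attention head, no learned projections, and toy 2-D features, and the names `temporal_cross_attention`, `queries`, and `video_tokens` are hypothetical. It shows only how each query could aggregate tokens from every frame in one pass.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def temporal_cross_attention(queries, video_tokens):
    """Single-head cross-attention over all frames at once (illustrative only).

    queries:      N vectors of dimension D, one per individual (assumed).
    video_tokens: T frames, each a list of K feature vectors of dimension D.
    Returns (outputs, weights): N aggregated vectors and N weight lists
    of length T * K.
    """
    # Flatten the time axis so every query sees tokens from every frame.
    flat = [token for frame in video_tokens for token in frame]
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)

    outputs, weights = [], []
    for query in queries:
        w = softmax([dot(query, token) * scale for token in flat])
        aggregated = [
            sum(wi * token[d] for wi, token in zip(w, flat))
            for d in range(dim)
        ]
        outputs.append(aggregated)
        weights.append(w)
    return outputs, weights


# Toy data: two "people" whose features lie near the x-axis and y-axis.
queries = [[4.0, 0.0], [0.0, 4.0]]
video_tokens = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0
    [[0.9, 0.1], [0.1, 0.9]],  # frame 1
    [[1.0, 0.1], [0.0, 1.0]],  # frame 2
]
motion_features, attn = temporal_cross_attention(queries, video_tokens)
```

In this toy setup, each query places most of its attention weight on the tokens of "its" person in every frame, so the aggregated feature is dominated by that individual without an explicit tracker. An actual system would use learned projections, multiple heads, and decoding heads for pose and shape, none of which are specified in the abstract.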
