Motions as Queries: One-Stage Multi-Person Holistic Human Motion Capture

Kenkun Liu; Yurong Fu; Weihao Yuan; Jing Lin; Peihao Li; Xiaodong Gu; Lingteng Qiu; Haoqian Wang; Zilong Dong; Xiaoguang Han

CVPR 2025

Abstract

Existing methods for capturing multi-person holistic human motion from monocular video usually integrate a detector, a tracker, and a human pose and shape estimator into a cascaded system. In contrast, we develop a one-stage multi-person holistic human motion capture system that 1) employs only one network, so it benefits significantly from end-to-end training on a large-scale dataset; 2) allows the tracking module to improve during training, rather than being limited by a pre-trained tracker; and 3) captures the motions of all individuals in a single shot, rather than tracking and estimating each person sequentially. In this system, each query within a temporal cross-attention module is responsible for the long-term motion of a specific individual, implicitly aggregating individual-specific information across the entire video. To further benefit from end-to-end training, we also construct a synthetic human video dataset with multi-person and whole-body annotations. Extensive experiments across different datasets demonstrate the efficacy and efficiency of both the proposed method and the dataset. The code for our method will be made publicly available.
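The idea of one query per person attending over every frame can be illustrated with a toy sketch. This is not the authors' implementation: the function name `temporal_cross_attention`, the identity key/value projections, and the 2-dimensional toy features are all assumptions made for illustration, and a real system would presumably use learned projections, multiple heads, and image backbone features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_cross_attention(person_queries, frame_tokens):
    """Hypothetical single-head cross-attention without learned projections.

    person_queries: list of N query vectors, one per tracked individual.
    frame_tokens:   list of T frames, each a list of token feature vectors.
    Each query attends over the tokens of ALL frames at once, so its output
    aggregates information about one individual across the whole clip.
    """
    dim = len(person_queries[0])
    # Flatten the time axis: keys/values are the tokens from every frame.
    tokens = [tok for frame in frame_tokens for tok in frame]
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in person_queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, tok)) for tok in tokens]
        attn = softmax(scores)
        out = [sum(a * tok[d] for a, tok in zip(attn, tokens)) for d in range(dim)]
        outputs.append(out)
        weights.append(attn)
    return outputs, weights


# Toy clip (assumed data): 3 frames, each with one token per person.
frame_tokens = [
    [[4.0, 0.0], [0.0, 4.0]],
    [[4.0, 0.0], [0.0, 4.0]],
    [[4.0, 0.0], [0.0, 4.0]],
]
person_queries = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = temporal_cross_attention(person_queries, frame_tokens)
```

In this toy setup each query puts most of its attention on the tokens that match its own individual in every frame, which is the sense in which a single query can carry one person's motion through the video without a separate tracking stage.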

Topics

Machine Learning > Core Methods > Representation Learning; Computer Vision > Analysis > Human Pose Estimation; Computer Vision > Analysis > Object Tracking; Computer Vision > Processing > Video Processing; Computer Vision > Analysis > Motion Analysis

Keywords

pose estimation; human motion capture; video understanding; human pose estimation; end-to-end training; multi-person tracking; query-based detection; query-based learning
