HuMoCon: Concept Discovery for Human Motion Understanding

Qihang Fang; Chengcheng Tang; Bugra Tekin; Shugao Ma; Yanchao Yang

CVPR 2025

Abstract

We present HuMoCon, a novel motion-video understanding framework designed for advanced human behavior analysis. The core of our method is a human motion concept discovery framework that efficiently trains multi-modal encoders to extract semantically meaningful and generalizable features. HuMoCon addresses key challenges in motion concept discovery for understanding and reasoning, including the lack of explicit multi-modality feature alignment and the loss of high-frequency information in masked autoencoding frameworks. Our approach integrates a feature alignment strategy that leverages video for contextual understanding and motion for fine-grained interaction modeling, together with a velocity reconstruction mechanism that enhances high-frequency feature expression and mitigates temporal over-smoothing. Comprehensive experiments on standard benchmarks demonstrate that HuMoCon enables effective motion concept discovery and significantly outperforms state-of-the-art methods in training large models for human motion understanding.
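To illustrate why a velocity target can counteract temporal over-smoothing, the following is a minimal NumPy sketch, not the authors' implementation. It assumes velocity is a first-order finite difference along the time axis and compares a position error with a velocity error for a prediction that drops a fast, low-amplitude component of the motion. The function names, the synthetic signal, and the relative-error measure are all hypothetical choices made for illustration.

```python
import numpy as np


def velocity(motion, dt=1.0):
    """Finite-difference velocity along the time axis (axis 0).

    Assumption: the abstract does not define velocity; a first-order
    difference is used here purely for illustration.
    """
    return np.diff(motion, axis=0) / dt


def relative_errors(pred, target, dt=1.0):
    """Return (position, velocity) squared errors, each normalised by
    the energy of the corresponding target signal."""
    pos_err = np.mean((pred - target) ** 2) / np.mean(target ** 2)
    v_pred, v_target = velocity(pred, dt), velocity(target, dt)
    vel_err = np.mean((v_pred - v_target) ** 2) / np.mean(v_target ** 2)
    return pos_err, vel_err


# Synthetic one-joint trajectory: a slow sweep plus a fast, small wiggle.
t = np.linspace(0.0, 1.0, 200)[:, None]
slow = np.sin(2 * np.pi * 1 * t)
fast = 0.1 * np.sin(2 * np.pi * 40 * t)
target = slow + fast

# A hypothetical over-smoothed reconstruction that keeps only the slow part.
smoothed = slow

rel_pos, rel_vel = relative_errors(smoothed, target, dt=t[1, 0] - t[0, 0])
print(f"relative position error: {rel_pos:.4f}")
print(f"relative velocity error: {rel_vel:.4f}")
```

In this toy setting the over-smoothed prediction looks nearly perfect under the position error, because the fast component carries little amplitude, while the velocity error is large, because differentiation amplifies high-frequency content. This matches the motivation stated in the abstract, under the assumptions above.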

Topics

Deep Learning > Techniques > Pretraining
Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Human Analysis
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Motion Analysis

Keywords

feature alignment; motion analysis; self-supervised learning; multi-modal learning; video understanding; human motion; masked autoencoder; motion understanding; human motion analysis
