UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Yiheng Li; Ruibing Hou; Hong Chang; Shiguang Shan; Xilin Chen

2025 CVPR CVPR 2025

UniPose: A Unified Multimodal Framework for Human Pose Comprehension, Generation and Editing

Abstract

Human pose plays a crucial role in the digital age. While recent works have achieved impressive progress in understanding and generating human poses, they often support only a single modality of control signals and operate in isolation, limiting their application in real-world scenarios. This paper presents UniPose, a framework employing Large Language Models (LLMs) to comprehend, generate, and edit human poses across various modalities, including images, text, and 3D SMPL poses. Specifically, we apply a pose tokenizer to convert 3D poses into discrete pose tokens, enabling seamless integration into the LLM within a unified vocabulary. To further enhance the fine-grained pose perception capabilities, we facilitate UniPose with a mixture of visual encoders, among them a pose-specific visual encoder. Benefiting from a unified learning strategy, UniPose effectively transfers knowledge across different pose-relevant tasks, adapts to unseen tasks, and exhibits extended capabilities. This work serves as the first attempt at building a general-purpose framework for pose comprehension, generation, and editing. Extensive experiments highlight UniPose's competitive and even superior performance across various pose-relevant tasks. The code is available at https://github.com/liyiheng23/UniPose.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — pose tokenization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yiheng Li , Ruibing Hou , Hong Chang , Shiguang Shan , Xilin Chen

Topics

Artificial Intelligence > Core AI > Foundation Models Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Computer Vision > Analysis > Human Pose Estimation Deep Learning > Models > Large Language Models

Keywords

image generation transfer learning multimodal learning human pose estimation visual encoder large language model pose generation pose tokenization pose comprehension

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025