Human Motion Synthesis in 3D Scenes via Unified Scene Semantic Occupancy

Jingyu Gong; Kunkun Tong; Zhuoran Chen; Chuanhan Yuan; Mingang Chen; Zhizhong Zhang; Xin Tan; Yuan Xie

AAAI 2026

Abstract

Human motion synthesis in 3D scenes relies heavily on scene comprehension, yet current methods focus mainly on scene structure and ignore semantic understanding. In this paper, we propose SSOMotion, a human motion synthesis framework that takes a unified Scene Semantic Occupancy (SSO) as its scene representation. We design a bi-directional tri-plane decomposition to derive a compact version of the SSO, and scene semantics are mapped to a unified feature space via CLIP encoding and shared linear dimensionality reduction. This strategy captures fine-grained scene semantic structures while significantly reducing redundant computation. We further use these scene hints, together with the movement direction derived from instructions, for motion control via frame-wise scene query. Extensive experiments and ablation studies on cluttered scenes built from ShapeNet furniture, as well as on scanned scenes from the PROX and Replica datasets, demonstrate the framework's cutting-edge performance and validate its effectiveness and generalization ability.
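The abstract does not give implementation details, so the following is only a rough sketch of the general idea of projecting a semantic occupancy grid onto three axis-aligned planes and querying it at a single location. Everything in it is an assumption made for illustration: the function names `triplane_decompose` and `query_planes`, the grid layout, the mean pooling along each axis, and the random matrices that stand in for CLIP class embeddings and for the shared linear reduction. It does not model the "bi-directional" aspect of the decomposition and should not be read as the authors' method.

```python
import numpy as np

def triplane_decompose(sso, class_feats):
    """Collapse a semantic occupancy grid onto three feature planes.

    sso:         (X, Y, Z) integer grid, 0 = empty, k > 0 = class id (assumed layout).
    class_feats: (K + 1, D) per-class features, row 0 reserved for "empty".
    Returns [xy, xz, yz] planes of shape (X, Y, D), (X, Z, D), (Y, Z, D).
    Each cell holds the mean feature of the occupied voxels along the
    collapsed axis (mean pooling is an illustrative choice).
    """
    feats = class_feats[sso]                      # (X, Y, Z, D)
    occ = (sso > 0)[..., None].astype(feats.dtype)  # (X, Y, Z, 1)
    planes = []
    for axis in (2, 1, 0):                        # collapse z, then y, then x
        summed = (feats * occ).sum(axis=axis)
        count = np.maximum(occ.sum(axis=axis), 1.0)  # avoid division by zero
        planes.append(summed / count)
    return planes

def query_planes(planes, idx):
    """Gather the three plane features at voxel index (x, y, z) and concatenate."""
    x, y, z = idx
    xy, xz, yz = planes
    return np.concatenate([xy[x, y], xz[x, z], yz[y, z]])

# Stand-ins: random vectors instead of real CLIP text embeddings, and a
# random matrix instead of a learned shared linear projection.
rng = np.random.default_rng(0)
clip_like = rng.normal(size=(4, 512))   # 3 hypothetical classes + "empty"
clip_like[0] = 0.0                      # "empty" carries no semantics
W = rng.normal(size=(512, 16))          # shared reduction 512 -> 16
class_feats = clip_like @ W

sso = np.zeros((8, 8, 4), dtype=np.int64)
sso[2, 3, 1] = 2                        # one occupied voxel of class 2
planes = triplane_decompose(sso, class_feats)
scene_hint = query_planes(planes, (2, 3, 1))  # one query, e.g. per motion frame
```

In this sketch the three planes store O(XY + XZ + YZ) feature vectors rather than O(XYZ), which is one way a plane-based representation can reduce redundant computation; a per-frame query then only needs three lookups.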

Topics

Machine Learning > Core Methods > Representation Learning; Computer Vision > Analysis > 3D Vision; Computer Vision > Analysis > Human Analysis

Keywords

scene understanding; human motion synthesis; semantic occupancy; 3D scene; CLIP encoding
