MOGO: Residual Quantized Hierarchical Causal Transformer for Real-Time and Infinite-Length 3D Human Motion Generation

Dongjie Fu; Tengjiao Sun; Pengcheng Fang; Xiaohao Cai; Hansung Kim

AAAI 2026

Abstract

Recent advances in transformer-based text-to-motion generation have significantly improved motion quality. However, achieving both real-time performance and long-horizon scalability remains an open challenge. In this paper, we present MOGO (Motion Generation with One-pass), a novel autoregressive framework for efficient and scalable 3D human motion generation. MOGO consists of two key components. First, we introduce MoSA-VQ, a motion scale-adaptive residual vector quantization module that hierarchically discretizes motion sequences through learnable scaling parameters, enabling dynamic allocation of representation capacity and producing compact yet expressive multi-level representations. Second, we design the RQHC-Transformer, a residual quantized hierarchical causal transformer that decodes motion tokens in a single forward pass. Each transformer block aligns with one quantization level, allowing hierarchical abstraction and temporally coherent generation with strong semantic flow. Compared to diffusion- and LLM-based approaches, MOGO achieves lower inference latency while preserving high motion fidelity. Moreover, its hierarchical latent design enables seamless and controllable infinite-length motion generation, with stable transitions and the ability to adaptively incorporate updated control signals at arbitrary points in time. To further enhance generalization and interpretability, we introduce Textual Condition Alignment (TCA), which leverages large language models with Chain-of-Thought reasoning to bridge the gap between real-world prompts and training data. TCA not only improves zero-shot performance on unseen datasets but also enriches motion comprehension for in-distribution prompts through explicit intent decomposition. Extensive experiments on HumanML3D, KIT-ML, and the unseen CMP dataset demonstrate that MOGO outperforms prior methods in generation quality, inference efficiency, and temporal scalability.
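To make the idea of multi-level residual quantization concrete, the following is a minimal sketch of generic residual vector quantization with a per-level scale. It is not the authors' MoSA-VQ implementation: the abstract does not specify the module's details, so the function names, the fixed scalar `scales` (standing in for learnable scaling parameters), and the random codebooks (standing in for learned ones) are all assumptions made for illustration.

```python
import random


def sqdist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(codebook, vec):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i], vec))


def residual_quantize(vec, codebooks, scales):
    """Quantize vec level by level; each level encodes what earlier levels left over.

    scales[l] is a hypothetical per-level scalar, used here in place of a
    learnable scaling parameter.
    """
    residual = list(vec)
    indices = []
    for codebook, s in zip(codebooks, scales):
        # Match the residual against the codebook in the level's own scale.
        idx = nearest(codebook, [r / s for r in residual])
        indices.append(idx)
        # Subtract the rescaled code so the next level sees the remaining error.
        residual = [r - s * c for r, c in zip(residual, codebook[idx])]
    return indices


def decode(indices, codebooks, scales):
    """Sum the rescaled codes of the given levels to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook, s in zip(indices, codebooks, scales):
        out = [o + s * c for o, c in zip(out, codebook[idx])]
    return out


# Toy setup (all values are illustrative assumptions, not from the paper).
rng = random.Random(0)
DIM, LEVELS, SIZE = 4, 3, 16
# Each codebook includes the zero vector, so adding a level can never
# increase the reconstruction error.
codebooks = [
    [[0.0] * DIM] + [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(SIZE - 1)]
    for _ in range(LEVELS)
]
scales = [1.0, 0.5, 0.25]  # coarse-to-fine, fixed here instead of learned
vec = [0.3, -0.7, 0.1, 0.9]

indices = residual_quantize(vec, codebooks, scales)
# Reconstruction error using only the first k levels, for k = 0..LEVELS.
errors = [sqdist(vec, decode(indices[:k], codebooks, scales)) for k in range(LEVELS + 1)]
print(indices, errors)
```

In this sketch each level produces one token per input vector, which mirrors the abstract's description of one quantization level per transformer block; how MOGO actually couples the two is not stated in the abstract.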

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Core Methods > Representation Learning

Keywords

chain-of-thought reasoning, vector quantization, human motion generation, infinite-length generation
