ATM: Enhanced Alignment for Text-to-Motion Generation

Ke Han; Yueming LYU; Weichen Yu; Nicu Sebe

2026 WACV WACV 2026

ATM: Enhanced Alignment for Text-to-Motion Generation

Abstract

Existing text-to-motion (T2M) generation methods primarily rely on regression-based objectives, such as minimizing positional errors. However, they lack effective semantic supervision and correction mechanisms, often leading to substantial misalignment between text and motion. To address this, we propose Aligned Text-to-Motion (ATM), a semantics-aware generation framework that automatically identifies and corrects text-motion misalignment. ATM incorporates two key components: (1) Inter-motion alignment, which detects semantic contradictions across motions and applies adaptive corrections based on the degree of semantic discrepancy, flexibly handing diverse misalignments and ensuring global text-motion consistency; (2) Intra-motion alignment, which refines locally missing or inaccurate motion semantics in an unsupervised manner by inferring semantic proxies, effectively addressing the absence of localized textual annotations. ATM is model-agnostic and can be seamlessly integrated into various T2M methods as a plug-and-play module. Extensive experiments on HumanML3D and KIT demonstrate that ATM consistently improves both generation quality and text-motion alignment. Code is available at https://github.com/ke-han-aca/ATM.git.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — inter-motion alignment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio