2025 ICCV ICCV 2025

Denoising Token Prediction in Masked Autoregressive Models

Abstract

Autoregressive models are just at a tipping point where they could really take off for visual generation. In this paper, we propose to model token prediction using diffusion procedure particularly in masked autoregressive models for image generation. We look into the problem from two critical perspectives: progressively refining the unmasked tokens prediction via a denoising head with the autoregressive model, and representing masked tokens probability distribution by capitalizing on the interdependency across masked and unmasked tokens through a diffusion head. Our proposal harbors an innate agency that remains advantageous in the speed of sequence prediction, and strongly favors high capability in generating quality samples by leveraging the principles of denoising diffusion process. Extensive experiments on both class-conditional and text-to-image tasks demonstrate its superiority, achieving the state-of-the-art FID score of 1.47 and 5.27 on ImageNet and MSCOCO datasets, respectively. More remarkably, our approach leads to 45% speedup in the inference time of image generation against the diffusion models such as DiT-XL/2.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio