Taming Teacher Forcing for Masked Autoregressive Video Generation

Deyu Zhou; Quan Sun; Yuang Peng; Kun Yan; Runpei Dong; Duomin Wang; Zheng Ge; Nan Duan; Xiangyu Zhang

CVPR 2025

Abstract

We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones (namely Masked Teacher Forcing, MTF), enabling a smooth transition from token-level (patch-level) to frame-level autoregressive generation. CTF significantly outperforms MTF, achieving a 23% improvement in FVD scores on first-frame conditioned video prediction. To address issues like exposure bias, we employ targeted training strategies, setting a new benchmark in autoregressive video generation. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
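
The difference between CTF and MTF described above comes down to which copy of the earlier frames the model conditions on when it predicts a masked frame. The following is a minimal sketch of that distinction only, not the authors' implementation: the token grid shape, the `MASK_ID` sentinel, the mask ratio, and the helper names `mask_frames` and `build_context` are all assumptions made for illustration.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel for a masked token (not from the paper)


def mask_frames(tokens, mask_ratio, rng):
    """Replace a fixed fraction of tokens in every frame with MASK_ID."""
    num_frames, tokens_per_frame = tokens.shape
    num_masked = int(round(mask_ratio * tokens_per_frame))
    masked = tokens.copy()
    for t in range(num_frames):
        idx = rng.choice(tokens_per_frame, size=num_masked, replace=False)
        masked[t, idx] = MASK_ID
    return masked


def build_context(tokens, masked, t, mode):
    """Return (context frames, masked input for frame t).

    mode="ctf": earlier frames are the complete (unmasked) observations.
    mode="mtf": earlier frames are the masked copies.
    """
    if mode == "ctf":
        context = tokens[:t]
    elif mode == "mtf":
        context = masked[:t]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return context, masked[t]


rng = np.random.default_rng(0)
# Toy video: 4 frames, 16 tokens per frame, ids in [0, 1024) (assumed sizes).
tokens = rng.integers(0, 1024, size=(4, 16))
masked = mask_frames(tokens, mask_ratio=0.5, rng=rng)

ctf_context, ctf_target = build_context(tokens, masked, t=2, mode="ctf")
mtf_context, mtf_target = build_context(tokens, masked, t=2, mode="mtf")
```

In both modes the frame being predicted is masked in the same way; only the conditioning frames differ. Under CTF the context matches what the model sees at inference time, where previously generated frames are complete.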

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Models > Diffusion Models
Deep Learning > Models > Generative Models
Deep Learning > Techniques > Model Architecture
Computer Vision > Generation > Video Generation
Deep Learning > Techniques > Self-Supervised Learning

Keywords

video generation, autoregressive generation, diffusion model, masked modeling, teacher forcing, frame prediction, masked autoregressive model, token-level generation
