One-Minute Video Generation with Test-Time Training

Karan Dalal; Daniel Koceja; Jiarui Xu; Yue Zhao; Shihao Han; Ka Chun Cheung; Jan Kautz; Yejin Choi; Yu Sun; Xiaolong Wang

CVPR 2025

Abstract

Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle to produce coherent scenes because their hidden states are small and less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore larger and more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. We curate a dataset based on Tom and Jerry cartoons as a proof-of-concept benchmark. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complete stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, our results are still limited in physical realism, and the efficiency of our implementation can be further improved. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
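To make the idea of a hidden state that is itself a trainable model concrete, the following is a minimal sketch and not the authors' implementation. It assumes the simplest possible inner model (a single linear map `W`), a squared-error self-supervised loss, and one plain gradient step per token. The function name `ttt_linear_scan`, the projection matrices `theta_k`, `theta_v`, `theta_q`, and the step size `lr` are all illustrative names chosen here.

```python
import numpy as np

def ttt_linear_scan(tokens, theta_k, theta_v, theta_q, lr=0.1):
    """Toy test-time-training scan (illustrative assumption, not the paper's code).

    The hidden state is the weight matrix W of a small linear model.
    For each token, W takes one gradient step on a self-supervised
    loss, then the updated W produces that token's output.
    """
    d_inner = theta_k.shape[0]
    W = np.zeros((d_inner, d_inner))  # hidden state = inner model weights
    outputs = []
    for x in tokens:
        k = theta_k @ x  # "training input" view of the token
        v = theta_v @ x  # "training target" view of the token
        q = theta_q @ x  # "test input" view of the token
        err = W @ k - v  # inner model's prediction error
        grad = 2.0 * np.outer(err, k)  # gradient of ||W k - v||^2 w.r.t. W
        W = W - lr * grad  # update rule: one gradient step per token
        outputs.append(W @ q)  # output rule: apply the updated inner model
    return np.stack(outputs), W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = rng.normal(size=(8, 4))  # 8 tokens of dimension 4
    eye = np.eye(4)
    out, state = ttt_linear_scan(seq, eye, eye, eye, lr=0.05)
    print(out.shape, state.shape)
```

In this sketch the state has `d_inner * d_inner` entries rather than a single vector, which is the sense in which such a hidden state can be larger. Swapping the linear map for a small MLP would make it more expressive at the cost of a more involved gradient step.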

Topics

Artificial Intelligence > Learning Paradigms > Meta-Learning
Deep Learning > Architectures > Transformers
Deep Learning > Models > Generative Models

Keywords

video generation; state-space model; long context; test-time training
