FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Sotiris Anagnostidis; Gregor Bachmann; Yeongmin Kim; Jonas Köhler; Markos Georgopoulos; Artsiom Sanakoyeu; Yuming Du; Albert Pumarola; Ali Thabet; Edgar Schonfeld

2025 CVPR CVPR 2025

FlexiDiT: Your Diffusion Transformer Can Easily Generate High-Quality Samples with Less Compute

Abstract

Despite their remarkable performance, modern Diffusion Transformers (DiTs) are hindered by substantial resource requirements during inference, stemming from the fixed and large amount of compute needed for each denoising step. In this work, we revisit the conventional static paradigm that allocates a fixed compute budget per denoising iteration and propose a dynamic strategy instead. Our simple and sample-efficient framework enables pre-trained DiT models to be converted into flexible ones --- dubbed FlexiDiT --- allowing them to process inputs at varying compute budgets. We demonstrate how a single flexible model can generate images without any drop in quality, while reducing the required FLOPs by more than 40% compared to their static counterparts, for both class-conditioned and text-conditioned image generation. Our method is general and agnostic to input and conditioning modalities. We show how our approach can be readily extended for video generation, where FlexiDiT models generate samples with up to 75% less compute without compromising performance.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — dynamic compute

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sotiris Anagnostidis , Gregor Bachmann , Yeongmin Kim , Jonas Köhler , Markos Georgopoulos , Artsiom Sanakoyeu , Yuming Du , Albert Pumarola , Ali Thabet , Edgar Schonfeld

Topics

Artificial Intelligence > Core AI > Model Compression Machine Learning > Application Areas > Efficient Computing Deep Learning > Architectures > Transformers Deep Learning > Models > Diffusion Models Computer Vision > Generation > Image Generation Deep Learning > Learning Types > Generative Models

Keywords

image generation video generation efficient computing generative model compute efficiency inference efficiency diffusion transformer compute budget dynamic compute

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025