U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

Yuchuan Tian; Zhijun Tu; Hanting Chen; Jie Hu; Chao Xu; Yunhe Wang

NeurIPS 2024

Abstract

Diffusion Transformers (DiTs) introduce the transformer architecture to diffusion tasks for latent-space image generation. With an isotropic architecture that chains a series of transformer blocks, DiTs demonstrate competitive performance and good scalability. However, the abandonment of U-Net by DiTs and their follow-up works is worth rethinking. To this end, we conduct a simple toy experiment comparing a U-Net-architectured DiT with an isotropic one. It turns out that the U-Net architecture gains only a slight advantage despite the U-Net inductive bias, indicating potential redundancies within the U-Net-style DiT. Inspired by the discovery that U-Net backbone features are low-frequency-dominated, we perform token downsampling on the query-key-value tuple for self-attention, which brings further improvements despite a considerable reduction in computation. Based on self-attention with downsampled tokens, we propose a series of U-shaped DiTs (U-DiTs) and conduct extensive experiments to demonstrate the strong performance of U-DiT models. The proposed U-DiT can outperform DiT-XL with only 1/6 of its computation cost. Code is available at https://github.com/YuchuanTian/U-DiT.
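To make the idea of self-attention over a downsampled query-key-value tuple more concrete, the following is a minimal NumPy sketch. It is not the authors' implementation (that lives in the repository linked above). The abstract does not specify the downsampling operator, so the sketch assumes plain strided subsampling that splits the token grid into factor² interleaved groups and runs attention inside each group. All function and parameter names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention; q, k, v have shape (N, d).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def downsampled_self_attention(x, w_q, w_k, w_v, factor=2):
    """Self-attention on a spatially downsampled Q/K/V tuple (illustrative).

    x: token grid of shape (H, W, C); w_q, w_k, w_v: (C, C) projections.
    ASSUMPTION: downsampling is strided subsampling into factor**2
    interleaved groups, so every token is still processed exactly once.
    """
    H, W, C = x.shape
    assert H % factor == 0 and W % factor == 0
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    out = np.empty_like(q)
    for i in range(factor):
        for j in range(factor):
            # Tokens at rows i, i+factor, ... and columns j, j+factor, ...
            qs = q[i::factor, j::factor].reshape(-1, C)
            ks = k[i::factor, j::factor].reshape(-1, C)
            vs = v[i::factor, j::factor].reshape(-1, C)
            o = attention(qs, ks, vs)
            # Scatter the group result back to its original grid positions.
            out[i::factor, j::factor] = o.reshape(H // factor, W // factor, C)
    return out


def attention_score_entries(num_tokens, factor=1):
    # Number of query-key score entries computed across all groups.
    groups = factor * factor
    per_group = num_tokens // groups
    return groups * per_group * per_group


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 8))
w_q, w_k, w_v = (rng.standard_normal((8, 8)) for _ in range(3))
y = downsampled_self_attention(x, w_q, w_k, w_v, factor=2)
```

Under this assumed scheme, a 16×16 grid (256 tokens) with `factor=2` needs 4 × 64² = 16384 score entries instead of 256² = 65536, so the attention-map cost drops to one quarter while the output keeps the input's spatial shape. This illustrates only the attention step; the 1/6 overall cost figure in the abstract refers to the full U-DiT model and is not derived from this sketch.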

Topics

Deep Learning > Architectures > Transformers; Deep Learning > Models > Diffusion Models; Deep Learning > Techniques > Model Architecture; Computer Vision > Generation > Image Generation; Deep Learning > Models > Transformers

Keywords

vision transformer; image generation; u-net architecture; latent space; diffusion transformer; u-shaped architecture; token downsampling; latent image generation
