Goku: Flow Based Video Generative Foundation Models

Shoufa Chen; Chongjian GE; Yuqi Zhang; Yida Zhang; Fengda Zhu; Hao Yang; Hongxiang Hao; Hui Wu; Zhichao Lai; Yifei Hu; Ting-Che Lin; Shilong Zhang; Fu Li; Chuan Li; Xing Wang; Yanghua Peng; Peize Sun; Ping Luo; Yi Jiang; Zehuan Yuan; Bingyue Peng; Xiaobing Liu

CVPR 2025

Abstract

This paper introduces Goku, a state-of-the-art family of joint image-and-video generation models leveraging rectified flow Transformers to achieve industry-leading performance. We detail the foundational elements enabling high-quality visual generation, including the data curation pipeline, model architecture design, flow formulation, and advanced infrastructure for efficient and robust large-scale training. The Goku models demonstrate superior performance in both qualitative and quantitative evaluations, setting new benchmarks across major tasks. Specifically, Goku achieves 0.76 on GenEval and 83.65 on DPG-Bench for text-to-image generation, and 84.85 on VBench for text-to-video tasks. We believe that this work provides valuable insights and practical advancements for the research community in developing joint image-and-video generation models.
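The abstract names rectified flow as the model's flow formulation without spelling it out. As a rough illustration only, the sketch below shows the standard rectified-flow training pair: a sample is linearly interpolated between noise and data, and the network is trained to regress the constant velocity between them. This is the generic textbook formulation, not necessarily the exact variant used in Goku; the function names, tensor shapes, and the uniform timestep sampling are all assumptions made for the example.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Build one rectified-flow training pair (generic formulation, assumed here).

    x0: noise samples, shape (batch, ...)
    x1: data samples (e.g. latents), same shape as x0
    t:  timesteps in [0, 1], shape (batch,)
    """
    # Reshape t so it broadcasts over every non-batch dimension.
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))
    # Straight-line interpolation between noise (t=0) and data (t=1).
    x_t = t * x1 + (1.0 - t) * x0
    # The regression target is the constant velocity along that line.
    v_target = x1 - x0
    return x_t, v_target

def flow_matching_loss(v_pred, v_target):
    """Mean squared error between predicted and target velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

# Toy data standing in for image or video latents (hypothetical shapes).
rng = np.random.default_rng(0)
x1 = rng.normal(size=(4, 8))   # "data"
x0 = rng.normal(size=(4, 8))   # Gaussian noise
t = rng.uniform(size=4)        # assumed uniform timestep sampling

x_t, v_target = rectified_flow_pair(x0, x1, t)
# A real model would predict v from (x_t, t, text conditioning);
# the perfect prediction is used here only to exercise the loss.
loss = flow_matching_loss(v_target, v_target)
```

In practice the velocity predictor would be the Transformer described in the paper, conditioned on the timestep and the text prompt; the sketch only covers how the input and target are constructed.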

Topics

Artificial Intelligence > Core AI > Foundation Models
Deep Learning > Models > Diffusion Models
Deep Learning > Models > Generative Models
Computer Vision > Generation > Image Generation
Computer Vision > Generation > Video Generation

Keywords

transformer architecture, image generation, video generation, text-to-image generation, flow matching, flow-based model, text-to-video generation, rectified flow, visual generation, flow transformer, generative foundation model
