CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding; Zhuoyi Yang; Wenyi Hong; Wendi Zheng; Chang Zhou; Da Yin; Junyang Lin; Xu Zou; Zhou Shao; Hongxia Yang; Jie Tang

2021 NIPS NeurIPS 2021

CogView: Mastering Text-to-Image Generation via Transformers

Abstract

Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning

🐣 Hot Topic Early Bird — text-to-image generation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ming Ding , Zhuoyi Yang , Wenyi Hong , Wendi Zheng , Chang Zhou , Da Yin , Junyang Lin , Xu Zou , Zhou Shao , Hongxia Yang , Jie Tang

Topics

Deep Learning > Architectures > Transformers Deep Learning > Models > Generative Models Computer Vision > Generation > Image Generation Deep Learning > Learning Types > Generative Models

Keywords

transformer architecture text-to-image generation vector quantization generative adversarial network super resolution

Download PDF

Related papers

Mosaicking to Distill: Knowledge Distillation from Out-of-Domain Data 2021

On Model Calibration for Long-Tailed Object Detection and Instance Segmentation 2021

Test-Time Personalization with a Transformer for Human Pose Estimation 2021

NTopo: Mesh-free Topology Optimization using Implicit Neural Representations 2021

Scalable Intervention Target Estimation in Linear Models 2021