NeurIPS 2021
CogView: Mastering Text-to-Image Generation via Transformers
Abstract
Text-to-image generation in the general domain has long been an open problem that requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with a VQ-VAE tokenizer, to advance this problem. We also demonstrate finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and DALL-E, a recent similar work.
Authors
Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang