Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Yuhui Zhang; Brandon McKinzie; Zhe Gan; Vaishaal Shankar; Alexander T Toshev

2024 EMNLP EMNLP 2024

Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation

Abstract

AbstractRecent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to language modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared to normal language model pre-training data, which causes the catastrophic degradation of language models’ capability.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuhui Zhang , Brandon McKinzie , Zhe Gan , Vaishaal Shankar , Alexander T Toshev

Topics

Artificial Intelligence > Learning Paradigms > Transfer Learning Deep Learning > Architectures > Transformers Deep Learning > Models > Generative Models

Keywords

catastrophic forgetting text-to-image generation pre-trained language model auto-regressive generation image tokenization

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024