IJCAI 2025

A Dual Stream Visual Tokenizer for LLM Image Generation

Abstract

We propose a novel visual tokenizer that represents images by combining high-level semantic tokens with low-level pixel tokens, addressing the challenges of image-to-sequence conversion for Large Language Models (LLMs). Existing visual tokenizers, such as VQ-VAE and diffusion-based models, either suffer from token explosion as image resolution increases or fail to capture detailed structural information. Our method introduces a dual-token system: high-level semantic tokens capture the main content of the image, while low-level pixel tokens preserve structural details. We integrate these tokens in a hybrid architecture in which a VQ-VAE branch generates low-resolution guidance and a diffusion process reconstructs high-resolution images with both semantic coherence and structural accuracy. This approach significantly reduces the number of required tokens and improves image reconstruction quality, offering an efficient solution for LLM-based tasks such as image generation and understanding.
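To make the token-explosion argument concrete, the following is a minimal sketch of the arithmetic, not the paper's implementation. It assumes a standard grid tokenizer that emits one token per patch, and a hypothetical dual-stream budget made of a fixed number of semantic tokens plus the pixel tokens of a low-resolution guidance image. The function names and the values 16 (downsampling factor), 32 (semantic tokens), and 128 (guidance resolution) are illustrative assumptions and do not come from the abstract.

```python
def grid_token_count(height: int, width: int, downsample: int = 16) -> int:
    """Tokens emitted by a grid tokenizer that maps each
    downsample x downsample patch to a single discrete code."""
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


def dual_stream_token_count(
    num_semantic: int = 32,    # assumed number of high-level semantic tokens
    guidance_size: int = 128,  # assumed side length of the low-resolution guidance
    downsample: int = 16,      # assumed downsampling factor of the pixel branch
) -> int:
    """Hypothetical budget: semantic tokens plus pixel tokens of the
    low-resolution guidance. It does not depend on the output resolution,
    because the upscaling is left to the diffusion decoder."""
    return num_semantic + grid_token_count(guidance_size, guidance_size, downsample)


if __name__ == "__main__":
    for side in (256, 512, 1024):
        single = grid_token_count(side, side)
        dual = dual_stream_token_count()
        print(f"{side}x{side}: grid tokenizer = {single} tokens, dual stream = {dual} tokens")
```

Under these assumed settings, the grid tokenizer's sequence length grows quadratically with the image side, while the dual-stream budget stays constant because only the guidance image is tokenized at the pixel level.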
