WACV 2026

Diffusion Noise Optimization for Synthetic VLM Training

Abstract

Recent advances in image generation models have enabled the production of high-quality images, making synthetic images a promising alternative to real images for dataset construction. However, with conventional approaches, the performance of Vision-Language Models (VLMs) tends to degrade as the proportion of synthetic images in a dataset increases. To address this challenge, we introduce a plug-and-play dataset construction framework that enhances text-to-image diffusion models by optimizing their initial noise. Our method treats the initial noise as a learnable parameter and iteratively updates it to maximize text-image alignment, as measured by multiple embedding models, without retraining the generator. Because the initial noise plays a crucial role in determining the quality of the synthetic image, optimizing it enables a search for initial conditions that yield semantically faithful and realistic images. Our approach improves FID and text-image alignment over conventional latent diffusion model (LDM)-based methods, producing synthetic images better suited for training. CLIP models trained on these images achieved up to 5.09% higher Average R@1 in zero-shot retrieval, 2.88% higher Average top-1 accuracy in zero-shot classification, and 5.05% higher linear-probing performance. These results demonstrate that initial noise optimization is an effective and scalable strategy for enabling robust VLM training with synthetic images.
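
The core idea, treating the initial noise as the only learnable quantity while the generator stays frozen, can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: a fixed `tanh` linear map stands in for the diffusion model, two random linear maps stand in for the embedding models, the text embeddings are random vectors, and finite-difference gradient ascent with a simple accept-or-shrink step replaces backpropagation through a real sampler. All function and variable names are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity with a small constant to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def random_matrix(rng, rows, cols):
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def generate(weights, noise):
    # Toy stand-in (assumption) for a frozen text-to-image generator.
    return [math.tanh(x) for x in matvec(weights, noise)]


def alignment(weights, embedders, text_embs, noise):
    # Mean text-image cosine similarity over several toy "embedding models".
    image = generate(weights, noise)
    scores = [cosine(matvec(e, image), t) for e, t in zip(embedders, text_embs)]
    return sum(scores) / len(scores)


def optimize_noise(weights, embedders, text_embs, noise, steps=30, lr=0.5, eps=1e-4):
    """Gradient ascent on the noise only; the generator weights are never modified."""
    z = list(noise)
    best = alignment(weights, embedders, text_embs, z)
    for _ in range(steps):
        grad = []
        for i in range(len(z)):
            # Central finite difference for the i-th noise coordinate.
            up = z[:i] + [z[i] + eps] + z[i + 1:]
            down = z[:i] + [z[i] - eps] + z[i + 1:]
            diff = alignment(weights, embedders, text_embs, up) - alignment(
                weights, embedders, text_embs, down
            )
            grad.append(diff / (2 * eps))
        candidate = [zi + lr * gi for zi, gi in zip(z, grad)]
        score = alignment(weights, embedders, text_embs, candidate)
        if score > best:
            z, best = candidate, score  # accept the improving step
        else:
            lr *= 0.5  # otherwise shrink the step size and retry
    return z, best


rng = random.Random(0)
NOISE_DIM, IMAGE_DIM, EMBED_DIM, NUM_EMBEDDERS = 8, 16, 6, 2

weights = random_matrix(rng, IMAGE_DIM, NOISE_DIM)
embedders = [random_matrix(rng, EMBED_DIM, IMAGE_DIM) for _ in range(NUM_EMBEDDERS)]
text_embs = [[rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(NUM_EMBEDDERS)]
z0 = [rng.gauss(0.0, 1.0) for _ in range(NOISE_DIM)]

initial_score = alignment(weights, embedders, text_embs, z0)
z_opt, final_score = optimize_noise(weights, embedders, text_embs, z0)
print(f"alignment before: {initial_score:.4f}, after: {final_score:.4f}")
```

In a real pipeline the same loop would presumably be driven by gradients backpropagated through the sampler and actual embedding models; the sketch only shows the structure: a frozen generator, an alignment score averaged over several embedders, and updates applied to the noise alone.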
