FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-based Pixel-level Data Synthesis

Wenzhuang Wang; Mingcan Ma; Yong Chen; Changqun Xia; Zhenbao Liang; Jia Li

2025 AAAI AAAI 2025

FreeGen: Bridging Visual-Linguistic Discrepancies Towards Diffusion-based Pixel-level Data Synthesis

Abstract

Abstract Text-to-image diffusion model has inspired research into text-to-data synthesis without human intervention, where spatial attentions correlated with semantic entities in text prompts are primarily interpreted as pseudo-masks. However, these vannila attentions often deliver visual-linguistic discrepancies, in which the associations between image features and entity-level tokens are unstable and divergent, yielding inferior masks for realistic applications, especially in more practical open-vocabulary settings. To tackle this issue, we propose a novel text-guided self-driven generative paradigm, termed FreeGen, which addresses the discrepancies by recalibrating intrinsic visual-linguistic correlations and serves as a real-data-free method to automatically synthesize open-vocabulary pixel-level data for arbitrary entities. Specifically, we first learn an Attention Self-Rectification mechanism to reproject the inherent attention matrices to achieve robust semantic alignment, thereby obtaining class-discriminative masks. A Temporal Fluctuation Factor is present to assess mask quality based on its variation over uniform sampling timesteps, enabling the selection of reliable masks. These masks are then employed as self-supervised signals to support the learning of an Entity-level Grounding Decoder in a self-training manner, thus producing open-vocabulary segmentation results. Extensive experiments show that the existing segmenters trained on FreeGen narrow the performance gap with real data counterparts and remarkably outperform the state-of-the-art methods.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Wenzhuang Wang , Mingcan Ma , Yong Chen , Changqun Xia , Zhenbao Liang , Jia Li

Topics

Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Models > Diffusion Models Computer Vision > Processing > Image Segmentation Natural Language Processing > Applications > Information Extraction Computer Vision > Processing > Semantic Segmentation

Keywords

semantic segmentation self-supervised learning attention mechanism open-vocabulary segmentation text-to-image diffusion data synthesis visual-linguistic alignment

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025