FUSE: Fine-Grained and Semantic-Aware Learning for Unified Image Understanding and Generation

Peng Zhang; Wanggui He; Mushui Liu; Wenyi Xiao; Siyu Zou; Yuan Li; Xingjian Wang; Guanghao Zhang; Yanpeng Liu; Weilong Dai; Jinlong Liu; Shuyi Ying; Ruikai Zhou; Yunlong Yu; Yubo Tao; Hai Lin; Hao Jiang

AAAI 2026

Abstract

Recent unified models have demonstrated that the reasoning capacity of Multimodal Large Language Models (MLLMs) can be leveraged to facilitate diffusion-based image generation with impressive flexibility and performance. However, approaches that rely heavily on MLLMs for high-level semantic encoding often struggle with fine-grained visual tasks such as image editing and virtual try-on. To address this gap, we propose FUSE, a unified framework that excels at both high-level vision–language understanding and fine-grained generation. First, we introduce a Semantic-to-Detail Connector that pre-aligns fine-grained visual features with the MLLM's semantic space. This design counteracts the low-level information loss inherent in MLLM encodings, creating a unified representation that steers the diffusion process with both global semantics and rich local details. Second, to further enhance semantic awareness and detail preservation, we introduce Adaptive-GRPO, a post-training objective that dynamically balances semantic coherence against pixel-level fidelity. Together, these two components allow FUSE to generate images that are both semantically faithful and visually fine-grained. Comprehensive experiments on text-to-image and instruction-guided editing benchmarks show that FUSE significantly outperforms existing unified baselines, achieving 0.89 on GenEval, 0.65 on WISE, and 3.88 on ImageEdit.
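As a rough illustration of what balancing semantic coherence against pixel-level fidelity can look like in a GRPO-style objective, the sketch below blends two per-sample scores into one reward and converts the rewards for a group of samples into group-relative advantages (reward minus group mean, divided by group standard deviation). This is not the paper's Adaptive-GRPO: the abstract does not specify how the balance is adapted, so `weight` is a fixed, hypothetical parameter here, and the function names and the example scores are assumptions made purely for illustration.

```python
from statistics import mean, pstdev


def blended_reward(semantic, fidelity, weight):
    """Convex combination of two scores (illustrative assumption).

    `weight` is the share given to the semantic score; the rest goes to
    the fidelity score. How Adaptive-GRPO sets this balance is not
    described in the abstract, so it is a plain argument here.
    """
    return weight * semantic + (1.0 - weight) * fidelity


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group of samples.

    Each reward is centered on the group mean and scaled by the group
    standard deviation; `eps` avoids division by zero when all rewards
    in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four images sampled from the same prompt.
semantic = [0.9, 0.6, 0.8, 0.4]
fidelity = [0.5, 0.9, 0.7, 0.6]

rewards = [blended_reward(s, f, weight=0.5) for s, f in zip(semantic, fidelity)]
advantages = group_relative_advantages(rewards)
```

In this toy setting, samples whose blended reward is above the group mean receive a positive advantage and the rest a negative one; changing `weight` shifts which samples are favored, which is the trade-off the abstract describes at a high level.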

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Planning

Keywords

image generation; semantic analysis; image editing; semantic alignment; diffusion model; multimodal large language model; image understanding; fine-grained visual understanding
