MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Zihan Cao; Yu Zhong; Ziqi Wang; Liang-Jian Deng

2025 ICCV ICCV 2025

MMAIF: Multi-task and Multi-degradation All-in-One for Image Fusion with Language Guidance

Abstract

Image fusion, a fundamental low-level vision task, aims to integrate multiple image sequences into a single output while preserving as much information as possible from the input. However, existing methods face several significant limitations: 1) requiring task- or dataset-specific models; 2) neglecting real-world image degradations (e.g., noise), which causes failure when processing degraded inputs; 3) operating in pixel space, where attention mechanisms are computationally expensive; and 4) lacking user interaction capabilities.To address these challenges, we propose a unified framework for multi-task, multi-degradation, and language-guided image fusion. Our framework includes two key components: 1) a practical degradation pipeline that simulates real-world image degradations and generates interactive prompts to guide the model; 2) an all-in-one Diffusion Transformer (DiT) operating in latent space, which fuses a clean image conditioned on both the degraded inputs and the generated prompts. Furthermore, we introduce principled modifications to the original DiT architecture to better suit the fusion task. Based on this framework, we develop two versions of the model: Regression-based and Flow Matching-based variants.Extensive qualitative and quantitative experiments demonstrate that our approach effectively addresses the aforementioned limitations and outperforms previous restoration+fusion and all-in-one pipelines.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zihan Cao , Yu Zhong , Ziqi Wang , Liang-Jian Deng

Topics

Machine Learning > Application Areas > Domain Adaptation Deep Learning > Models > Diffusion Models Deep Learning > Techniques > Model Architecture

Keywords

image fusion latent space language guidance diffusion transformer

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025