X-Fusion: Introducing New Modality to Frozen Large Language Models

Sicheng Mo; Thao Nguyen; Xun Huang; Siddharth Srinivasan Iyer; Yijun Li; Yuchen Liu; Abhishek Tandon; Eli Shechtman; Krishna Kumar Singh; Yong Jae Lee; Bolei Zhou; Yuheng Li

ICCV 2025

Abstract

We propose X-Fusion, a framework that extends pretrained Large Language Models (LLMs) for multimodal tasks while preserving their language capabilities. X-Fusion employs a dual-tower design with modality-specific weights, keeping the LLM's parameters frozen while integrating vision-specific information for both understanding and generation. We find that incorporating understanding-focused data improves generation quality, reducing image data noise enhances overall performance, and feature alignment accelerates convergence for smaller models but has minimal impact on larger ones. Our findings provide valuable insights into building efficient unified multimodal models.
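The dual-tower idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Tower` class, the one-parameter "layer", the routing of each token to a tower by its modality tag, and the `sgd_step` helper are all hypothetical names and simplifying assumptions introduced here. The sketch only shows the two properties the abstract states: the language weights stay frozen, and separate vision-specific weights are the ones that get trained.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tower:
    """Hypothetical toy tower with a single scalar weight: y = weight * x."""
    weight: float
    trainable: bool

    def forward(self, x: float) -> float:
        return self.weight * x

    def sgd_step(self, grad: float, lr: float) -> None:
        # A frozen tower ignores every update, so its weight never changes.
        if self.trainable:
            self.weight -= lr * grad


def dual_tower_forward(
    tokens: List[float],
    modalities: List[str],
    text_tower: Tower,
    vision_tower: Tower,
) -> List[float]:
    """Assumed routing: image tokens use the vision tower, all others the text tower."""
    outputs = []
    for x, modality in zip(tokens, modalities):
        tower = vision_tower if modality == "image" else text_tower
        outputs.append(tower.forward(x))
    return outputs


# Frozen language weights, trainable vision-specific weights.
text_tower = Tower(weight=2.0, trainable=False)
vision_tower = Tower(weight=0.5, trainable=True)

outputs = dual_tower_forward(
    tokens=[1.0, 2.0, 4.0],
    modalities=["text", "image", "image"],
    text_tower=text_tower,
    vision_tower=vision_tower,
)

# One illustrative update applied to both towers; only the vision tower moves.
text_tower.sgd_step(grad=1.0, lr=0.1)
vision_tower.sgd_step(grad=1.0, lr=0.1)
```

In this toy setup the text tower's weight is unchanged after the update step, which is the sense in which language capabilities are preserved, while the vision tower's weight is free to adapt. How the two towers exchange information in the actual model is not described in the abstract and is left out here.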

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Model Merging

Keywords

multimodal learning, frozen large language model, dual-tower design, modality-specific weight
