MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models

Jianhong Tu; Zhuohao Ni; Nicholas Crispino; Zihao Yu; Michael Bendersky; Beliz Gunel; Ruoxi Jia; Xin Liu; Lingjuan Lyu; Dawn Song; Chenguang Wang

ACL 2025

Abstract

We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work has not sufficiently examined the importance of each modality in the instruction tuning stage: it typically trains on a majority of vision-language data, keeps text-only data limited, and fixes the mixture of modalities. We incorporate diverse text-only data into the visual instruction tuning stage and vary the amount of vision-language data across controlled experiments to investigate the importance of each modality. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply increasing the amount of sufficiently diverse text-only data enables transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Learning Paradigms > Transfer Learning
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Transfer Learning
Machine Learning > Learning Types > Fine-Tuning
Deep Learning > Learning Types > Transfer Learning
Deep Learning > Models > Multimodal Learning

Keywords

zero-shot learning, transfer learning, knowledge transfer, multimodal learning, instruction tuning, vision-language model, zero-shot generalization, multimodal language model, text-only training
