OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference

Xiangyu Zhao; Shengyuan Ding; Zicheng Zhang; Haian Huang; Maosong Cao; Weiyun Wang; Jiaqi Wang; Xinyu Fang; Wenhai Wang; Guangtao Zhai; Haodong Duan; Hua Yang; Kai Chen

ACL 2025

Abstract

Recent advancements in open-source multi-modal large language models (MLLMs) have primarily focused on enhancing foundational capabilities, leaving a significant gap in human preference alignment. This paper introduces OmniAlign-V, a comprehensive dataset of 200K high-quality training samples featuring diverse images, complex questions, and varied response formats to improve MLLMs’ alignment with human preferences. We also present MM-AlignBench, a human-annotated benchmark specifically designed to evaluate MLLMs’ alignment with human values. Experimental results show that finetuning MLLMs with OmniAlign-V, using Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO), significantly enhances human preference alignment while maintaining or enhancing performance on standard VQA benchmarks, preserving their fundamental capabilities.

Authors

Xiangyu Zhao, Shengyuan Ding, Zicheng Zhang, Haian Huang, Maosong Cao, Weiyun Wang, Jiaqi Wang, Xinyu Fang, Wenhai Wang, Guangtao Zhai, Haodong Duan, Hua Yang, Kai Chen

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Reinforcement Learning from Human Feedback
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

visual question answering; direct preference optimization; multi-modal large language model; supervised fine-tuning; human preference alignment

Related papers

Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights 2025

CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision 2025

Structural Deep Encoding for Table Question Answering 2025

Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating 2025
