Continual SFT Matches Multimodal RLHF with Negative Supervision

Ke Zhu; Yu Wang; Yanpeng Sun; Qiang Chen; Jiangjiang Liu; Gang Zhang; Jingdong Wang

CVPR 2025

Abstract

Multimodal RLHF usually follows the supervised finetuning (SFT) stage to further improve the comprehension of vision-language models (VLMs). Conventional wisdom holds that it is superior to continual SFT during this preference alignment stage. In this paper, we observe that the inherent value of multimodal RLHF lies in its negative supervision, the logit of the rejected responses. We thus propose a novel negative supervised finetuning (nSFT) approach that fully exploits this information. nSFT disentangles this negative supervision from the RLHF paradigm and continually aligns VLMs with a simple SFT loss. This is more memory-efficient than multimodal RLHF, where 2 (e.g., DPO) or 4 (e.g., PPO) large VLMs are strictly required. The effectiveness of nSFT is rigorously validated by comparing it with various multimodal RLHF approaches across different dataset sources, base VLMs, and evaluation metrics. In addition, extensive ablations are provided to support our hypothesis. Code will be available at https://github.com/Kevinz-code/nSFT/.
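To make the memory argument in the abstract concrete, the following minimal sketch contrasts a plain SFT loss with the standard DPO objective for a single preference pair. It is not the nSFT implementation and does not reproduce the paper's method; the function names and the scalar log-probability inputs are assumptions chosen for illustration. The point it shows is that the DPO loss needs log-probabilities from both the trained policy and a frozen reference model (hence two VLMs in memory), and that the rejected response enters the loss with a negative sign, whereas an SFT loss only needs the policy.

```python
import math


def sft_loss(token_logps):
    """Mean negative log-likelihood of the target tokens under the policy.

    token_logps: per-token log-probabilities from the policy model only.
    """
    return -sum(token_logps) / len(token_logps)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All four inputs are sequence-level log-probabilities. The ref_* values
    come from a second, frozen reference model, which is why DPO keeps two
    models in memory. beta=0.1 is an illustrative default, not a value
    taken from the paper.
    """
    chosen_reward = policy_chosen - ref_chosen
    rejected_reward = policy_rejected - ref_rejected  # the negative supervision
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

In this sketch, lowering the policy's log-probability of the rejected response while holding everything else fixed reduces the DPO loss, which is the signal a plain SFT loss on the chosen response alone does not carry.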

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Machine Learning > Learning Types > Reinforcement Learning
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Fine-Tuning
Deep Learning > Learning Types > Reinforcement Learning from Human Feedback

Keywords

direct preference optimization, preference alignment, vision language model, vision-language model, supervised fine-tuning, supervised finetuning, negative supervision, multimodal reinforcement learning, multimodal rlhf
