Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Fenghua Weng; Jian Lou; Jun Feng; Minlie Huang; Wenjie Wang

2025 EMNLP EMNLP 2025

Adversary-Aware DPO: Enhancing Safety Alignment in Vision Language Models via Adversarial Training

Abstract

AbstractSafety alignment is critical in pre-trained large language models (LLMs) to generate responses aligned with human values and refuse harmful queries. Unlike LLM, the current safety alignment of VLMs is often achieved with post-hoc safety fine-tuning. However, these methods are less effective to white-box attacks. To address this, we propose Adversary-aware DPO (ADPO), a novel training framework that explicitly considers adversary. Adversary-aware DPO (ADPO) integrates adversarial training into DPO to enhance the safety alignment of VLMs under worst-case adversarial perturbations. ADPO introduces two key components: (1) an adversarial-trained reference model that generates human-preferred responses under worst-case perturbations, and (2) an adversary-aware DPO loss that generates winner-loser pairs accounting for adversarial distortions. By combining these innovations, ADPO ensures that VLMs remain robust and reliable even in the presence of sophisticated jailbreak attacks. Extensive experiments demonstrate that ADPO outperforms baselines in terms of both safety alignment and general utility of VLMs.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Fenghua Weng , Jian Lou , Jun Feng , Minlie Huang , Wenjie Wang

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Adversarial Learning Artificial Intelligence > Core AI > Large Language Models Artificial Intelligence > Core AI > Adversarial Learning Deep Learning > Learning Types > Adversarial Learning

Keywords

direct preference optimization adversarial training safety alignment adversarial perturbation vision-language model jailbreak attack

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025