Image Difference Captioning via Adversarial Preference Optimization

Zihan Huang; Junda Wu; Rohan Surana; Tong Yu; David Arbour; Ritwik Sinha; Julian McAuley

2025 EMNLP EMNLP 2025

Image Difference Captioning via Adversarial Preference Optimization

Abstract

AbstractImage Difference Captioning (IDC) aims to generate natural language descriptions that highlight subtle differences between two visually similar images. While recent advances leverage pre-trained vision-language models to align fine-grained visual differences with textual semantics, existing supervised approaches often overfit to dataset-specific language patterns and fail to capture accurate preferences on IDC, which often indicates fine-grained and context-aware distinctions. To address these limitations, we propose an adversarial direct preference optimization (ADPO) framework for IDC, which formulates IDC as a preference optimization problem under the Bradley-Terry-Luce model, directly aligning the captioning policy with pairwise difference preferences via Direct Preference Optimization (DPO). To model more accurate and diverse IDC preferences, we introduce an adversarially trained hard negative retriever that selects counterfactual captions, This results in a minimax optimization problem, which we solve via policy-gradient reinforcement learning, enabling the policy and retriever to improve jointly. Experiments on benchmark IDC datasets show that our approach outperforms existing baselines, especially in generating fine-grained and accurate difference descriptions.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning and Reinforcement Learning

🧭 Keyword Pioneer — adversarial direct preference optimization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zihan Huang , Junda Wu , Rohan Surana , Tong Yu , David Arbour , Ritwik Sinha , Julian McAuley

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Adversarial Learning Computer Vision > Generation > Image Captioning Reinforcement Learning > Methods > Policy Learning Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Learning Types > Preference Learning

Keywords

adversarial learning preference optimization image captioning fine-grained recognition vision-language model image difference captioning image difference adversarial direct preference optimization adversarial negative retriever policy-gradient reinforcement learning

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025