VCGD: Visual Clue Guided Decoding with Caption Model for Mitigating Hallucination in Multimodal Large Language Models

Guoqing Chen; Fu Zhang; Bingqian Liu; Chenglong Lu; Jingwei Cheng

AAAI 2026

Abstract

Multimodal large language models (MLLMs) demonstrate strong capabilities in multimodal understanding, reasoning, and interaction but still face the fundamental limitation of hallucinations, where they generate erroneous or fabricated information. Most existing research induces hallucinations by manually perturbing visual or instruction inputs, then uses output differences or model-generated descriptions as references to mitigate hallucinations and improve response-visual consistency. However, these methods are constrained by model capabilities and prone to hallucination propagation. We propose Visual Clue Guided Decoding (VCGD), a novel decoding strategy that introduces an auxiliary Caption Model to generate precise visual clues during decoding for guiding model generation. It further incorporates image confidence constraints to critically suppress hallucination propagation during generation, thereby significantly improving content reliability and visual consistency. Specifically, VCGD leverages high-quality visual descriptions to guide MLLMs in correcting perceptual biases while generating answers. Furthermore, we introduce a Reinforcement Learning-based training paradigm for the Caption Model, in which a Reward Agent provides feedback on the quality of visual clues, further enhancing the accuracy of visual information. Extensive experiments across multiple benchmark datasets and state-of-the-art MLLMs demonstrate that VCGD significantly reduces hallucination rates and improves cross-modal consistency. Our method exhibits strong generalizability and scalability, offering an effective decoding enhancement strategy that can be seamlessly integrated into existing multimodal frameworks.
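The abstract does not give the decoding rule itself, so the following is only a minimal sketch of the general idea of clue-guided decoding with a confidence constraint, not the paper's formulation. It assumes two sets of next-token logits are available: `base_logits` from the MLLM given the image and prompt, and `clue_logits` from the same model when the caption model's visual clues are added to the context. The interpolation weight `alpha`, the cutoff ratio `beta`, and all function names are hypothetical. The confidence mask is borrowed from the adaptive plausibility constraint used in contrastive decoding and stands in for the paper's image confidence constraint.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guided_scores(base_logits, clue_logits, alpha=1.0, beta=0.1):
    """Shift image-conditioned logits toward clue-conditioned logits.

    Assumption (not from the paper): tokens whose image-conditioned
    probability is below beta * max probability are masked out, so the
    clues cannot promote tokens the image itself gives little support to.
    """
    p_base = softmax(base_logits)
    cutoff = beta * max(p_base)
    scores = []
    for b, c, p in zip(base_logits, clue_logits, p_base):
        if p < cutoff:
            scores.append(float("-inf"))  # fails the confidence constraint
        else:
            scores.append(b + alpha * (c - b))  # move toward the clue logits
    return scores


def pick_token(base_logits, clue_logits, alpha=1.0, beta=0.1):
    """Greedy choice: index of the highest guided score."""
    scores = guided_scores(base_logits, clue_logits, alpha, beta)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy vocabulary of three tokens.
base = [2.0, 1.9, -5.0]   # image-only logits: tokens 0 and 1 are close
clue = [1.0, 3.0, 10.0]   # clue-conditioned logits favour tokens 1 and 2
choice = pick_token(base, clue, alpha=1.0, beta=0.1)
```

In this toy case the clues break the near-tie between tokens 0 and 1, while token 2 is excluded because the image-conditioned distribution assigns it almost no probability. Setting `alpha=0` recovers ordinary greedy decoding on `base_logits`.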

Authors

Guoqing Chen, Fu Zhang, Bingqian Liu, Chenglong Lu, Jingwei Cheng

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Fairness
Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning, multimodal large language model, hallucination mitigation, decoding strategy, visual clue, caption model
