THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

Prannay Kaul; Zhizhong Li; Hao Yang; Yonatan Dukler; Ashwin Swaminathan; C. J. Taylor; Stefano Soatto

2024 CVPR CVPR 2024

THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models

Abstract

Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses which we term "Type I hallucinations". Instead they focus on hallucinations responding to very specific question formats---typically a multiple-choice response regarding a particular object or attribute---which we term "Type II hallucinations". Additionally such benchmarks often require external API calls to models which are subject to change. In practice we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this we propose THRONE a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets we show that an improvement in existing metrics do not lead to a reduction in Type I hallucinations and that established benchmarks for measuring Type I hallucinations are incomplete. Finally we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline.

🧭 Keyword Pioneer — free-form generation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Prannay Kaul , Zhizhong Li , Hao Yang , Yonatan Dukler , Ashwin Swaminathan , C. J. Taylor , Stefano Soatto

Topics

Artificial Intelligence > Core AI > Interpretability Artificial Intelligence > Core AI > Responsible AI

Keywords

benchmark evaluation large vision-language model hallucination detection free-form generation object-based evaluation

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024