Mitigating Open-Vocabulary Caption Hallucinations

Assaf Ben-Kish; Moran Yanuka; Morris Alper; Raja Giryes; Hadar Averbuch-Elor

EMNLP 2024

Abstract

While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallucinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore, to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR benchmark and other existing metrics. We will release our code and models.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Generation > Image Captioning
Reinforcement Learning > Methods > Policy Learning
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Reinforcement Learning
Machine Learning > Learning Types > Multi-Objective Optimization
Computer Vision > Applications > Computer Vision

Keywords

reinforcement learning, image captioning, foundation model, hallucination mitigation, multimodal generation, open-vocabulary generation, generative foundation model, multi-objective reward, caption evaluation, open vocabulary hallucination
