FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs

Forrest Sheng Bao; Miaoran Li; Renyi Qu; Ge Luo; Erana Wan; Yujia Tang; Weisi Fan; Manveer Singh Tamber; Suleman Kazi; Vivek Sourabh; Mike Qi; Ruixuan Tu; Chenyu Xu; Matthew Gonzales; Ofer Mendelevitch; Amin Ahmad

NAACL 2025

Abstract

Summarization is one of the most common tasks performed by large language models (LLMs), especially in applications like Retrieval-Augmented Generation (RAG). However, existing evaluations of hallucinations in LLM-generated summaries and evaluations of hallucination detection models both suffer from a lack of diversity and recency in the LLMs and LLM families considered. This paper introduces FaithBench, a summarization hallucination benchmark comprising challenging hallucinations made by 10 modern LLMs from 8 different families, with ground truth annotations by human experts. “Challenging” here means summaries on which popular, state-of-the-art hallucination detection models, including GPT-4o-as-a-judge, disagreed. Our results show that GPT-4o and GPT-3.5-Turbo produce the fewest hallucinations. However, most state-of-the-art hallucination detection models achieve accuracy near 50% on FaithBench, indicating substantial room for future improvement.
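
The "challenging" selection criterion described in the abstract can be illustrated with a minimal sketch. It assumes each detector returns a binary verdict per summary. The function name `is_challenging`, the detector names, and the verdicts are hypothetical and are not taken from the paper.

```python
from typing import Mapping


def is_challenging(verdicts: Mapping[str, bool]) -> bool:
    """Return True when the detectors do not all agree.

    `verdicts` maps a detector name to its binary verdict
    (True = summary flagged as hallucinated). This interface is
    an assumption made for illustration only.
    """
    distinct = set(verdicts.values())
    return len(distinct) > 1


# Illustrative, made-up verdicts for two summaries.
summary_a = {"detector_1": True, "detector_2": False, "detector_3": True}
summary_b = {"detector_1": False, "detector_2": False, "detector_3": False}

print(is_challenging(summary_a))  # True: the detectors disagree
print(is_challenging(summary_b))  # False: the detectors agree
```

Under this reading, only summaries like `summary_a` would be candidates for human annotation. How the benchmark actually aggregates detector outputs is not specified in the abstract.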

Authors

Forrest Sheng Bao, Miaoran Li, Renyi Qu, Ge Luo, Erana Wan, Yujia Tang, Weisi Fan, Manveer Singh Tamber, Suleman Kazi, Vivek Sourabh, Mike Qi, Ruixuan Tu, Chenyu Xu, Matthew Gonzales, Ofer Mendelevitch, Amin Ahmad

Topics

Artificial Intelligence > Core AI > Interpretability
Natural Language Processing > Generation > Summarization
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

retrieval-augmented generation, hallucination detection, fact checking, large language model, summarization benchmark

Related papers

Few-shot Personalization of LLMs with Mis-aligned Responses (2025)

NLI under the Microscope: What Atomic Hypothesis Decomposition Reveals (2025)

Understanding Figurative Meaning through Explainable Visual Entailment (2025)

CogLM: Tracking Cognitive Development of Large Language Models (2025)
