Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers

Yuxia Wang; Revanth Gangi Reddy; Zain Muhammad Mujahid; Arnav Arora; Aleksandr Rubashevskii; Jiahui Geng; Osama Mohammed Afzal; Liangming Pan; Nadav Borenstein; Aditya Pillai; Isabelle Augenstein; Iryna Gurevych; Preslav Nakov

EMNLP 2024

Abstract

The increased use of large language models (LLMs) across a variety of real-world applications calls for mechanisms to verify the factual accuracy of their outputs. In this work, we present Factcheck-Bench, a holistic end-to-end framework for annotating and evaluating the factuality of LLM-generated responses, which encompasses a multi-stage annotation scheme designed to yield detailed labels for fact-checking and correcting not just the final prediction, but also the intermediate steps that a fact-checking system might need to take. Based on this framework, we construct an open-domain factuality benchmark at three levels of granularity: claim, sentence, and document. We further propose a system, Factcheck-GPT, which follows our framework, and we show that it outperforms several popular LLM fact-checkers. We make our annotation tool, annotated data, benchmark, and code available at https://github.com/yuxiaw/Factcheck-GPT.

Topics

Natural Language Processing > Applications > Fact-Checking
Natural Language Processing > Resources & Methods > Large Language Models
Machine Learning > Learning Types > Evaluation
Artificial Intelligence > Core AI > Natural Language Processing
Deep Learning > Learning Types > Evaluation

Keywords

benchmark evaluation, factual accuracy, claim verification, language model evaluation, evaluation benchmark, annotation scheme, fact checking, large language model
