FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation

Sewon Min; Kalpesh Krishna; Xinxi Lyu; Mike Lewis; Wen-tau Yih; Pang Koh; Mohit Iyyer; Luke Zettlemoyer; Hannaneh Hajishirzi

EMNLP 2023

Abstract

Evaluating the factuality of long-form text generated by large language models (LMs) is non-trivial because (1) generations often contain a mixture of supported and unsupported pieces of information, making binary judgments of quality inadequate, and (2) human evaluation is time-consuming and costly. In this paper, we introduce FACTSCORE, a new evaluation that breaks a generation into a series of atomic facts and computes the percentage of atomic facts supported by a reliable knowledge source. We conduct an extensive human evaluation to obtain FACTSCOREs of biographies of people generated by several state-of-the-art commercial LMs (InstructGPT, ChatGPT, and the retrieval-augmented PerplexityAI) and report new analysis demonstrating the need for such a fine-grained score (e.g., ChatGPT only achieves 58%). Since human evaluation is costly, we also introduce an automated model that estimates FACTSCORE using retrieval and a strong language model, with less than a 2% error rate. Finally, we use this automated metric to evaluate 6,500 generations from a new set of 13 recent LMs that would have cost $26K if evaluated by humans, with various findings: GPT-4 and ChatGPT are more factual than public models, and Vicuna and Alpaca are some of the best public models. FACTSCORE is available for public use via 'pip install factscore'.
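The metric described in the abstract (the percentage of atomic facts supported by a knowledge source) can be illustrated with a minimal sketch. This is not the API of the released factscore package: the function name fact_score, the is_supported callback, and the toy knowledge set are all hypothetical, and the sketch assumes the generation has already been split into atomic facts.

```python
from typing import Callable, Sequence


def fact_score(atomic_facts: Sequence[str],
               is_supported: Callable[[str], bool]) -> float:
    """Return the fraction of atomic facts judged supported.

    Hypothetical helper for illustration only; `is_supported` stands in
    for a human annotator or an automated retrieval-plus-LM judge.
    """
    if not atomic_facts:
        raise ValueError("need at least one atomic fact")
    supported = sum(1 for fact in atomic_facts if is_supported(fact))
    return supported / len(atomic_facts)


# Toy stand-in for a reliable knowledge source: a set of accepted statements
# about a made-up person (assumption, not data from the paper).
knowledge = {
    "Person X was born in 1950.",
    "Person X is a chemist.",
    "Person X lives in Oslo.",
}

# Atomic facts extracted from one hypothetical generated biography.
facts = [
    "Person X was born in 1950.",
    "Person X is a chemist.",
    "Person X lives in Oslo.",
    "Person X won a national award.",  # not in the knowledge source
]

score = fact_score(facts, lambda fact: fact in knowledge)
print(f"{score:.0%} of atomic facts are supported")
```

A binary judgment would have to label this whole biography as either correct or incorrect, whereas the fine-grained score records that three of its four atomic facts are supported.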

Topics

Natural Language Processing > Generation > Text Generation
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

language model evaluation, retrieval-augmented generation, long-form generation, atomic fact, factual precision
