SummEdits: Measuring LLM Ability at Factual Reasoning Through The Lens of Summarization

Philippe Laban; Wojciech Kryscinski; Divyansh Agarwal; Alexander Fabbri; Caiming Xiong; Shafiq Joty; Chien-Sheng Wu

EMNLP 2023

Abstract

With the recent appearance of LLMs in practical settings, having methods that can effectively detect factual inconsistencies is crucial to reduce the propagation of misinformation and improve trust in model outputs. When testing on existing factual consistency benchmarks, we find that a few large language models (LLMs) perform competitively on classification benchmarks for factual inconsistency detection compared to traditional non-LLM methods. However, a closer analysis reveals issues with existing evaluation benchmarks, affecting evaluation precision. To address this, we propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. This new benchmark is 20 times more cost-effective per sample than previous benchmarks and highly reproducible, as we estimate inter-annotator agreement at about 0.9. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance, highlighting the gaps in LLMs’ ability to reason about facts and detect inconsistencies when they occur.
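To make "close to random chance" concrete, the following is a minimal sketch of how a binary inconsistency detector might be scored. It assumes each sample is a (document, summary, label) triple and that the metric is balanced accuracy; the abstract names neither the data format nor the metric, so both are assumptions here. The names `Sample`, `evaluate`, `always_consistent`, and `TOY_SAMPLES` are hypothetical, and the toy data is made up for illustration, not taken from SummEdits.

```python
from typing import Callable, List, Tuple

# Assumed sample layout: (document, summary, is_consistent).
Sample = Tuple[str, str, bool]


def balanced_accuracy(labels: List[bool], preds: List[bool]) -> float:
    """Average of recall on the consistent and the inconsistent class."""
    pos = [p for l, p in zip(labels, preds) if l]
    neg = [p for l, p in zip(labels, preds) if not l]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    recall_consistent = sum(1 for p in pos if p) / len(pos)
    recall_inconsistent = sum(1 for p in neg if not p) / len(neg)
    return (recall_consistent + recall_inconsistent) / 2


def evaluate(detector: Callable[[str, str], bool], samples: List[Sample]) -> float:
    """Run a detector over all samples and score it with balanced accuracy."""
    labels = [label for _, _, label in samples]
    preds = [detector(doc, summary) for doc, summary, _ in samples]
    return balanced_accuracy(labels, preds)


# Made-up toy data: each inconsistent summary is a small edit of a consistent one.
TOY_SAMPLES: List[Sample] = [
    ("The company hired 40 engineers in 2022.", "The company hired 40 engineers.", True),
    ("The company hired 40 engineers in 2022.", "The company hired 400 engineers.", False),
    ("The bridge opened in May.", "The bridge opened in May.", True),
    ("The bridge opened in May.", "The bridge closed in May.", False),
]


def always_consistent(document: str, summary: str) -> bool:
    """Trivial baseline that never flags an inconsistency."""
    return True


if __name__ == "__main__":
    print(evaluate(always_consistent, TOY_SAMPLES))  # 0.5
```

Under this metric a detector that always gives the same answer scores 0.5 regardless of class balance, which is the sense in which 0.5 corresponds to random chance in this sketch.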

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Natural Language Processing > Applications > Machine Reading Comprehension

Keywords

few-shot learning, summarization evaluation, factual inconsistency, factual reasoning, benchmark creation
