STORYSUMM: Evaluating Faithfulness in Story Summarization

Melanie Subbiah; Faisal Ladhak; Akankshya Mishra; Griffin Thomas Adams; Lydia Chilton; Kathleen McKeown

EMNLP 2024

Abstract

Human evaluation has been the gold standard for checking faithfulness in abstractive summarization. However, with a challenging source domain like narrative, multiple annotators can agree a summary is faithful, while missing details that are obvious errors only once pointed out. We therefore introduce a new dataset, StorySumm, comprising LLM summaries of short stories with localized faithfulness labels and error explanations. This benchmark is for evaluation methods, testing whether a given method can detect challenging inconsistencies. Using this dataset, we first show that any one human annotation protocol is likely to miss inconsistencies, and we advocate for pursuing a range of methods when establishing ground truth for a summarization dataset. We finally test recent automatic metrics and find that none of them achieve more than 70% balanced accuracy on this task, demonstrating that it is a challenging benchmark for future work in faithfulness evaluation.
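
For readers unfamiliar with the metric named above, balanced accuracy is the mean of per-class recall, so a method that labels every summary as faithful scores only 50% regardless of class imbalance. The following is a minimal sketch, assuming binary summary-level labels (1 = faithful, 0 = unfaithful); the function name and the toy labels are hypothetical and are not taken from StorySumm or the paper's code.

```python
from typing import Sequence


def balanced_accuracy(gold: Sequence[int], pred: Sequence[int]) -> float:
    """Mean of recall on the unfaithful (0) and faithful (1) classes."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    recalls = []
    for label in (0, 1):
        # Indices of summaries whose gold label is this class.
        idx = [i for i, g in enumerate(gold) if g == label]
        if not idx:
            raise ValueError(f"no gold examples with label {label}")
        correct = sum(1 for i in idx if pred[i] == label)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Toy labels invented for illustration only (assumption, not real data).
gold = [1, 1, 1, 1, 0, 0]
pred = [1, 1, 1, 1, 1, 0]  # one unfaithful summary is missed
score = balanced_accuracy(gold, pred)

# A detector that calls every summary faithful.
always_faithful = balanced_accuracy(gold, [1] * len(gold))
```

In this toy case plain accuracy would be 5/6, while balanced accuracy is lower because half of the unfaithful summaries are missed, which is why the metric suits a benchmark focused on detecting inconsistencies.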

Topics

Artificial Intelligence > Core AI > Interpretability; Machine Learning > Optimization & Theory > Theory; Natural Language Processing > Applications > Summarization; Machine Learning > Learning Types > Evaluation

Keywords

benchmark evaluation; natural language generation; benchmark dataset; language model; human evaluation; faithfulness evaluation; abstractive summarization; automatic evaluation metric; story summarization; human annotation protocol
