Re-evaluating Evaluation in Text Summarization

Manik Bhandari; Pranav Narayan Gour; Atabak Ashfaq; Pengfei Liu; Graham Neubig

2020 EMNLP EMNLP 2020

Re-evaluating Evaluation in Text Summarization

Abstract

AbstractAutomated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems. We release a dataset of human judgments that are collected from 25 top-scoring neural summarization systems (14 abstractive and 11 extractive).

🐣 Hot Topic Early Bird — automated evaluation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Manik Bhandari , Pranav Narayan Gour , Atabak Ashfaq , Pengfei Liu , Graham Neubig

Topics

Natural Language Processing > Generation > Summarization Natural Language Processing > Resources & Methods > Natural Language Inference

Keywords

text summarization automated evaluation human evaluation rouge metric system-level evaluation

Download PDF

Related papers

Fast semantic parsing with well-typedness guarantees 2020

Detecting Objectifying Language in Online Professor Reviews 2020

Analogous Process Structure Induction for Sub-event Sequence Prediction 2020

Aspect Sentiment Classification with Aspect-Specific Opinion Spans 2020

Robust and Interpretable Grounding of Spatial References with Relation Networks 2020