Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Maxime Peyrard

2019 ACL ACL 2019

Studying Summarization Evaluation Metrics in the Appropriate Scoring Range

Abstract

AbstractIn summarization, automatic evaluation metrics are usually compared based on their ability to correlate with human judgments. Unfortunately, the few existing human judgment datasets have been created as by-products of the manual evaluations performed during the DUC/TAC shared tasks. However, modern systems are typically better than the best systems submitted at the time of these shared tasks. We show that, surprisingly, evaluation metrics which behave similarly on these datasets (average-scoring range) strongly disagree in the higher-scoring range in which current systems now operate. It is problematic because metrics disagree yet we can’t decide which one to trust. This is a call for collecting human judgments for high-scoring summaries as this would resolve the debate over which metrics to trust. This would also be greatly beneficial to further improve summarization systems and metrics alike.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Interdisciplinary and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — automatic evaluation metric

🐣 Hot Topic Early Bird — summarization evaluation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Maxime Peyrard

Topics

Artificial Intelligence > Core AI > Interpretability Machine Learning > Optimization & Theory > Theory Interdisciplinary > Linguistics > Computational Linguistics Natural Language Processing > Applications > Summarization Machine Learning > Optimization & Theory > Evaluation

Keywords

text summarization summarization evaluation human judgment human judgment correlation automatic evaluation metric correlation rouge score automatic evaluation metric scoring range analysis

Download PDF

Related papers

What do phone embeddings learn about Phonology? 2019

Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages 2019

Understanding Undesirable Word Embedding Associations 2019

Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text 2019

Domain Adaptation of Neural Machine Translation by Lexicon Induction 2019