Why We Need New Evaluation Metrics for NLG

Jekaterina Novikova; Ondřej Dušek; Amanda Cercas Curry; Verena Rieser

EMNLP 2017

Abstract

The majority of NLG evaluation relies on automatic metrics, such as BLEU. In this paper, we motivate the need for novel, system- and data-independent automatic evaluation methods: we investigate a wide range of metrics, including state-of-the-art word-based and novel grammar-based ones, and demonstrate that they only weakly reflect human judgements of system outputs as generated by data-driven, end-to-end NLG. We also show that metric performance is data- and system-specific. Nevertheless, our results also suggest that automatic metrics perform reliably at the system level and can support system development by finding cases where a system performs poorly.
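One way to quantify how well a metric "reflects human judgements" is a rank correlation between per-output metric scores and human ratings. The sketch below is a minimal illustration, not the authors' evaluation code: it assumes Spearman's rank correlation as the measure, and the names `average_ranks`, `pearson`, `spearman`, `metric_scores`, and `human_ratings` as well as all numbers are hypothetical toy values chosen only to make the example run.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one automatic metric score and one human rating
# per system output. These values are invented for illustration only.
metric_scores = [0.42, 0.35, 0.58, 0.61, 0.30, 0.47]
human_ratings = [4, 5, 3, 5, 2, 3]

rho = spearman(metric_scores, human_ratings)
print(f"Spearman rho between metric and human ratings: {rho:.3f}")
```

A value of rho near 0 would indicate that the metric's ranking of outputs says little about the human ranking, while values near 1 or -1 indicate strong agreement or disagreement. The same function can be applied to per-system averages instead of per-output scores to compare the two levels of analysis.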

Topics

Natural Language Processing > Generation > Text Generation

Keywords

natural language generation, BLEU score, evaluation metrics, automatic evaluation, grammar-based metrics
