Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation

Bo Pang; Erik Nijkamp; Wenjuan Han; Linqi Zhou; Yixian Liu; Kewei Tu

ACL 2020

Abstract

Open-domain dialogue generation has gained increasing attention in Natural Language Processing. Evaluating it requires a holistic approach. Human ratings are regarded as the gold standard, but because human evaluation is inefficient and costly, an automated substitute is highly desirable. In this paper, we propose holistic evaluation metrics that capture different aspects of open-domain dialogues. Our metrics consist of (1) GPT-2 based context coherence between sentences in a dialogue, (2) GPT-2 based fluency in phrasing, (3) n-gram based diversity in responses to augmented queries, and (4) textual-entailment-inference based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open source the code and relevant materials.
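
As an illustration of what an n-gram based diversity measure in item (3) might look like, the following minimal sketch computes the Shannon entropy of the n-gram distribution pooled over a set of responses. The abstract does not give the exact formula, so the choice of entropy, the whitespace tokenization, and the function name `ngram_entropy` are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def ngram_entropy(responses, n=2):
    """Shannon entropy (in bits) of the n-gram distribution pooled over responses.

    Hypothetical helper: whitespace tokenization and lowercasing are
    simplifying assumptions, not taken from the paper.
    """
    counts = Counter()
    for text in responses:
        tokens = text.lower().split()
        # Collect every contiguous window of n tokens in this response.
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    total = sum(counts.values())
    if total == 0:
        # No n-grams at all (empty input or responses shorter than n tokens).
        return 0.0

    return -sum((c / total) * math.log2(c / total) for c in counts.values())


if __name__ == "__main__":
    repetitive = ["i do not know", "i do not know", "i do not know"]
    varied = ["i love hiking on weekends", "my cat sleeps all day", "pizza sounds great tonight"]
    print("repetitive:", ngram_entropy(repetitive, n=2))
    print("varied:    ", ngram_entropy(varied, n=2))
```

Under this sketch, a system that returns the same reply to every paraphrased query scores lower than one whose replies vary, which is the behaviour a diversity metric is meant to reward.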
