Item Response Theory for Efficient Human Evaluation of Chatbots

João Sedoc; Lyle Ungar

EMNLP 2020

Abstract

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
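
To make the idea concrete, here is a minimal sketch of how an IRT model can rank evaluation prompts by how informative they are for systems of a given quality. It assumes the standard two-parameter logistic (2PL) model from educational testing; the abstract does not state which IRT variant the paper fits, and the function names, prompt names, and parameter values below are hypothetical illustrations, not the authors' code or data.

```python
import math

def p_win(theta, a, b):
    """2PL item response function (assumed model): probability that a system
    with ability theta is preferred on a prompt with discrimination a and
    difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_win(theta, a, b)
    return a * a * p * (1.0 - p)

def select_prompts(prompts, theta, k):
    """Return the names of the k prompts most informative at ability theta."""
    ranked = sorted(
        prompts,
        key=lambda name: item_information(theta, *prompts[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical prompt parameters: name -> (discrimination a, difficulty b).
prompts = {
    "easy": (1.5, -2.0),
    "medium": (1.5, 0.0),
    "hard": (1.5, 2.0),
    "flat": (0.2, 0.0),  # low discrimination: uninformative at any ability
}

print(select_prompts(prompts, theta=2.0, k=1))   # strong system -> ['hard']
print(select_prompts(prompts, theta=-2.0, k=1))  # weak system -> ['easy']
```

Under these assumptions, a prompt is most informative when its difficulty is close to the ability of the system being assessed, which is one way to read the claim that different examples suit high-quality and low-quality systems. Dropping low-information prompts such as "flat" is the kind of reduction in annotation effort the abstract describes.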

Authors

João Sedoc, Lyle Ungar

Topics

Computer Science > Applications > Software Engineering

Machine Learning > Optimization & Theory > Evaluation

Artificial Intelligence > Core AI > Dialogue Systems

Keywords

item response theory, statistical evaluation, human evaluation, paired comparison, conversational system, chatbot evaluation
