MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics

Anthony Chen; Gabriel Stanovsky; Sameer Singh; Matt Gardner

EMNLP 2020

Abstract

Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.
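The abstract reports two kinds of numbers: Pearson correlation between a metric's scores and human judgement scores, and accuracy on minimal pairs. The sketch below illustrates how such figures could be computed for any candidate metric. It is not the authors' evaluation code. The function names, the toy scores, and the rule that a pair counts as correct only when the better answer receives a strictly higher score are all assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def minimal_pair_accuracy(pairs):
    """Fraction of pairs where the better answer gets the higher score.

    Each pair is (score_of_better_answer, score_of_worse_answer).
    Ties count as errors (an assumption of this sketch).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for better, worse in pairs if better > worse)
    return correct / len(pairs)


# Hypothetical toy data: human judgement scores and a metric's predictions.
human_scores = [1.0, 2.0, 4.0, 5.0, 3.0]
metric_scores = [1.5, 2.5, 3.5, 4.5, 3.0]
print("Pearson:", pearson(human_scores, metric_scores))

# Hypothetical metric scores on minimal pairs (better answer first).
pairs = [(0.9, 0.2), (0.4, 0.6), (0.7, 0.1), (0.5, 0.5)]
print("Minimal-pair accuracy:", minimal_pair_accuracy(pairs))
```

If "Pearson points" is read as the correlation multiplied by 100 (an assumption here), a gain of 10 points corresponds to a rise of 0.10 in the value returned by `pearson`.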

Topics

Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Machine Reading Comprehension
Natural Language Processing > Applications > Question Answering
Natural Language Processing > Resources & Methods > Natural Language Inference
Deep Learning > Learning Types > Evaluation

Keywords

question answering, generative evaluation, reading comprehension, machine reading comprehension, Pearson correlation, evaluation benchmark, evaluation metric, human annotation, generative metric, human judgement score, learned metrics
