Adversarial Examples for Evaluating Reading Comprehension Systems

Robin Jia; Percy Liang

EMNLP 2017

Abstract

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of 75% F1 score to 36%; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to 7%. We hope our insights will motivate the development of new models that understand language more precisely.
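The evaluation idea in the abstract can be illustrated with a minimal sketch: score a system on each paragraph as given, then score it again after a distracting sentence has been inserted, and compare the average F1. Everything below is an assumption made for illustration and is not the authors' code: the names `token_f1`, `add_distractor`, `average_f1`, and `toy_model` are hypothetical, the F1 is a simplified token-overlap score (lowercasing and whitespace splitting only), the distractor is hand-written and appended at the end of the paragraph rather than generated automatically, and the toy model is a deliberately shallow word-overlap heuristic standing in for a real reading comprehension system.

```python
from collections import Counter


def token_f1(prediction: str, gold: str) -> float:
    """Simplified token-overlap F1 (lowercase + whitespace split only)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def add_distractor(paragraph: str, distractor: str) -> str:
    """Append a distracting sentence (position is an assumption of this sketch)."""
    return paragraph.rstrip() + " " + distractor


def average_f1(model, examples, distractors=None) -> float:
    """Average F1 of `model` over (paragraph, question, gold) triples.

    If `distractors` is given, the i-th distractor is added to the i-th
    paragraph before the model sees it; the gold answer is left unchanged.
    """
    scores = []
    for i, (paragraph, question, gold) in enumerate(examples):
        if distractors is not None:
            paragraph = add_distractor(paragraph, distractors[i])
        scores.append(token_f1(model(paragraph, question), gold))
    return sum(scores) / len(scores)


def toy_model(paragraph: str, question: str) -> str:
    """Hypothetical shallow system: pick the sentence sharing the most words
    with the question and return its first word."""
    question_words = set(question.lower().rstrip("?").split())
    sentences = [s.strip() for s in paragraph.split(".") if s.strip()]
    best = max(sentences, key=lambda s: len(question_words & set(s.lower().split())))
    return best.split()[0]


# Hand-written toy data, not taken from SQuAD.
examples = [("Alice drafted the annual report last year.",
             "Who wrote the annual report?",
             "Alice")]
distractors = ["Bob wrote the annual weather report."]

print(average_f1(toy_model, examples))               # 1.0
print(average_f1(toy_model, examples, distractors))  # 0.0
```

The toy model is fooled because the distractor shares more words with the question than the sentence that actually answers it, which is the kind of surface matching the adversarial setting is meant to expose. The size of the drop here is an artifact of the toy setup and says nothing about the figures reported in the abstract.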

Authors

Robin Jia, Percy Liang

Topics

Machine Learning > Learning Types > Adversarial Learning
Natural Language Processing > Applications > Machine Reading Comprehension
Natural Language Processing > Applications > Question Answering
Machine Learning > Learning Types > Evaluation
Natural Language Processing > Applications > Reading Comprehension

Keywords

question answering, reading comprehension, natural language understanding, adversarial example, robustness testing, adversarial evaluation
