Dynabench: Rethinking Benchmarking in NLP

Douwe Kiela; Max Bartolo; Yixin Nie; Divyansh Kaushik; Atticus Geiger; Zhengxuan Wu; Bertie Vidgen; Grusha Prasad; Amanpreet Singh; Pratik Ringshia; Zhiyi Ma; Tristan Thrush; Sebastian Riedel; Zeerak Waseem; Pontus Stenetorp; Robin Jia; Mohit Bansal; Christopher Potts; Adina Williams

NAACL 2021

Abstract

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.
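The collection criterion described in the abstract (an example should be misclassified by the target model but not by another person) can be illustrated with a minimal sketch. This is not the Dynabench implementation: the names `Example`, `fools_model`, `validated_by_humans`, and `keep_example` are hypothetical, the keyword classifier merely stands in for a real target model, and the strict-majority rule for human validation is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """A candidate example written by an annotator (hypothetical structure)."""
    text: str
    annotator_label: str  # the label the annotator considers correct


def fools_model(example: Example, predict: Callable[[str], str]) -> bool:
    """True if the target model's prediction differs from the annotator's label."""
    return predict(example.text) != example.annotator_label


def validated_by_humans(example: Example, validator_labels: List[str]) -> bool:
    """True if a strict majority of validators agree with the annotator.

    The majority rule is an assumption for this sketch, not a rule
    taken from the paper.
    """
    agree = sum(1 for label in validator_labels if label == example.annotator_label)
    return agree * 2 > len(validator_labels)


def keep_example(
    example: Example,
    predict: Callable[[str], str],
    validator_labels: List[str],
) -> bool:
    """Keep an example only if it fools the model and other people agree with its label."""
    return fools_model(example, predict) and validated_by_humans(example, validator_labels)


def toy_sentiment_model(text: str) -> str:
    """Stand-in target model: a naive keyword classifier that ignores negation."""
    return "positive" if "good" in text.lower() else "negative"
```

Under these assumptions, "This is not good at all" labeled negative would be kept when validators agree with the label, because the keyword model predicts positive. An example that the model already classifies correctly, or one whose label the validators reject, would be discarded.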

Topics

Artificial Intelligence > Learning Paradigms > Few-Shot Learning

Machine Learning > Learning Types > Active Learning

Natural Language Processing > Applications > Text Classification

Keywords

active learning; few-shot learning; dataset creation; model assessment; dynamic benchmarking
