Benchmark Data Repositories for Better Benchmarking

Rachel Longjohn; Markelle Kelly; Sameer Singh; Padhraic Smyth

NeurIPS 2024

Abstract

In machine learning research, it is common to evaluate algorithms via their performance on standard benchmark datasets. While a growing body of work establishes guidelines for, and levies criticisms at, data and benchmarking practices in machine learning, comparatively less attention has been paid to the data repositories where these datasets are stored, documented, and shared. In this paper, we analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking. This role includes addressing issues with both datasets themselves (e.g., representational harms, construct validity) and the manner in which evaluation is carried out using such datasets (e.g., overemphasis on a few datasets and metrics, lack of reproducibility). To this end, we identify and discuss a set of considerations surrounding the design and use of benchmark data repositories, with a focus on improving benchmarking practices in machine learning.

Topics

Machine Learning > Application Areas > Fairness; Computer Science > Applications > Information Retrieval; Data Science & Analytics > Applications > Information Retrieval; Machine Learning > Optimization & Theory > Evaluation; Machine Learning > Learning Types > Evaluation; Machine Learning > Application Areas > Evaluation; Data Science & Analytics > Applications > Data Mining

Keywords

evaluation methodology; machine learning; benchmark dataset; data repository; representational harm; evaluation practice; construct validity
