Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Sérgio Jesus; José Pombal; Duarte Alves; André Cruz; Pedro Saleiro; Rita Ribeiro; João Gama; Pedro Bizarro

2022 NIPS NeurIPS 2022

Turning the Tables: Biased, Imbalanced, Dynamic Tabular Datasets for ML Evaluation

Abstract

Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase of publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data — which is prevalent in many high-stakes domains — has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available 1 privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized,real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress test both performance and fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed to evaluate novel and existing methods.

🌉 Interdisciplinary Bridge — Data Science & Analytics and Machine Learning

🐣 Hot Topic Early Bird — fairness evaluation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sérgio Jesus , José Pombal , Duarte Alves , André Cruz , Pedro Saleiro , Rita Ribeiro , João Gama , Pedro Bizarro

Topics

Machine Learning > Core Methods > Classification Machine Learning > Application Areas > Fairness Machine Learning > Learning Types > Fairness Data Science & Analytics > Applications > Risk Management Machine Learning > Application Areas > Classification

Keywords

class imbalance fraud detection benchmark dataset tabular datum fairness evaluation data bia tabular dataset

Download PDF

Related papers

Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching 2022

A Theoretical View on Sparsely Activated Networks 2022

Prune and distill: similar reformatting of image information along rat visual cortex and deep neural networks 2022

Matryoshka Representation Learning 2022

Off-Policy Evaluation with Deficient Support Using Side Information 2022