Intrinsic Self-Supervision for Data Quality Audits

Fabian Gröger; Simone Lionetti; Philippe Gottfrois; Alvaro Gonzalez-Jimenez; Ludovic Amruthalingam; Labelling Consortium; Matthew Groh; Alexander A. Navarini; Marc Pouly

NeurIPS 2024

Abstract

Benchmark datasets in computer vision often contain off-topic images, near duplicates, and label errors, leading to inaccurate estimates of model performance. In this paper, we revisit the task of data cleaning and formalize it as either a ranking problem, which significantly reduces human inspection effort, or a scoring problem, which allows for automated decisions based on score distributions. We find that a specific combination of context-aware self-supervised representation learning and distance-based indicators is effective in finding issues without annotation biases. This methodology, which we call SelfClean, surpasses state-of-the-art performance in detecting off-topic images, near duplicates, and label errors within widely-used image datasets, such as ImageNet-1k, Food-101N, and STL-10, both for synthetic issues and real contamination. We apply the detailed method to multiple image benchmarks, identify up to 16% of issues, and confirm an improvement in evaluation reliability upon cleaning. The official implementation can be found at: https://github.com/Digital-Dermatology/SelfClean.
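
To make the ranking formulation concrete, the following is a minimal sketch of distance-based indicators computed on precomputed embeddings. It is not the official SelfClean implementation: the function names, the cosine distance, the k-th nearest-neighbour isolation score, and the intra/extra-class distance ratio are illustrative assumptions, and the toy 2-D vectors stand in for representations that would come from a self-supervised encoder. The label-error ranking additionally assumes at least two classes with at least two samples each.

```python
import numpy as np

def cosine_distance_matrix(emb):
    """Pairwise cosine distances between the rows of `emb`."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(1.0 - unit @ unit.T, 0.0, 2.0)

def rank_near_duplicates(dist):
    """Index pairs (i, j) with i < j, closest pairs first."""
    i, j = np.triu_indices(dist.shape[0], k=1)
    order = np.argsort(dist[i, j], kind="stable")
    return list(zip(i[order].tolist(), j[order].tolist()))

def rank_off_topic(dist, k=1):
    """Sample indices, most isolated first (distance to k-th neighbour)."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)  # ignore self-distances
    kth = np.sort(d, axis=1)[:, k - 1]
    return np.argsort(-kth, kind="stable").tolist()

def rank_label_errors(dist, labels):
    """Sample indices, most suspicious label first.

    Hypothetical score: distance to the nearest sample with a different
    label, relative to the sum of that distance and the distance to the
    nearest sample with the same label. Low values are suspicious.
    """
    labels = np.asarray(labels)
    d = dist.copy()
    np.fill_diagonal(d, np.inf)
    same = labels[:, None] == labels[None, :]
    nearest_same = np.where(same, d, np.inf).min(axis=1)
    nearest_diff = np.where(~same, d, np.inf).min(axis=1)
    score = nearest_diff / (nearest_same + nearest_diff + 1e-12)
    return np.argsort(score, kind="stable").tolist()

# Toy data: samples 0 and 1 are almost identical, sample 2 sits near the
# class-0 cluster but carries label 1, and sample 5 is far from everything.
emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.9, 0.3],
                [0.0, 1.0], [0.1, 1.0], [-1.0, -1.0]])
labels = [0, 0, 1, 1, 1, 0]

dist = cosine_distance_matrix(emb)
duplicate_ranking = rank_near_duplicates(dist)
off_topic_ranking = rank_off_topic(dist)
label_error_ranking = rank_label_errors(dist, labels)
```

A human auditor would then inspect each ranking from the top and stop once candidates no longer look problematic, which is where the reduction in inspection effort comes from. Thresholding the underlying scores instead of inspecting ranks would correspond to the scoring formulation.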

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Unsupervised Learning
Machine Learning > Application Areas > Data Augmentation
Computer Vision > Analysis > Anomaly Detection
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Data Augmentation
Machine Learning > Core Methods > Anomaly Detection

Keywords

representation learning; anomaly detection; self-supervised learning; benchmark dataset; data quality; data cleaning; label error detection; near duplicate detection
