Understanding Bias in Large-Scale Visual Datasets

Boya Zeng; Yida Yin; Zhuang Liu

NeurIPS 2024

Abstract

A recent study has shown that large-scale visual datasets are very biased: they can be easily classified by modern neural networks. However, the concrete forms of bias among these datasets remain unclear. In this study, we propose a framework to identify the unique visual attributes distinguishing these datasets. Our approach applies various transformations to extract semantic, structural, boundary, color, and frequency information from datasets, and assesses how much each type of information reflects their bias. We further decompose their semantic bias with object-level analysis and leverage natural language methods to generate detailed, open-ended descriptions of each dataset's characteristics. Our work aims to help researchers understand the bias in existing large-scale pre-training datasets and build more diverse and representative ones in the future. Our project page and code are available at boyazeng.github.io/understand_bias.
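
To make the idea of an information-isolating transformation concrete, here is a minimal sketch of two such transformations, one for color and one for frequency. It is an illustration only and not the authors' implementation: the abstract does not specify the exact transformations, and the function names `mean_color` and `frequency_filter`, the ideal circular mask, and the `radius` parameter are all assumptions made for this example.

```python
import numpy as np


def mean_color(image):
    """Replace every pixel with the image's per-channel mean (keeps color statistics only).

    Hypothetical helper: `image` is assumed to be a float array of shape (H, W, C).
    """
    mean = image.mean(axis=(0, 1), keepdims=True)
    return np.broadcast_to(mean, image.shape).copy()


def frequency_filter(image, radius, keep="low"):
    """Keep only low or high spatial frequencies with an ideal circular FFT mask.

    Hypothetical helper: `radius` is measured in integer frequency bins, and the
    hard cutoff is an assumption chosen for simplicity.
    """
    h, w = image.shape[:2]
    # Integer frequency index of each FFT bin along each axis.
    fy = np.fft.fftfreq(h)[:, None] * h
    fx = np.fft.fftfreq(w)[None, :] * w
    dist = np.sqrt(fy ** 2 + fx ** 2)
    mask = dist <= radius if keep == "low" else dist > radius

    out = np.empty_like(image, dtype=np.float64)
    for c in range(image.shape[2]):
        spectrum = np.fft.fft2(image[:, :, c])
        # The mask is symmetric, so the inverse transform is real up to rounding.
        out[:, :, c] = np.fft.ifft2(spectrum * mask).real
    return out
```

In a setup like the one the abstract describes, one would apply a transformation of this kind to every image in each dataset and then measure how well a classifier can still tell the datasets apart; high accuracy on the transformed images would suggest that the retained information carries part of the bias.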

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Application Areas > Fairness
Deep Learning > Architectures > Neural Networks
Computer Vision > Analysis > Scene Understanding
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Deep Learning
Machine Learning > Learning Types > Fairness
Deep Learning > Learning Types > Representation Learning
Computer Vision > Analysis > Computer Vision

Keywords

computer vision, data augmentation, semantic analysis, bias detection, semantic information, dataset bias, visual dataset, structural information, pretrained model, large-scale dataset, dataset analysis, neural network, semantic bias, visual bias
