2025 ACL ACL 2025

Are Bias Evaluation Methods Biased ?

Abstract

AbstractThe creation of benchmarksto evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared for different aspects of safety such as toxicity, bias, harmful behavior etc. Independent benchmarks adopt different approacheswith distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approachesto rank a set of representative models for bias andcompare how similar are the overall rankings. We show that different but widely used bias evaluations methods result in disparate model rankings. We conclude with recommendations for the community in the usage of such benchmarks.

The Questioner
🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio