Detection of Abusive Language: the Problem of Biased Datasets

Michael Wiegand; Josef Ruppenhofer; Thomas Kleinbauer

NAACL 2019

Abstract

We discuss the impact of data bias on abusive language detection. We show that classification scores on popular datasets reported in previous work are much lower under realistic settings in which this bias is reduced. Such biases are most notably observed on datasets that are created by focused sampling instead of random sampling. Datasets with a higher proportion of implicit abuse are more affected than datasets with a lower proportion.
