In Data We Trust: A Critical Analysis of Hate Speech Detection Datasets

Kosisochukwu Madukwe; Xiaoying Gao; Bing Xue

EMNLP 2020

Abstract

Recently, a few studies have discussed, from different viewpoints, the limitations of datasets collected for the task of detecting hate speech. We intend to contribute to the conversation by providing a consolidated overview of the data-related issues that debilitate research in this area. Specifically, we discuss how varying pre-processing steps and varying formats for making data publicly available result in highly varying datasets, which makes an objective comparison between studies difficult and unfair. To the best of our knowledge, there is currently no study focused on comparing the attributes of existing datasets for hate speech detection, outlining their limitations, and recommending approaches for future research. This work intends to fill that gap and become the one-stop shop for information regarding hate speech datasets.

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Application Areas > Fairness

Keywords

natural language processing, text classification, data quality, hate speech detection, dataset analysis
