A Large-scale Comprehensive Abusiveness Detection Dataset with Multifaceted Labels from Reddit

Hoyun Song; Soo Hyun Ryu; Huije Lee; Jong Park

CoNLL 2021

Abstract

As users in online communities suffer from severe side effects of abusive language, many researchers have attempted to detect abusive texts from social media, presenting several datasets for such detection. However, none of these datasets contains both comprehensive labels and contextual information, which are essential for thoroughly detecting all kinds of abusiveness in texts, since datasets with such fine-grained features demand a significant amount of annotation, leading to greatly increased complexity. In this paper, we propose a Comprehensive Abusiveness Detection Dataset (CADD), collected from English Reddit posts, with multifaceted labels and contexts. Our dataset is annotated hierarchically to allow efficient annotation through crowdsourcing on a large scale. We also empirically explore the characteristics of our dataset and provide a detailed analysis for novel insights. The results of our experiments with strong pre-trained natural language understanding models show that our dataset gives rise to meaningful performance, supporting its practicality for abusive language detection.

Authors

Hoyun Song, Soo Hyun Ryu, Huije Lee, Jong Park

Topics

Machine Learning > Core Methods > Classification
Natural Language Processing > Applications > Text Classification

Keywords

dataset creation, multi-label classification, natural language understanding, abusiveness detection, hierarchical labeling
