Delving into Qualitative Implications of Synthetic Data for Hate Speech Detection

Camilla Casula; Sebastiano Vecellio Salto; Alan Ramponi; Sara Tonelli

EMNLP 2024

Abstract

The use of synthetic data for training models for a variety of NLP tasks is now widespread. However, previous work reports mixed results with regard to its effectiveness on highly subjective tasks such as hate speech detection. In this paper, we present an in-depth qualitative analysis of the potential and specific pitfalls of synthetic data for hate speech detection in English, with 3,500 manually annotated examples. We show that, across different models, synthetic data created through paraphrasing gold texts can improve out-of-distribution robustness from a computational standpoint. However, this comes at a cost: synthetic data fails to reliably reflect the characteristics of real-world data on a number of linguistic dimensions, it results in drastically different class distributions, and it heavily reduces the representation of both specific identity groups and intersectional hate.

Authors

Camilla Casula, Sebastiano Vecellio Salto, Alan Ramponi, Sara Tonelli

Topics

Artificial Intelligence > Core AI > Interpretability

Machine Learning > Application Areas > Domain Generalization

Machine Learning > Application Areas > Fairness

Deep Learning > Learning Types > Data Augmentation

Artificial Intelligence > Core AI > Natural Language Processing

Deep Learning > Learning Types > Robustness

Keywords

data augmentation, synthetic data, out-of-distribution robustness, class distribution, hate speech detection, qualitative analysis
