AustroTox: A Dataset for Target-Based Austrian German Offensive Language Detection

Pia Pachinger; Janis Goldzycher; Anna Planitzer; Wojciech Kusa; Allan Hanbury; Julia Neidhardt

ACL 2024

Abstract

Model interpretability in toxicity detection benefits greatly from token-level annotations. However, such annotations are currently available only in English. We introduce a dataset of 4,562 user comments annotated for offensive language detection, sourced from a news forum and notable for its inclusion of Austrian German dialect. In addition to binary offensiveness classification, we identify spans within each comment that constitute vulgar language or represent targets of offensive statements. We evaluate fine-tuned Transformer models as well as large language models in zero- and few-shot settings. The results indicate that while fine-tuned models excel at detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox.
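To make the two annotation levels concrete, the following is a minimal sketch of how one comment with a binary label and character-level spans might be represented. The class names, field names, label strings ("vulgarity", "target"), and the example sentence are all assumptions made for illustration; they are not the dataset's actual schema or content.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    start: int   # inclusive character offset into the comment text
    end: int     # exclusive character offset
    label: str   # hypothetical label names: "vulgarity" or "target"


@dataclass
class AnnotatedComment:
    text: str
    offensive: bool  # binary comment-level label
    spans: List[Span] = field(default_factory=list)


def mark(text: str, phrase: str, label: str) -> Span:
    """Build a Span covering the first occurrence of phrase in text."""
    start = text.find(phrase)
    if start < 0:
        raise ValueError(f"{phrase!r} not found in comment")
    return Span(start, start + len(phrase), label)


def span_texts(comment: AnnotatedComment, label: str) -> List[str]:
    """Return the surface strings of all spans carrying the given label."""
    return [comment.text[s.start:s.end] for s in comment.spans if s.label == label]


# Invented English example, not taken from the dataset.
text = "Those politicians are damn useless"
example = AnnotatedComment(
    text=text,
    offensive=True,
    spans=[
        mark(text, "Those politicians", "target"),
        mark(text, "damn", "vulgarity"),
    ],
)

print(span_texts(example, "target"))
print(span_texts(example, "vulgarity"))
```

Keeping spans as character offsets rather than copied strings lets the same record serve both the comment-level classification task and token-level evaluation, assuming offsets are later aligned to a tokenizer.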

Topics

Natural Language Processing > Applications > Text Classification
Natural Language Processing > Resources & Methods > Large Language Models
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning
Deep Learning > Learning Types > Zero-Shot Learning
Machine Learning > Learning Types > Multi-Lingual Learning

Keywords

zero-shot learning, few-shot learning, offensive language detection, token-level annotation, large language model, transformer model, Austrian German dialect
