LLMs to Replace Crowdsourcing For Parallel Data Creation? The Case of Text Detoxification

Daniil Moskovskiy; Sergey Pletenev; Alexander Panchenko

EMNLP 2024

Abstract

The lack of high-quality training data remains a significant challenge in NLP. Manual annotation methods, such as crowdsourcing, are costly, require intricate task design skills, and, if used incorrectly, may result in poor data quality. On the other hand, LLMs have demonstrated proficiency in many NLP tasks, including zero-shot and few-shot data annotation. However, they often struggle with text detoxification due to alignment constraints and fail to generate the required detoxified text. This work explores the potential of modern open-source LLMs to annotate parallel data for text detoxification. Using the recent technique of activation patching, we generate a pseudo-parallel detoxification dataset based on ParaDetox. The detoxification model trained on our generated data performs comparably to one trained on the original dataset on automatic detoxification evaluation metrics and achieves superior quality in manual evaluation and side-by-side comparisons.

Topics

Machine Learning > Application Areas > Data Augmentation
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Text Generation
Artificial Intelligence > Core AI > Natural Language Processing
Natural Language Processing > Applications > Text Processing

Keywords

data augmentation, text generation, data annotation, language model, parallel data, text detoxification, activation patching, parallel data creation, language model annotation
