Detoxifying Large Language Models via the Diversity of Toxic Samples

Ying Zhao; Yuanzhao Guo; Xuemeng Weng; Yuan Tian; Wei Wang; Yi Chang

2025 EMNLP EMNLP 2025

Detoxifying Large Language Models via the Diversity of Toxic Samples

Abstract

AbstractEliminating toxicity from Large Language Models (LLMs) is crucial for ensuring user safety. However, current methods have limitations in the analysis and utilization of toxic samples, failing to fully harness their potential. Through comparative analysis of toxic and safe samples, we discover that toxic samples exhibit diversity and, within this diversity, there lies specificity. These findings suggest that leveraging these characteristics of toxic samples could enhance the performance of algorithms in detoxifying LLMs. To this end, we propose a novel diverse detoxification framework, DivDetox, which comprises two innovative components: a Multi-Category-Induced Personalized Sample Generation (MPSG) strategy and a Scaled Contrastive DPO (SC-DPO) approach. The former is designed to elicit a variety of personalized toxic responses from the LLM, while the latter is constructed to precisely and fully utilize these toxic responses. Experiments on benchmark datasets across different model scales and different detoxification tasks verify the effectiveness of our architecture.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — llm detoxification

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Ying Zhao , Yuanzhao Guo , Xuemeng Weng , Yuan Tian , Wei Wang , Yi Chang

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Responsible AI Machine Learning > Learning Types > Contrastive Learning Natural Language Processing > Resources & Methods > Large Language Models Machine Learning > Learning Types > Reinforcement Learning Artificial Intelligence > Core AI > Large Language Models Deep Learning > Techniques > Contrastive Learning

Keywords

contrastive learning direct preference optimization diversity sampling toxicity reduction model detoxification sample diversity large language model llm detoxification toxic sample diversity user safety

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025