AAAI 2026

From Chaos to Cure: A Prefix Heuristics Guided Model-Agnostic Adaptive Detoxification Framework

Abstract

The impressive performance of large language models (LLMs) comes with inherent toxicity risks, prompting the need for effective detoxification to support responsible deployment. Prevailing methods are generally inflexible and model-specific, addressing only individual models or model families. Moreover, overlooking the toxic risks latent in the input prefix can let toxicity accumulate during autoregressive generation. Existing methods rely on strong external attribute interventions to address this issue, which further exacerbates contextual semantic inconsistencies and makes it difficult to balance detoxification efficacy and generation quality. To address these concerns, we propose a novel Model-Agnostic Adaptive Detoxification (MAAD) framework. To counter accumulating toxicity, we present prefix heuristics that serve as contextual signals, guiding the base LLM toward safer generation. Along this line, we construct an antidote dataset to support a lightweight model, Detoxifier, which steers the base LLM to make in-scope and reliable detoxifying distribution adjustments while preserving fluency and contextual understanding. Designed as an easy-to-deploy module, Detoxifier requires only a small amount of data and can be seamlessly applied to various base LLMs with one-off training. Since over-purifying often reduces diversity, we also propose a dynamic truncation method called CW-cutoff sampling to trade off generation quality against diversity. Extensive experiments demonstrate that MAAD strikes a better balance between detoxification effectiveness and generation quality, while also maintaining model utility.
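To make the two decoding-time ideas in the abstract concrete, the following is a minimal sketch under stated assumptions, not the MAAD implementation. The abstract does not specify how Detoxifier adjusts the base distribution or how CW-cutoff chooses its truncation point, so the sketch swaps in two generic stand-ins: an additive logit adjustment (the hypothetical `steer` function and `adjust_logits` values) and standard nucleus (top-p) truncation in place of CW-cutoff. The toy vocabulary and all numbers are illustrative only.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(base_logits, adjust_logits, alpha=1.0):
    """Add a scaled adjustment to the base model's logits.

    ASSUMPTION: `adjust_logits` stands in for the output of a lightweight
    steering model; the paper's actual adjustment rule is not given here.
    """
    return [b + alpha * a for b, a in zip(base_logits, adjust_logits)]


def top_p_truncate(probs, p=0.9):
    """Nucleus truncation: keep the smallest set of highest-probability
    tokens whose cumulative mass reaches p, then renormalize.

    ASSUMPTION: this is generic top-p, used as a stand-in for CW-cutoff.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}


# Toy next-token distribution over a four-word vocabulary (illustrative).
vocab = ["kind", "nice", "rude", "vile"]
base_logits = [1.0, 0.5, 2.0, 1.5]       # base model prefers "rude"
adjust_logits = [0.0, 0.0, -3.0, -3.0]   # hypothetical detoxifying shift

base_probs = softmax(base_logits)
steered_probs = softmax(steer(base_logits, adjust_logits))
kept = top_p_truncate(steered_probs, p=0.85)
kept_words = sorted(vocab[i] for i in kept)
```

Because the adjustment operates only on next-token logits, the same steering function can sit in front of any base model that exposes its output distribution, which is the sense in which such a module is model-agnostic. The truncation step then decides how much of the adjusted tail remains available for sampling.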
