A Robust Unlearning Method with Adaptive Knowledge Guidance and Memory Preservation

Jingyuan Tian; Xiaofei Zhou

AAAI 2026

Abstract

Machine unlearning has emerged as a promising approach to removing specific knowledge from large language models (LLMs), especially in safety-critical applications. However, existing representation-based methods such as RMU lack guidance for selecting which representation locations to unlearn and are therefore imprecise, while probability-based methods are vulnerable to fine-tuning attacks, in which an attacker fine-tunes the unlearned model on unrelated, safe data. To address these problems, this paper presents ALMPU (Adaptive Localized Memory Perturbation Unlearning), which combines adaptive knowledge guidance with a memory perturbation mechanism to supply the missing knowledge guidance in representation-based unlearning and to mitigate the impact of fine-tuning attacks on unlearned models. Specifically, we apply scaling factors to attention heads and select the most sensitive heads as knowledge guidance. Guided by this knowledge localization, we integrate enhanced memory perturbation, which forces the model to preserve specific knowledge, into the standard representation-based unlearning process at these sensitive positions. Through this perturbation mechanism, the model eliminates the target knowledge more thoroughly. By adding interventions to the selected attention heads and explicitly optimizing against fine-tuning attacks during unlearning, ALMPU creates a controlled divergence from the original model that is inherently resistant to relearning attempts. Experimental evaluation on the WMDP benchmark demonstrates that ALMPU consistently outperforms baseline methods across different scales of fine-tuning attacks.
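The head-selection step can be illustrated with a small sketch. The abstract does not say how sensitivity is measured, so this sketch assumes one plausible reading: each head's output is multiplied by a scaling factor initialized to 1, and a head's sensitivity is the magnitude of the forget-loss derivative with respect to that factor. The toy loss, the per-head contributions, and the names `head_sensitivities`, `top_k_heads`, and `toy_forget_loss` are all hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence


def head_sensitivities(
    loss_fn: Callable[[Sequence[float]], float],
    num_heads: int,
    eps: float = 1e-3,
) -> List[float]:
    """Estimate |d loss / d s_h| at s = 1 for each head by central differences.

    ASSUMPTION: sensitivity is the gradient magnitude of a forget loss with
    respect to the head's scaling factor; the paper may define it differently.
    """
    scores = []
    for h in range(num_heads):
        up = [1.0] * num_heads
        down = [1.0] * num_heads
        up[h] += eps
        down[h] -= eps
        scores.append(abs(loss_fn(up) - loss_fn(down)) / (2 * eps))
    return scores


def top_k_heads(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k most sensitive heads, most sensitive first."""
    return sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)[:k]


# Toy stand-in for a model (hypothetical): each head contributes a fixed
# scalar, and the "forget loss" is the squared sum of scaled contributions.
HEAD_CONTRIB = [0.1, 2.0, -0.5, 1.5]


def toy_forget_loss(scales: Sequence[float]) -> float:
    total = sum(s * c for s, c in zip(scales, HEAD_CONTRIB))
    return total ** 2


scores = head_sensitivities(toy_forget_loss, num_heads=len(HEAD_CONTRIB))
selected = top_k_heads(scores, k=2)
print(selected)
```

In a real model the finite-difference loop would normally be replaced by a single backward pass through learnable per-head scaling factors, and the selected indices would mark where the perturbation is applied during unlearning.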

Topics

Artificial Intelligence > Core AI > Responsible AI
Natural Language Processing > Resources & Methods > Knowledge Editing

Keywords

model safety, machine unlearning, attention head, knowledge removal, fine-tuning attack, memory perturbation
