Do as I do (Safely): Mitigating Task-Specific Fine-tuning Risks in Large Language Models

Francisco Eiras; Aleksandar Petrov; Philip Torr; M. Pawan Kumar; Adel Bibi

ICLR 2025

Abstract

Recent research shows that fine-tuning on benign instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. While instruction-following fine-tuning is important, task-specific fine-tuning, where models are trained on datasets with clear ground-truth answers (e.g., multiple-choice questions), can enhance model performance on specialized downstream tasks. Understanding and mitigating safety risks in the task-specific setting remains distinct from the instruction-following context due to structural differences in the data. Our work demonstrates how malicious actors can subtly manipulate the structure of almost *any* task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which *mimics* the task format and prompting style of the user data, showing this is significantly more effective and efficient than existing baselines at re-establishing safety alignment while maintaining similar task performance.
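
The mixing idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `{question}` template placeholder, the dictionary fields, and the `safety_fraction` default are all hypothetical choices made for illustration. It assumes safety examples are (prompt, safe response) pairs and that the user's task data follows a single prompt template.

```python
import random
from typing import Dict, List

Example = Dict[str, str]  # assumed schema: {"prompt": ..., "response": ...}


def format_like_task(safety: List[Example], template: str) -> List[Example]:
    """Wrap each safety prompt in the task's prompt template.

    The safe response is kept unchanged as the training target.
    `template` must contain a `{question}` placeholder (an assumption).
    """
    return [
        {"prompt": template.format(question=ex["prompt"]), "response": ex["response"]}
        for ex in safety
    ]


def mix_in_safety(
    task_data: List[Example],
    safety_data: List[Example],
    template: str,
    safety_fraction: float = 0.1,  # hypothetical default, not from the paper
    seed: int = 0,
) -> List[Example]:
    """Return task data plus a sample of safety data restyled to the task format."""
    rng = random.Random(seed)
    # Number of safety examples, capped by how many are available.
    n_safety = min(len(safety_data), round(safety_fraction * len(task_data)))
    sampled = rng.sample(safety_data, n_safety)
    mixed = list(task_data) + format_like_task(sampled, template)
    rng.shuffle(mixed)  # interleave safety examples with task examples
    return mixed
```

Here the safety examples inherit the same surface form as the task prompts, which is one simple way to approximate "mimicking" the task format; other approaches to restyling safety data are equally compatible with the description above.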
