Always Refuse: Steering LLMs Against Jailbreaks with Contrastive Activations (Student Abstract)

Abhilekh Borah; Niranjan Chebrolu; Kokil Jaidka

AAAI 2026

Abstract

“Refusals must be resilient, not brittle.” Yet guarding refusals against adversarial phrasing and shifting user contexts remains difficult: large language models (LLMs) still yield to jailbreak prompts that evade safety filters and surface harmful content. We propose Refusal Activation Steering (RAS), a training-free, inference-time method that uses contrastive activations to shift LLM responses, biasing generation trajectories toward refusals without altering model weights. The approach is modular and domain-targetable, avoiding collateral refusals on benign queries while strengthening activation-space boundaries for unsafe content. On adversarial evaluations with an 8B instruction-tuned model, we find that steering improves refusal rate by ∼52% and reduces attack success rate by ∼40%, establishing a lightweight and interpretable safety layer for robust refusal consistency. To foster further research in this domain, we have made our implementation publicly available.
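The abstract does not spell out how the contrastive activations are computed or applied, so the following is only a minimal sketch of one common form of activation steering (a difference of mean activations added to a hidden state at inference time), not the authors' RAS implementation. The function names, the toy 3-dimensional activations, and the scaling factor `alpha` are all assumptions made for illustration.

```python
from statistics import fmean


def mean_activation(activations):
    """Average a list of equal-length activation vectors, dimension by dimension."""
    return [fmean(dim) for dim in zip(*activations)]


def steering_vector(refusal_acts, compliance_acts):
    """Difference of means: a direction pointing from 'complies' toward 'refuses'."""
    mu_refuse = mean_activation(refusal_acts)
    mu_comply = mean_activation(compliance_acts)
    return [r - c for r, c in zip(mu_refuse, mu_comply)]


def steer(hidden, vector, alpha=1.0):
    """Shift one hidden state along the steering vector; no weights are modified."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


# Toy activations (hypothetical values standing in for one layer's hidden states).
refusal_acts = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]      # prompts the model refused
compliance_acts = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]   # prompts the model answered

v = steering_vector(refusal_acts, compliance_acts)
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)
```

In this sketch the steering direction is computed once from contrasting examples and then reused for every new input, which is what makes this style of intervention training-free. A larger `alpha` pushes harder toward refusal, at the likely cost of more refusals on benign queries.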

Topics

Artificial Intelligence > Core AI > AI Safety
Artificial Intelligence > Core AI > Interpretability

Keywords

jailbreak attack, llm safety, activation space, refusal steering, contrastive activation
