Feature Drift: How Fine-Tuning Repurposes Representations in LLMs

Andrey V. Galichin; Anton Korznikov; Alexey Dontsov; Oleg Rogov; Elena Tutubalina; Ivan Oseledets

2026 EACL EACL 2026

Feature Drift: How Fine-Tuning Repurposes Representations in LLMs

Abstract

AbstractFine-tuning LLMs introduces many important behaviors, such as instruction-following and safety alignment. This makes it crucial to study how fine-tuning changes models’ internal mechanisms. Sparse Autoencoders (SAEs) offer a powerful tool for interpreting neural networks by extracting concepts (features) represented in their activations. Previous work observed that SAEs trained on base models transfer effectively to instruction-tuned (chat) models, attributed to activation similarity. In this work, we propose *feature drift* as an alternative explanation: the feature space remains relevant, but the distribution of feature activations changes. In other words, fine-tuning recombines existing concepts rather than learning new ones. We validate this by showing base SAEs reconstruct both base and chat activations comparably despite systematic differences, with individual features exhibiting clear drift patterns. In a refusal behavior case study, we identify base SAE features that drift to activate on harmful instructions in chat models. Causal interventions using these features confirm that they mediate refusal. Our findings suggest that monitoring how existing features drift, rather than searching for entirely new features, may provide a more complete explanation of how fine-tuning changes model capabilities.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Andrey V. Galichin , Anton Korznikov , Alexey Dontsov , Oleg Rogov , Elena Tutubalina , Ivan Oseledets

Topics

Artificial Intelligence > Core AI > Interpretability Machine Learning > Core Methods > Representation Learning Deep Learning > Techniques > Pretraining

Keywords

feature representation language model sparse autoencoder causal intervention

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026