Sycophancy Hides Linearly in the Attention Heads

Rifo Ahmad Genadi; Munachiso Samuel Nwadike; Nurdaulet Mukhituly; Tatsuya Hiraoka; Hilal AlQuabeh; Kentaro Inui

2026 EACL EACL 2026

Sycophancy Hides Linearly in the Attention Heads

Abstract

AbstractWe find that correct-to-incorrect sycophancy signals are most linearly accessible within multi-head attention activations. Motivated by the linear representation hypothesis, we train linear probes across the residual stream, multilayer perceptron (MLP), and attention layers to analyze where these signals emerge. Although separability appears in the residual stream and MLPs, steering using these probes is most effective in a sparse subset of middle-layer attention heads. Using TruthfulQA as the base dataset, we find that probes trained on it transfer effectively to other factual QA benchmarks. Furthermore, comparing our discovered direction to previously identified “truthful” directions reveals limited overlap, suggesting that factual accuracy, and deference resistance, arise from related but distinct mechanisms. Attention-pattern analysis further indicates that the influential heads attend disproportionately to expressions of user doubt, contributing to sycophantic shifts. Overall, these findings suggest that sycophancy can be mitigated through simple, targeted linear interventions that exploit the internal geometry of attention activations. Code will be released upon publication.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — representation hypothesis

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio

Authors

Rifo Ahmad Genadi , Munachiso Samuel Nwadike , Nurdaulet Mukhituly , Tatsuya Hiraoka , Hilal AlQuabeh , Kentaro Inui

Topics

Artificial Intelligence > Core AI > Interpretability Machine Learning > Core Methods > Representation Learning

Keywords

attention head residual stream linear probe representation hypothesis

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026