An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L

Jett Janiak; Can Rager; James Dao; Yeu-Tong Lau

EMNLP 2024

Abstract

Prior work suggests that language models manage the limited bandwidth of the residual stream through a “memory management” mechanism, in which certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can yield misleading results because it does not account for erasure.
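To make the last claim concrete, here is a minimal toy sketch of why DLA can mislead when a later component erases an earlier one. It is not the authors' code or the GELU-4L model: the dimensions, the random unembedding matrix `W_U`, the names `writer_out` and `eraser_out`, and the assumption of perfect erasure are all hypothetical, and the final layer norm is ignored for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 50  # toy sizes, chosen arbitrarily

# Hypothetical unembedding matrix (random values, not real model weights).
W_U = rng.normal(size=(d_model, d_vocab))


def direct_logit_attribution(component_out, W_U):
    """Project one component's residual-stream write directly onto the logits."""
    return component_out @ W_U


# An earlier "writer" head adds a vector to the residual stream.
writer_out = rng.normal(size=d_model)

# A later "eraser" head writes the negation, clearing that direction.
# Perfect erasure (fraction 1.0) is an assumption made for illustration.
erase_fraction = 1.0
eraser_out = -erase_fraction * writer_out

dla_writer = direct_logit_attribution(writer_out, W_U)
dla_eraser = direct_logit_attribution(eraser_out, W_U)
net_effect = direct_logit_attribution(writer_out + eraser_out, W_U)

print("largest |DLA| for writer:", np.abs(dla_writer).max())
print("largest |DLA| for eraser:", np.abs(dla_eraser).max())
print("largest |net logit effect|:", np.abs(net_effect).max())
```

In this toy setting, DLA assigns both heads large logit contributions of opposite sign even though their combined effect on the logits is zero, so reading either value in isolation would overstate that head's role in the prediction.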

Topics

Artificial Intelligence > Core AI > Interpretability; Artificial Intelligence > Core AI > Memory

Keywords

attention head; residual stream; memory management; direct logit attribution
