Tracr: Compiled Transformers as a Laboratory for Interpretability

David Lindner; János Kramár; Sebastian Farquhar; Matthew Rahtz; Tom McGrath; Vladimir Mikulik

NeurIPS 2023

Abstract

We show how to "compile" human-readable programs into standard decoder-only transformer models. Our compiler, Tracr, generates models with known structure, and this structure can be used to design experiments. For example, we use it to study "superposition" in transformers that execute multi-step algorithms. The known structure of Tracr-compiled models can also serve as ground truth for evaluating interpretability methods: because the "programs" learned by trained transformers are usually unknown, it is often unclear whether an interpretation succeeded. We demonstrate our approach by implementing and examining programs that compute token frequencies, sort sequences, and check parentheses. We provide an open-source implementation of Tracr at https://github.com/google-deepmind/tracr.
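To make the token-frequency example concrete, here is a minimal sketch in plain Python of the kind of select-and-count program that Tracr takes as input (Tracr compiles programs written in RASP). It does not use the Tracr library or reproduce its API: the helper names `select`, `selector_width`, and `token_frequencies` are hypothetical stand-ins chosen for illustration, and the code only emulates the program's behaviour rather than compiling anything into transformer weights.

```python
from typing import Callable, Hashable, List, Sequence


def select(
    keys: Sequence[Hashable],
    queries: Sequence[Hashable],
    predicate: Callable[[Hashable, Hashable], bool],
) -> List[List[bool]]:
    """Build a boolean matrix m[q][k] = predicate(keys[k], queries[q]).

    Illustrative stand-in for an attention pattern: row q marks the
    key positions that query position q attends to.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: List[List[bool]]) -> List[int]:
    """Count, for each query position, how many key positions are selected."""
    return [sum(row) for row in selector]


def token_frequencies(tokens: Sequence[Hashable]) -> List[int]:
    """For each position, count how often that position's token occurs."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))  # [1, 1, 2, 2, 1]
```

In a compiled model, the ground truth is that the selection step corresponds to a known attention pattern, which is what allows an interpretability method's output to be checked against the intended program.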

Topics

Artificial Intelligence > Core AI > Interpretability
Deep Learning > Architectures > Transformers
Deep Learning > Models > Language Models

Keywords

model architecture, model analysis, mechanistic interpretability, attention analysis, transformer interpretability, neural network, mechanistic analysis, program compilation, compiled model
