Adrià Garriga-Alonso

10 papers · 2019–2025 · 3 top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge 🧭 Keyword Pioneer 🌈 Renaissance Researcher (5) 🗺️ Taxonomy Completionist (21) 🐣 Hot Topic Early Bird 🌍 Conference Polyglot (3) 🏃 Academic Marathon (6) 🐝 Cross-Pollinator (11) 🏆 Keyword Champion (2) 💎 Century Club (10) 🔥 Unstoppable (5)

Conferences

NIPS (5) ICLR (3) UAI (2)

Top co-authors

Laurence Aitchison (3) Mark van der Wilk (3) Vincent Fortuin (2) Thomas Kwa (2) Aengus Lynch (2) Thomas Bush (1) David Krueger (1) Achille Nazaret (1) Daniel Tan (1) Florian Wenzel (1)

Keywords

mechanistic interpretability (3) circuit discovery (2) neural network analysis (2) language model (2) policy optimization (1) data augmentation (1) kl divergence (1) model behavior (1) model interpretability (1) reinforcement learning from human feedback (1) hypothesis testing (1) convolutional neural network (1) heavy-tailed distribution (1) reward misspecification (1) reward hacking (1) circuit analysis (1) neural network verification (1) bayesian neural network (1) steering vector (1) causal model (1)

Papers

Interpreting Emergent Planning in Model-Free Reinforcement Learning ICLR 2025

Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification NIPS 2024

InterpBench: Semi-Synthetic Transformers for Evaluating Mechanistic Interpretability Techniques NIPS 2024

Hypothesis Testing the Circuit Hypothesis in LLMs NIPS 2024

Analysing the Generalisation and Reliability of Steering Vectors NIPS 2024

Towards Automated Circuit Discovery for Mechanistic Interpretability NIPS 2023

Data augmentation in Bayesian neural networks and the cold posterior effect UAI 2022

Bayesian Neural Network Priors Revisited ICLR 2022

Correlated weights in infinite limits of deep convolutional neural networks UAI 2021

Deep Convolutional Networks as shallow Gaussian Processes ICLR 2019