Nora Belrose

5 papers · 2023–2025 · 3 top CS/AI conferences

Achievements

🌍 Conference Polyglot (3) · 🌉 Interdisciplinary Bridge · 🧭 Keyword Pioneer · 🐝 Cross-Pollinator (15) · ❓ The Questioner

Conferences

ICML (3) · AAAI (1) · NeurIPS (1)

Top co-authors

Alex Troy Mallen (2) · Lucia Quirke (1) · David Schneider-Joseph (1) · Ryan Cotterell (1) · Shauli Ravfogel (1) · Adam Gleave (1) · Yawen Duan (1) · Sergey Levine (1) · Tom Tseng (1) · Michael D Dennis (1)

Keywords

representation learning (1) · game playing (1) · adversarial attack (1) · linear classifier (1) · recurrent neural network (1) · language model (1) · zero-shot transfer (1) · activation manipulation (1) · concept erasure (1) · bias reduction (1) · interpretability method (1) · model steering (1) · transformer model (1) · adversarial policies (1) · agent vulnerability (1) · activation addition (1)

Papers

Do Transformer Interpretability Methods Transfer to RNNs? · AAAI 2025

Automatically Interpreting Millions of Features in Large Language Models · ICML 2025

Neural Networks Learn Statistics of Increasing Complexity · ICML 2024

LEACE: Perfect linear concept erasure in closed form · NeurIPS 2023

Adversarial Policies Beat Superhuman Go AIs · ICML 2023