Dmitrii Krasheninnikov

4 papers · 2019–2024 · 3 top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge · 🧭 Keyword Pioneer · 🌍 Conference Polyglot (3) · 🏃 Academic Marathon (5) · 🐝 Cross-Pollinator (12) · 🗺️ Taxonomy Completionist (11) · 🐣 Hot Topic Early Bird

Conferences

NIPS (2) · ICLR (1) · ICML (1)

Top co-authors

David Krueger (3) · Egor Krasheninnikov (1) · Fabien Roger (1) · Bruno Kacper Mlodozeniec (1) · Rohin Shah (1) · Tegan Maharaj (1) · Ryan Greenblatt (1) · Pieter Abbeel (1) · Nikolaus Howe (1) · Anca Dragan (1)

Keywords

reinforcement learning (2) · model evaluation (1) · model safety (1) · reward function (1) · reward hacking (1) · capability elicitation (1) · password-locked model (1) · safety evaluation (1) · hidden capability (1) · deterministic policy (1) · stochastic policy (1) · llm alignment (1) · large language model (1) · proxy reward (1) · dangerous capability (1) · unhackable proxy (1) · fine-tuning evaluation (1)

Papers

Stress-Testing Capability Elicitation With Password-Locked Models · NIPS 2024

Implicit meta-learning may lead language models to trust more reliable sources · ICML 2024

Defining and Characterizing Reward Gaming · NIPS 2022

Preferences Implicit in the State of the World · ICLR 2019