Yushi Yang

2 papers · 2025 · 1 conference · across top CS/AI conferences

Achievements

🌉 Interdisciplinary Bridge · 🐝 Cross-Pollinator (15) · ❓ The Questioner

Conferences

EMNLP (2)

Top co-authors

Harry Mayne (2) · Adam Mahdi (2) · Andrew M. Bean (1) · Filip Sondej (1) · Ryan Othniel Kearns (1) · Chris Russell (1) · Eoin D. Delaney (1) · Andrew Lee (1)

Keywords

direct preference optimization (1) · model behavior (1) · neural network analysis (1) · ai safety (1) · language model (1) · decision boundary (1) · model explanation (1) · counterfactual explanation (1) · mechanistic interpretability (1) · safety fine-tuning (1) · activation editing (1) · neuron analysis (1) · toxicity reduction (1) · large language model (1) · language model safety (1) · self-generated explanation (1)

Papers

LLMs Don’t Know Their Own Decision Boundaries: The Unreliability of Self-Generated Counterfactual Explanations · EMNLP 2025

How Does DPO Reduce Toxicity? A Mechanistic Neuron-Level Analysis · EMNLP 2025