Towards Robust Interpretability with Self-Explaining Neural Networks

David Alvarez-Melis; Tommi Jaakkola

NeurIPS 2018

Abstract

Most recent work on interpretability of complex machine learning models has focused on estimating a-posteriori explanations for previously trained models around specific predictions. Self-explaining models where interpretability plays a key role already during learning have received much less attention. We propose three desiderata for explanations in general -- explicitness, faithfulness, and stability -- and show that existing methods do not satisfy them. In response, we design self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models. Faithfulness and stability are enforced via regularization specifically tailored to such models. Experimental results across various benchmark datasets show that our framework offers a promising direction for reconciling model complexity and interpretability.
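The idea of "generalizing linear classifiers" while keeping stability can be pictured with a toy model whose coefficients depend on the input. What follows is a minimal sketch under stated assumptions, not the authors' architecture or regularizer: the `tanh` parametrization of the coefficients, the names `theta`, `predict`, `explain`, and `instability`, and the finite-difference probe are all hypothetical choices made for illustration.

```python
import math


def theta(x, w):
    # Hypothetical input-dependent coefficients: theta_i(x) = tanh(w_i * x_i).
    # With w = 0 the coefficients are constant and the model is plainly linear.
    return [math.tanh(wi * xi) for wi, xi in zip(w, x)]


def predict(x, w):
    # Linear-looking prediction: f(x) = sum_i theta_i(x) * x_i.
    return sum(c * xi for c, xi in zip(theta(x, w), x))


def explain(x, w):
    # The explanation is the list of (feature value, coefficient) pairs
    # that the prediction is built from, so it is explicit by construction.
    return list(zip(x, theta(x, w)))


def instability(x, w, eps=1e-3):
    # Crude stability probe (an assumption, not the paper's regularizer):
    # nudge each input coordinate by eps and report the largest change in
    # the coefficient vector, divided by eps.
    base = theta(x, w)
    worst = 0.0
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += eps
        moved = theta(shifted, w)
        change = math.sqrt(sum((a - b) ** 2 for a, b in zip(base, moved)))
        worst = max(worst, change / eps)
    return worst


if __name__ == "__main__":
    x, w = [0.5, -1.0], [1.0, 2.0]
    print(predict(x, w))
    print(explain(x, w))
    print(instability(x, w))
```

In this toy setting, a penalty on `instability` during training would discourage coefficients that swing sharply between nearby inputs, which is the intuition behind asking explanations to be stable.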

Authors

David Alvarez-Melis, Tommi Jaakkola

Topics

Artificial Intelligence > Core AI > Interpretability

Machine Learning > Core Methods > Classification

Deep Learning > Architectures > Neural Networks

Keywords

feature attribution, explanation faithfulness, neural network, self-explaining model, explanation stability
