Use perturbations when learning from explanations

Juyeon Heo; Vihari Piratla; Matthew Wicker; Adrian Weller

NeurIPS 2023

Abstract

Machine learning from explanations (MLX) is an approach to learning that uses human-provided explanations of relevant or irrelevant features for each input to ensure that model predictions are right for the right reasons. Existing MLX approaches rely on local model interpretation methods and require strong model smoothing to align model and human explanations, leading to sub-optimal performance. We recast MLX as a robustness problem, where human explanations specify a lower dimensional manifold from which perturbations can be drawn, and show both theoretically and empirically how this approach alleviates the need for strong model smoothing. We consider various approaches to achieving robustness, leading to improved performance over prior MLX methods. Finally, we show how to combine robustness with an earlier MLX method, yielding state-of-the-art results on both synthetic and real-world benchmarks.
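The robustness framing in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a logistic model, a binary mask marking the features a human has flagged as irrelevant, and uniform random perturbations restricted to those features. The names `masked_perturbations`, `robustness_penalty`, `eps`, and `n_samples` are hypothetical and chosen for illustration only.

```python
import numpy as np


def masked_perturbations(x, mask, eps, n_samples, rng):
    """Sample copies of x that differ from x only where mask == 1.

    mask marks features labeled irrelevant by a human explanation
    (an assumption of this sketch); all other features stay fixed.
    """
    noise = rng.uniform(-eps, eps, size=(n_samples,) + x.shape)
    return x + noise * mask


def predict_proba(w, b, x):
    """Logistic model used as a stand-in for an arbitrary classifier."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))


def robustness_penalty(w, b, x, mask, eps=1.0, n_samples=64, seed=0):
    """Mean squared change in the prediction under masked perturbations.

    The value is zero when the prediction ignores the masked features
    and grows as the model relies on them.
    """
    rng = np.random.default_rng(seed)
    x_pert = masked_perturbations(x, mask, eps, n_samples, rng)
    p_clean = predict_proba(w, b, x)
    p_pert = predict_proba(w, b, x_pert)
    return float(np.mean((p_pert - p_clean) ** 2))


if __name__ == "__main__":
    x = np.array([0.5, -1.0])
    mask = np.array([0.0, 1.0])  # second feature is marked irrelevant
    print(robustness_penalty(np.array([1.0, 0.0]), 0.0, x, mask))
    print(robustness_penalty(np.array([1.0, 2.0]), 0.0, x, mask))
```

A penalty of this kind could be added to the usual classification loss during training. Random sampling is only one way to pursue robustness; the abstract notes that the paper considers various approaches, which this sketch does not attempt to reproduce.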

Topics

Artificial Intelligence > Core AI > Interpretability
Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Adversarial Learning
Machine Learning > Learning Types > Self-Supervised Learning
Machine Learning > Learning Types > Weakly Supervised Learning
Machine Learning > Optimization & Theory > Optimization
Machine Learning > Learning Types > Interpretability
Machine Learning > Learning Types > Robustness

Keywords

adversarial learning; robust optimization; adversarial robustness; weakly supervised learning; feature attribution; model alignment; perturbation analysis; model interpretation; robustness optimization; human explanation; machine learning from explanation; learning from explanation
