IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training

Sean Xie; Soroush Vosoughi; Saeed Hassanpour

2024 EMNLP EMNLP 2024

IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training

Abstract

AbstractAttention has long served as a foundational technique for generating explanations. With the recent developments made in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we aim to address this question by introducing IvRA, a framework designed to directly train a language model’s attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that IvRA outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent, in tandem with their predictions. Furthermore, we perform ablation studies to verify the robustness of IvRA across various experimental settings and to shed light on the interactions among different interpretability criteria.

🧭 Keyword Pioneer — interpretability criterion

🐝 Cross-Pollinator — Artificial Intelligence, Deep Learning, Machine Learning

Authors

Sean Xie , Soroush Vosoughi , Saeed Hassanpour

Topics

Artificial Intelligence > Core AI > Interpretability

Keywords

attention explanation interpretability criterion

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024