Making Sense of LLM Decisions: A Prototype-based Framework for Explainable Classification

Bowen Wei; Mehrdad Fazli; Ziwei Zhu

AAAI 2026

Abstract

Large language models have demonstrated impressive performance on natural language tasks, but their decision-making processes remain opaque. Existing explanation methods either suffer from limited faithfulness to the model's reasoning or produce explanations that are difficult for humans to understand. To address these challenges, we propose ProtoSurE, a novel prototype-based surrogate framework that provides faithful and understandable explanations for LLMs. ProtoSurE trains an interpretable-by-design surrogate model that aligns with the target LLM while utilizing sentence-level prototypes as understandable concepts. Extensive experiments show that ProtoSurE consistently outperforms state-of-the-art explanation methods across diverse LLMs and datasets. Importantly, ProtoSurE demonstrates strong data efficiency, requiring relatively few training examples to achieve good performance, making it practical for real-world applications.
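The general idea of a prototype-based surrogate can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the architecture, so the max-over-sentences cosine similarity, the linear layer over prototype similarities, and all names below (`prototype_features`, `predict_with_explanation`, `fidelity`) are assumptions chosen for illustration. Sentence embeddings are taken as given toy vectors, and the prototypes and weights are fixed by hand here, whereas a real surrogate would learn them so that its predictions match the target LLM's.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (k, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def prototype_features(sentence_embs, prototypes):
    """For each prototype, return the best similarity over all sentences
    and the index of the sentence that achieved it."""
    sims = cosine_sim(sentence_embs, prototypes)  # (n_sentences, n_prototypes)
    return sims.max(axis=0), sims.argmax(axis=0)

def predict_with_explanation(sentence_embs, prototypes, weights):
    """Classify a document and report which prototype and sentence drove the decision.
    weights has shape (n_prototypes, n_classes)."""
    feats, best_sentence = prototype_features(sentence_embs, prototypes)
    logits = feats @ weights
    label = int(np.argmax(logits))
    # Contribution of each prototype to the predicted class logit.
    contributions = feats * weights[:, label]
    top_proto = int(np.argmax(contributions))
    return label, top_proto, int(best_sentence[top_proto])

def fidelity(surrogate_labels, llm_labels):
    """Fraction of inputs where the surrogate agrees with the target LLM."""
    return float(np.mean(np.asarray(surrogate_labels) == np.asarray(llm_labels)))

# Toy data (hypothetical): 2 prototypes, 2 classes, one document of 3 sentences.
prototypes = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = np.array([[2.0, 0.0], [0.0, 2.0]])  # prototype i supports class i
doc = np.array([[0.6, 0.4], [0.1, 0.9], [0.3, 0.7]])

label, proto, sent = predict_with_explanation(doc, prototypes, weights)
print(f"class={label}, prototype={proto}, matched sentence={sent}")
```

In this sketch the explanation is the triple (predicted class, most influential prototype, sentence that best matched it), and `fidelity` measures agreement with the LLM's labels, which is one common way to quantify how faithfully a surrogate tracks its target.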

Topics

Artificial Intelligence > Core AI > Interpretability
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

surrogate model; faithful explanation; large language model; explainable classification; prototype-based explanation; sentence-level prototype
