From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Weikang Shi; Houxing Ren; Junting Pan; Aojun Zhou; Ke Wang; Zimu Lu; Yunqiao Yang; Yuxuan Hu; Linda Wei; Mingjie Zhan; hongsheng Li

2026 AAAI AAAI 2026

From Solver to Tutor: Evaluating the Pedagogical Intelligence of LLMs with KMP-Bench

Abstract

Abstract Large Language Models (LLMs) show significant potential in AI mathematical tutoring, yet current evaluations often rely on simplistic metrics or narrow pedagogical scenarios, failing to assess comprehensive, multi-turn teaching effectiveness. In this paper, we introduce KMP-Bench, a comprehensive K-8 Mathematical Pedagogical Benchmark designed to assess LLMs from two complementary perspectives. The first module, KMP-Dialogue, evaluates holistic pedagogical capabilities against six core principles (e.g., Challenge, Explanation, Feedback), leveraging a novel multi-turn dialogue dataset constructed by weaving together diverse pedagogical components. The second module, KMP-Skills, provides a granular assessment of foundational tutoring abilities, including multi-turn problem-solving, error detection and correction, and problem generation. Our evaluations on KMP-Bench reveal a key disparity: while leading LLMs excel at tasks with verifiable solutions, they struggle with the nuanced application of pedagogical principles. Additionally, we present KMP-Pile, a large-scale (150K) dialogue dataset. Models fine-tuned on KMP-Pile show substantial improvement on KMP-Bench, underscoring the value of pedagogically-rich training data for developing more effective AI math tutors.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — pedagogical benchmark

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Weikang Shi , Houxing Ren , Junting Pan , Aojun Zhou , Ke Wang , Zimu Lu , Yunqiao Yang , Yuxuan Hu , Linda Wei , Mingjie Zhan , hongsheng Li

Topics

Artificial Intelligence > Core AI > Human-AI Interaction Machine Learning > Application Areas > Data Augmentation Natural Language Processing > Generation > Dialogue Systems

Keywords

dialogue system dialogue dataset large language model mathematical tutoring pedagogical benchmark

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026