LLM-Oriented Token-Adaptive Knowledge Distillation

Xurong Xie; Zhucun Xue; Jiafu Wu; Jian Li; Yabiao Wang; Xiaobin Hu; Yong Liu; Jiangning Zhang

AAAI 2026

Abstract

Knowledge Distillation (KD) is a key technique for compressing Large-scale Language Models (LLMs), but prevailing logit-based methods employ static strategies misaligned with the student’s dynamic learning process. By treating all tokens indiscriminately with a fixed temperature, these methods result in suboptimal knowledge transfer. To address this, we propose LLM-oriented token-Adaptive Knowledge Distillation (AdaKD), a framework that adapts the distillation process to each token’s real-time learning state. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. First, the Loss-driven Adaptive Token Focusing (LATF) module dynamically concentrates distillation on valuable tokens by monitoring the student’s learning stability. Second, Inverse Difficulty Temperature Scaling (IDTS) introduces a counterintuitive token-level temperature: low for difficult tokens to target error correction, and high for easy tokens to learn the teacher’s smooth output distribution for better generalization. As a plug-and-play framework, AdaKD consistently improves performance across diverse distillation methods, model architectures, and benchmarks.
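
The abstract gives no formulas, so the following is only an illustrative sketch of the two ideas it names, not the authors' implementation. It assumes, purely for illustration, that token difficulty is the KL divergence between teacher and student distributions, that temperature falls exponentially with min-max-normalized difficulty, and that token focusing keeps a fixed fraction of the hardest tokens (the abstract says LATF adjusts its focus by monitoring learning stability, which is not modeled here). All function names and the parameters `base_temp`, `spread`, and `keep_ratio` are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token's vocabulary logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def token_difficulty(teacher_logits, student_logits):
    # ASSUMPTION: difficulty = KL(teacher || student) at temperature 1.
    return kl_divergence(softmax(teacher_logits), softmax(student_logits))

def inverse_difficulty_temperatures(difficulties, base_temp=1.0, spread=0.5):
    # ASSUMPTION: exponential mapping; the hardest token gets the lowest
    # temperature and the easiest token gets the highest.
    lo, hi = min(difficulties), max(difficulties)
    span = hi - lo
    temps = []
    for d in difficulties:
        rank = 0.5 if span == 0 else (d - lo) / span  # 0 = easiest, 1 = hardest
        temps.append(base_temp * math.exp(spread * (1.0 - 2.0 * rank)))
    return temps

def select_focus_tokens(difficulties, keep_ratio=0.5):
    # ASSUMPTION: fixed keep ratio, hardest tokens first.
    k = max(1, math.ceil(keep_ratio * len(difficulties)))
    order = sorted(range(len(difficulties)),
                   key=lambda i: difficulties[i], reverse=True)
    return sorted(order[:k])

def adaptive_kd_loss(teacher_logits, student_logits, keep_ratio=0.5):
    """Mean per-token KL over the kept tokens, each at its own temperature."""
    diffs = [token_difficulty(t, s)
             for t, s in zip(teacher_logits, student_logits)]
    temps = inverse_difficulty_temperatures(diffs)
    kept = select_focus_tokens(diffs, keep_ratio)
    loss = 0.0
    for i in kept:
        p = softmax(teacher_logits[i], temps[i])
        q = softmax(student_logits[i], temps[i])
        loss += kl_divergence(p, q)
    return loss / len(kept), temps, kept

# Three tokens: the student matches the teacher on the first, is somewhat
# off on the second, and is badly wrong on the third.
teacher = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0]]
student = [[4.0, 0.0, 0.0], [2.0, 1.0, 0.0], [0.0, 4.0, 0.0]]
loss, temps, kept = adaptive_kd_loss(teacher, student, keep_ratio=0.5)
```

In this toy input the temperatures decrease from the first token to the third, and only the two hardest tokens contribute to the loss. A real training loop would compute the same quantities on batched tensors and combine the result with whichever base distillation objective is in use.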

Topics

Artificial Intelligence > Core AI > Model Compression; Machine Learning > Application Areas > Knowledge Distillation

Keywords

knowledge distillation; language model; temperature scaling; student model; teacher model; token-level adaptation
