Beyond One-Step Distillation: Bridging the Capacity Gap in Small Language Models via Multi-Step Knowledge Transfer

Gaeun Yim; Nayoung Ko; Manasa Bharadwaj

2026 EACL EACL 2026

Beyond One-Step Distillation: Bridging the Capacity Gap in Small Language Models via Multi-Step Knowledge Transfer

Abstract

AbstractLarge Language Models (LLMs) excel across diverse NLP tasks but remain too large for efficient on-device deployment. Although knowledge distillation offers a promising compression strategy, direct one-step distillation from a large teacher to a small student often leads to substantial performance loss due to the capacity gap. In this work, we revisit multi-step knowledge distillation (MSKD) as an effective remedy, exploring how staged, size-aware transfer paths can better preserve teacher knowledge across students of varying scales. Through extensive experiments with GPT-2 and OPT, we demonstrate that MSKD consistently improves ROUGE-L and perplexity over single-step approaches without requiring specialized fine-tuning. Our results establish multi-step transfer as a simple yet powerful framework for progressively compressing LLMs into efficient, high-performing Small Language Models (SLMs).

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — multi-step distillation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Gaeun Yim , Nayoung Ko , Manasa Bharadwaj

Topics

Artificial Intelligence > Core AI > Model Compression Machine Learning > Application Areas > Knowledge Distillation Natural Language Processing > Resources & Methods > Large Language Models

Keywords

model compression knowledge distillation small language model capacity gap multi-step distillation

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026