Scaling Down, Serving Fast: Compressing and Deploying Efficient LLMs for Recommendation Systems

Kayhan Behdin; Ata Fatahibaarzi; Qingquan Song; Yun Dai; Aman Gupta; Zhipeng Wang; Hejian Sang; Shao Tang; Gregory Dexter; Sirou Zhu; Siyu Zhu; Tejas Dharamsi; Vignesh Kothapalli; Zhoutong Fu; Yihan Cao; Pin-Lun Hsu; Fedor Borisyuk; Natesh S. Pillai; Luke Simon; Rahul Mazumder

EMNLP 2025

Abstract

Large language models (LLMs) have demonstrated remarkable performance across a wide range of industrial applications, from search and recommendation systems to generative tasks. Although scaling laws indicate that larger models generally yield better generalization and performance, their substantial computational requirements often render them impractical for many real-world scenarios at scale. In this paper, we present a comprehensive set of insights for training and deploying small language models (SLMs) that deliver high performance for a variety of industry use cases. We focus on two key techniques: (1) knowledge distillation and (2) model compression via structured pruning and quantization. These approaches enable SLMs to retain much of the quality of their larger counterparts while significantly reducing training/serving costs and latency. We detail the impact of these techniques on a variety of use cases in a large professional social network platform and share deployment lessons, including hardware optimization strategies that improve speed and throughput for both predictive and reasoning-based applications in recommendation systems.

Topics

Machine Learning > Application Areas > Efficient Computing; Machine Learning > Application Areas > Knowledge Distillation; Machine Learning > Core Methods > Model Compression; Machine Learning > Application Areas > Recommender Systems; Deep Learning > Optimization & Theory > Model Compression; Deep Learning > Techniques > Knowledge Distillation

Keywords

model compression, model quantization, knowledge distillation, efficient computing, recommendation system, recommender system, structured pruning, small language model
