ECHO-LLaMA: Efficient Caching for High-Performance LLaMA Training

Maryam Dialameh; Rezaul Karim; Hossein Rajabzadeh; Omar Mohamed Awad; Boxing Chen; Hyock Ju Kwon; Walid Ahmed; Yang Liu

EMNLP 2025

Abstract

This paper introduces ECHO-LLaMA, an efficient variant of the LLaMA architecture designed to improve both training speed and inference throughput while preserving the model's learning capacity. ECHO-LLaMA converts LLaMA models so that certain layers share a KV cache, which significantly reduces the cost of computing keys and values while maintaining or improving language performance. Experimental results show that ECHO-LLaMA achieves up to 77% higher token-per-second throughput during training, up to 16% higher Model FLOPs Utilization (MFU), and up to 14% lower loss when trained on an equal number of tokens. On the 1.1B model, ECHO-LLaMA also delivers approximately 7% higher test-time throughput than the baseline. Because the adaptation mechanism is itself computationally efficient, ECHO-LLaMA offers a scalable and cost-effective approach to pretraining and finetuning large language models, enabling faster and more resource-efficient training without compromising performance.
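To make the idea of sharing KV caches across layers concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which layers share a cache or how the models are adapted, so the `kv_source` mapping, the single-head attention, the layer count, and all function and variable names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v):
    # Single-head scaled dot-product attention with a causal mask.
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

def forward(x, layers, kv_source):
    """Run a toy attention-only stack.

    kv_source[i] is the index of the layer whose K/V layer i reuses.
    kv_source[i] == i means layer i computes its own K/V.
    (Hypothetical mapping; the sharing pattern is an assumption.)
    """
    cache = {}
    kv_projections = 0
    for i, layer in enumerate(layers):
        src = kv_source[i]
        if src == i:
            # Only "owner" layers pay for the K and V projections.
            cache[i] = (x @ layer["wk"], x @ layer["wv"])
            kv_projections += 1
        k, v = cache[src]
        q = x @ layer["wq"]  # every layer still computes its own queries
        x = x + causal_attention(q, k, v) @ layer["wo"]
    return x, kv_projections

rng = np.random.default_rng(0)
d_model, n_layers, seq_len = 8, 4, 5
layers = [
    {name: rng.standard_normal((d_model, d_model)) * 0.1
     for name in ("wq", "wk", "wv", "wo")}
    for _ in range(n_layers)
]
x = rng.standard_normal((seq_len, d_model))

# Baseline: every layer computes its own K/V.
out_base, n_base = forward(x, layers, kv_source=[0, 1, 2, 3])
# Shared: layers 2 and 3 reuse the K/V computed by layer 1 (assumed pattern).
out_shared, n_shared = forward(x, layers, kv_source=[0, 1, 1, 1])

print(n_base, n_shared)
```

In this toy setup the shared variant performs K/V projections in two layers instead of four, and only two layers' worth of keys and values need to be cached at inference time. The sketch says nothing about the throughput or loss figures reported in the abstract; it only shows where the savings in KV computation come from.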

Topics

Artificial Intelligence > Core AI > Model Compression; Machine Learning > Optimization & Theory > Neural Network Optimization; Machine Learning > Application Areas > Efficient Computing; Machine Learning > Application Areas > Model Compression; Artificial Intelligence > Core AI > Large Language Models; Deep Learning > Models > Large Language Models; Artificial Intelligence > Core AI > Efficient Computing; Deep Learning > Optimization & Theory > Neural Network Optimization; Deep Learning > Optimization & Theory > Model Compression; Deep Learning > Optimization & Theory > Efficient Computing

Keywords

model compression; transformer architecture; efficient computing; training optimization; parameter sharing; training efficiency; inference optimization; inference throughput; model efficiency; large language model; KV caching
