LLMs on a Budget? Say HOLA

Zohaib Hasan Siddiqui; Jiechao Gao; Ebad Shabbir; Mohammad Anas Azeez; Rafiq Ali; Gautam Siddharth Kashyap; Usman Naseem

EMNLP 2025

Abstract

Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, which poses a barrier for real-time applications in industries like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano, proving both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase
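
To make the "faster inference without quality loss" claim concrete, the following is a minimal sketch of plain, single-level greedy speculative decoding. It is not HOLA's HSD, whose hierarchy is not described in this abstract. The toy models `toy_target` and `toy_draft`, the window size `k`, and all function names are hypothetical and chosen only for illustration. A cheap draft model proposes a few tokens, the target model checks them, and the output is identical to what the target model would produce on its own.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token (assumption).
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: generate n_new tokens one at a time with a single model."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        ctx = list(tokens)
        for _ in range(min(k, goal - len(tokens))):
            nxt = draft(ctx)
            proposal.append(nxt)
            ctx.append(nxt)
        # 2) The target checks each proposed position. This toy calls the target
        #    once per token; a real system would score all k positions in one
        #    batched forward pass, which is where the speedup would come from.
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)   # always keep the target's own token
            if expected != guess:     # first mismatch invalidates the rest
                break
    return tokens


def toy_target(tokens: List[int]) -> int:
    """Hypothetical 'large' model: a fixed deterministic rule over the last token."""
    return (3 * tokens[-1] + 1) % 7


def toy_draft(tokens: List[int]) -> int:
    """Hypothetical 'small' model: agrees with the target except after token 5."""
    return 0 if tokens[-1] == 5 else toy_target(tokens)


if __name__ == "__main__":
    fast = speculative_decode(toy_target, toy_draft, [1], 10)
    slow = greedy_decode(toy_target, [1], 10)
    print(fast == slow)
```

Because every appended token is the target model's own choice, a weak draft model can only reduce the speedup; it cannot change the output.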

Topics

Artificial Intelligence > Core AI > Model Compression; Artificial Intelligence > Learning Paradigms > Transfer Learning; Machine Learning > Application Areas > Efficient Computing; Deep Learning > Architectures > Transformers; Machine Learning > Application Areas > Model Compression; Artificial Intelligence > Core AI > Large Language Models; Artificial Intelligence > Core AI > Efficient Computing; Machine Learning > Learning Types > Retrieval-Augmented Generation; Deep Learning > Optimization & Theory > Model Compression

Keywords

model compression; model quantization; model pruning; edge computing; retrieval-augmented generation; speculative decoding; parameter-efficient finetuning; large language model; hierarchical speculative decoding
