FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Yu-Chen Lu; Chong-Yan Chen; Chi-Chih Chang; Yu-Fang Hu; Kai-Chiang Wu

2025 EMNLP EMNLP 2025

FLRC: Fine-grained Low-Rank Compressor for Efficient LLM Inference

Abstract

AbstractAlthough large language models (LLM) have achieved remarkable performance, their enormous parameter counts hinder deployment on resource-constrained hardware. Low-rank compression can reduce both memory usage and computational demand, but applying a uniform compression ratio across all layers often leads to significant performance degradation, and previous methods perform poorly during decoding. To address these issues, we propose the Fine-grained Low-Rank Compressor (FLRC), which efficiently determines an optimal rank allocation for each layer, and incorporates progressive low-rank decoding to maintain text generation quality. Comprehensive experiments on diverse benchmarks demonstrate the superiority of FLRC, achieving up to a 17% improvement in ROUGE-L on summarization tasks compared to state-of-the-art low-rank compression methods, establishing a more robust and efficient framework to improve LLM inference.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yu-Chen Lu , Chong-Yan Chen , Chi-Chih Chang , Yu-Fang Hu , Kai-Chiang Wu

Topics

Artificial Intelligence > Core AI > Model Compression Machine Learning > Optimization & Theory > Optimization Machine Learning > Application Areas > Efficient Computing Machine Learning > Application Areas > Model Compression Artificial Intelligence > Core AI > Large Language Models Deep Learning > Optimization & Theory > Optimization

Keywords

model compression text generation inference efficiency parameter efficiency rank allocation progressive decoding low-rank compression

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025