FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Zishan Shao; Yixiao Wang; Qinsi Wang; Ting Jiang; Zhixu Du; Hancheng Ye; Danyang Zhuo; Yiran Chen; Hai Helen Li

2026 AAAI AAAI 2026

FlashSVD: Memory-Efficient Inference with Streaming for Low-Rank Models

Abstract

Abstract Singular Value Decomposition (SVD) has recently gained traction as an effective compression technique for large language models (LLMs), with many studies reporting 20-80% parameter reduction at minimal accuracy cost. However, despite reducing weight memory, existing SVD-based approaches still rely on standard dense CUDA kernels during inference, which incur substantial-and ultimately unnecessary-activation memory overhead. Our analysis reveals that this kernel-induced cost, which grows with sequence length and hidden size, in worst case prevents any real reduction in peak inference memory, limiting the practical impact of SVD compression for on-device deployment. To address this bottleneck, we propose FlashSVD, an end-to-end, rank-aware streaming inference framework for SVD-compressed LLMs. FlashSVD integrates seamlessly with any SVD-based model and directly fuses low-rank projection kernels into self-attention and feed-forward pipelines. This design avoids materializing large activation buffers by streaming small tiles of truncated factors through on-chip SRAM, performing on-the-fly multiplication and reduction, and immediately evicting results–thus preserving high GPU occupancy without introducing latency. On standard benchmarks (e.g., BERT-Base), FlashSVD reduces peak activation memory by up to 70.2% and transient memory by 75%, with zero accuracy loss against low-rank baselines, enabling truly memory-efficient deployment of low-rank LLMs.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zishan Shao , Yixiao Wang , Qinsi Wang , Ting Jiang , Zhixu Du , Hancheng Ye , Danyang Zhuo , Yiran Chen , Hai Helen Li

Topics

Artificial Intelligence > Core AI > Model Compression Machine Learning > Application Areas > Efficient Computing

Keywords

model compression singular value decomposition streaming inference memory efficiency low-rank model

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026