TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

Songshuo Lu; Hua Wang; Yutian Rong; Zhi Chen; Yaohua Tang

2025 EMNLP EMNLP 2025

TurboRAG: Accelerating Retrieval-Augmented Generation with Precomputed KV Caches for Chunked Text

Abstract

AbstractCurrent Retrieval-Augmented Generation (RAG) systems concatenate and process numerous retrieved document chunks for prefill which requires a large volume of computation, therefore leading to significant latency in time-to-first-token (TTFT). To reduce the computation overhead as well as TTFT, we introduce TurboRAG, a hybrid offline–online paradigm that (i) pre‐computes chunk‐level key-value (KV) caches, (ii) stitches them together at inference time using independent–attention and reordered‐RoPE techniques, and (iii) preserves answer quality without changing the model architecture. Hence, online computation of KV caches is eliminated during inference. Our approach is applicable to most existing large language models and their applications without any requirement in modification of models and inference systems. Experimental results across a suite of RAG benchmarks demonstrate that TurboRAG reduces TTFT by up to 9.4x compared to the conventional RAG systems (on an average of 8.6x), but reserving comparable performance to the standard RAG systems.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — prefill computation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Songshuo Lu , Hua Wang , Yutian Rong , Zhi Chen , Yaohua Tang

Topics

Machine Learning > Application Areas > Efficient Computing Natural Language Processing > Applications > Information Retrieval Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Machine Learning > Learning Types > Retrieval-Augmented Generation Deep Learning > Optimization & Theory > Optimization

Keywords

retrieval augmented generation inference optimization retrieval-augmented generation inference acceleration kv cache key-value cache latency optimization text chunking large language model prefill computation key value cache chunked text time to first token

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025