Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models

Shubham Agarwal; Subrata Mitra; Sarthak Chakraborty; Srikrishna Karanam; Koyel Mukherjee; Shiv Kumar Saini

NSDI 2024

Abstract

Text-to-image generation using diffusion models has seen explosive popularity owing to their ability to produce high-quality images that adhere to text prompts. However, diffusion models go through a large number of iterative denoising steps and are resource-intensive, requiring expensive GPUs and incurring considerable latency. In this paper, we introduce a novel approximate-caching technique that reduces such iterative denoising steps by reusing intermediate noise states created during a prior image generation. Based on this idea, we present an end-to-end text-to-image generation system, NIRVANA, that uses approximate caching with a novel cache management policy to provide 21% GPU compute savings, 19.8% end-to-end latency reduction, and 19% dollar savings on two real production workloads. We further present an extensive characterization of real production text-to-image prompts from the perspective of caching, popularity, and reuse of intermediate states in a large production environment.
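The core idea in the abstract, resuming denoising from an intermediate noise state saved during an earlier generation, can be illustrated with a toy sketch. This is not NIRVANA's implementation. The abstract does not say how similar prompts are found or how many steps are skipped, so the cosine-similarity lookup, the `SKIP_TABLE` thresholds, the `CacheEntry` layout, and the `denoise_step` stand-in are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class CacheEntry:
    embedding: list   # prompt embedding of an earlier request (hypothetical layout)
    states: dict      # step index -> intermediate noise state saved at that step

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

# Assumed mapping from prompt similarity to the number of steps to skip.
# These thresholds are illustrative only and do not come from the paper.
SKIP_TABLE = [(0.95, 25), (0.85, 15), (0.75, 5)]

def steps_to_skip(similarity):
    """More similar prompts allow reuse of a later (less noisy) state."""
    for threshold, k in SKIP_TABLE:
        if similarity >= threshold:
            return k
    return 0

def denoise_step(state, step):
    """Toy stand-in for one denoising iteration of a diffusion model."""
    return [0.9 * x for x in state]

def generate(prompt_emb, cache, total_steps=50, initial_noise=None):
    """Run denoising, starting from a cached state when one is close enough."""
    state = list(initial_noise) if initial_noise else [1.0] * 4
    start = 0
    best = max(cache, key=lambda e: cosine(prompt_emb, e.embedding), default=None)
    if best is not None:
        k = steps_to_skip(cosine(prompt_emb, best.embedding))
        if k > 0 and k in best.states:
            # Cache hit: resume from the state saved after k steps.
            state, start = list(best.states[k]), k
    steps_run = 0
    for step in range(start, total_steps):
        state = denoise_step(state, step)
        steps_run += 1
    return state, steps_run
```

In this sketch, a request whose prompt embedding closely matches a cached one runs only the remaining `total_steps - k` iterations, which is where compute savings would come from. A request with no sufficiently similar cached prompt falls back to the full run from fresh noise.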

Topics

Artificial Intelligence > Core AI > Foundation Models

Keywords

text-to-image generation, diffusion model, inference optimization, latency reduction, approximate caching
