KRR: Efficient and Scalable Kernel Record Replay

Tianren Zhang; Sishuai Gong; Pedro Fonseca

OSDI 2025

Abstract

Modern kernels are large, complex, and plagued with bugs. Unfortunately, their size and complexity make kernel failures very challenging for developers to diagnose, since failures encountered in deployment are often notoriously difficult to reproduce. Although record-replay techniques provide the powerful ability to accurately record a failed execution and deterministically replay it, enabling advanced manual and automated analysis techniques, they are inefficient and do not scale with modern I/O-intensive, concurrent workloads. This paper introduces KRR, a kernel record-replay framework that provides a highly efficient execution recording mechanism by narrowing the scope of the record and replay boundary to the kernel. Unlike previous whole-stack record-replay approaches, KRR adopts a split-recorder design that employs the guest and the host to jointly record the kernel execution. Our evaluation demonstrates that KRR scales efficiently up to 8 cores across a range of workloads, including kernel compilation, RocksDB, and Nginx. When recording 8-core VMs that run RocksDB and kernel compilation, KRR incurs only a 1.52× to 2.79× slowdown compared to native execution, while traditional whole-VM RR suffers an 8.97× to 29.94× slowdown. We validate that KRR is practical and has a broad recording scope by reproducing 17 bugs across different Linux versions, including 6 non-deterministic bugs and 5 high-risk CVEs; KRR was able to record and reproduce all but one non-deterministic bug.

Topics

Machine Learning > Optimization & Theory > Optimization

Machine Learning > Application Areas > Efficient Computing

Keywords

virtual machine, performance overhead, deterministic execution, record and replay, kernel debugging, deterministic replay, bug reproduction, kernel record replay, execution recording, concurrent workload, record replay

Related papers

OS Rendering Service Made Parallel with Out-of-Order Execution and In-Order Commit 2025

Deriving Semantic Checkers from Tests to Detect Silent Failures in Production Distributed Systems 2025

FineMem: Breaking the Allocation Overhead vs. Memory Waste Dilemma in Fine-Grained Disaggregated Memory Management 2025

Tigon: A Distributed Database for a CXL Pod 2025

Scalio: Scaling up DPU-based JBOF Key-value Store with NVMe-oF Target Offload 2025