In-Situ Eval: A Modular Framework for Custom and Real-Time RAG Benchmarking

Ritvik Garimella; Kaushik Roy; Chathurangi Shyalika; Amit Sheth

AAAI 2026

Abstract

Retrieval-Augmented Generation (RAG) has become the standard approach for integrating domain knowledge into Large Language Models (LLMs). However, fair comparison of RAG pipelines remains difficult: data preparation is often ad hoc, subsampling methods are opaque, parameters vary across implementations, and evaluation is fragmented. We present In-Situ Eval, a unified and reproducible framework that operationalizes the full RAG pipeline with configurable subsampling strategies and both RAG-specific and generic evaluation metrics. The platform supports two execution modes: an offline Dataset mode for evaluating precomputed outputs, and a live Retrieval mode for benchmarking RAG variants with state-of-the-art LLMs. Users can flexibly select datasets, retrieval techniques, models, and metrics, enabling side-by-side comparisons, ablations, and targeted analyses. This holistic approach reduces computational costs, clarifies the impact of subsampling techniques, and provides actionable insights for real-world deployments. By facilitating transparent, customizable, and interactive benchmarking, In-Situ Eval empowers both researchers and practitioners to make informed decisions in adapting RAG pipelines to domain-specific needs.
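As a rough illustration of the two execution modes and the role of reproducible subsampling, the following minimal sketch may help. It is not the In-Situ Eval API: `EvalConfig`, `subsample`, `exact_match`, `evaluate`, and the example dictionary keys are all hypothetical names chosen for this sketch. Seeded uniform random sampling and exact match stand in for whichever subsampling strategies and metrics the framework actually provides.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class EvalConfig:
    """Hypothetical run configuration; field names are illustrative only."""
    mode: str          # "dataset" (precomputed outputs) or "retrieval" (live)
    sample_size: int   # number of examples to evaluate
    seed: int = 0      # fixed seed so the same subsample is drawn each run


def subsample(examples: Sequence[dict], size: int, seed: int) -> List[dict]:
    """Seeded uniform random subsample (an assumed stand-in strategy)."""
    rng = random.Random(seed)
    return rng.sample(list(examples), min(size, len(examples)))


def exact_match(prediction: str, reference: str) -> float:
    """Generic metric: 1.0 if the normalized strings are equal, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    examples: Sequence[dict],
    config: EvalConfig,
    answer_fn: Optional[Callable[[str], str]] = None,
) -> float:
    """Return the mean exact-match score over the subsample."""
    if config.mode not in ("dataset", "retrieval"):
        raise ValueError(f"unknown mode: {config.mode!r}")
    if config.mode == "retrieval" and answer_fn is None:
        raise ValueError("retrieval mode needs an answer_fn (the live pipeline)")

    chosen = subsample(examples, config.sample_size, config.seed)
    if not chosen:
        return 0.0

    scores = []
    for ex in chosen:
        if config.mode == "dataset":
            # Offline: score an output that was computed ahead of time.
            prediction = ex["prediction"]
        else:
            # Live: call the (hypothetical) RAG pipeline for each question.
            prediction = answer_fn(ex["question"])
        scores.append(exact_match(prediction, ex["reference"]))
    return sum(scores) / len(scores)
```

Under these assumptions, two pipeline variants can be compared fairly by passing each one as `answer_fn` with the same `seed`, so both are scored on an identical subsample.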

Topics

Machine Learning > Application Areas > Efficient Computing
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Natural Language Inference

Keywords

retrieval-augmented generation; evaluation metrics; pipeline optimization; benchmarking framework; real-time evaluation; subsampling technique
