RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework

Kunlun Zhu; Yifan Luo; Dingling Xu; Yukun Yan; Zhenghao Liu; Shi Yu; Ruobing Wang; Shuo Wang; Yishan Li; Nan Zhang; Xu Han; Zhiyuan Liu; Maosong Sun

ACL 2025

Abstract

Retrieval-Augmented Generation (RAG) is a powerful approach that enables large language models (LLMs) to incorporate external knowledge. However, evaluating the effectiveness of RAG systems in specialized scenarios remains challenging due to the high costs of data construction and the lack of suitable evaluation metrics. This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios by generating high-quality documents, questions, answers, and references through a schema-based pipeline. With a focus on factual accuracy, we propose three novel metrics (Completeness, Hallucination, and Irrelevance) to evaluate LLM-generated responses rigorously. Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples. Furthermore, the use of LLMs for scoring the proposed metrics demonstrates a high level of consistency with human evaluations. RAGEval establishes a new paradigm for evaluating RAG systems in real-world applications. The code and dataset are released at https://github.com/OpenBMB/RAGEval.

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Optimization & Theory > Optimization
Natural Language Processing > Applications > Information Retrieval
Data Science & Analytics > Applications > Information Retrieval
Machine Learning > Learning Types > Evaluation
Deep Learning > Learning Types > Retrieval-Augmented Generation

Keywords

factual accuracy; evaluation metric; retrieval-augmented generation; hallucination detection; dataset generation; scenario-specific evaluation

Related papers

Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights (2025)

CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision (2025)

Structural Deep Encoding for Table Question Answering (2025)

Vision-aided Unsupervised Constituency Parsing with Multi-MLLM Debating (2025)