DCR: Quantifying Data Contamination in LLMs Evaluation

Cheng Xu; Nan Yan; Shuhao Guan; Changhong Jin; Yuke Mei; Yibing Guo; Tahar Kechadi

EMNLP 2025

Abstract

The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data during training, inflating performance metrics and undermining genuine assessment of generalization. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC risk at four levels of granularity: semantic, informational, data, and label. By synthesizing the per-level contamination scores with a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted with the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
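To make the pipeline in the abstract concrete, the following is a minimal sketch of how four per-level contamination scores might be fused into a single factor and used to discount raw accuracy. It is not the authors' implementation: the abstract does not specify the membership functions, the rule base, or the adjustment formula, so the triangular memberships, the rule outputs in `RULE_OUTPUT`, the equal weighting of levels, and the `raw * (1 - factor)` adjustment are all assumptions chosen for illustration, and the function names are hypothetical.

```python
# Illustrative sketch only: membership shapes, rule outputs, equal weighting,
# and the adjustment formula are assumptions, not taken from the paper.

LEVELS = ("semantic", "informational", "data", "label")

# Assumed crisp output attached to each fuzzy set (zero-order Sugeno style).
RULE_OUTPUT = {"low": 0.0, "medium": 0.3, "high": 1.0}


def tri(x, a, b, c):
    """Triangular membership peaking at b and zero outside the open interval (a, c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def fuzzify(score):
    """Map a contamination score in [0, 1] to low/medium/high membership degrees."""
    return {
        "low": tri(score, -0.5, 0.0, 0.5),
        "medium": tri(score, 0.0, 0.5, 1.0),
        "high": tri(score, 0.5, 1.0, 1.5),
    }


def level_risk(score):
    """Defuzzify one level as the membership-weighted average of the rule outputs."""
    degrees = fuzzify(score)
    total = sum(degrees.values())
    return sum(degrees[name] * RULE_OUTPUT[name] for name in degrees) / total


def dcr_factor(scores):
    """Combine the four level scores into one factor in [0, 1] (assumed equal weights)."""
    for level in LEVELS:
        if not 0.0 <= scores[level] <= 1.0:
            raise ValueError(f"{level} score must lie in [0, 1]")
    return sum(level_risk(scores[level]) for level in LEVELS) / len(LEVELS)


def adjusted_accuracy(raw_accuracy, factor):
    """Discount raw accuracy by the factor (assumed multiplicative adjustment)."""
    return raw_accuracy * (1.0 - factor)


if __name__ == "__main__":
    example = {"semantic": 0.2, "informational": 0.4, "data": 0.7, "label": 0.1}
    factor = dcr_factor(example)
    print(f"factor={factor:.3f}, adjusted={adjusted_accuracy(0.90, factor):.3f}")
```

In this sketch, a model with no measured contamination at any level keeps its raw accuracy, while higher scores at any level shrink the reported figure. Which levels should weigh more, and how steeply, is exactly what a real rule base would encode.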

Topics

Artificial Intelligence > Core AI > Responsible AI; Machine Learning > Application Areas > Efficient Computing; Natural Language Processing > Applications > Text Classification; Artificial Intelligence > Core AI > Large Language Models; Machine Learning > Learning Types > Deep Learning; Machine Learning > Learning Types > Evaluation; Deep Learning > Optimization & Theory > Evaluation

Keywords

benchmark evaluation; sentiment analysis; model generalization; data contamination; fake news detection; large language model; fuzzy inference system
