Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation

Simin Chen; Yiming Chen; Zexin Li; Yifan Jiang; Zhongwei Wan; Yixin He; Dezhi Ran; Tianle Gu; Haizhou Li; Tao Xie; Baishakhi Ray

EMNLP 2025

Abstract

In the era of evaluating large language models (LLMs), data contamination has become an increasingly prominent concern. To address this risk, LLM benchmarking has evolved from a *static* to a *dynamic* paradigm. In this work, we conduct an in-depth analysis of existing *static* and *dynamic* benchmarks for evaluating LLMs. We first examine methods that enhance *static* benchmarks and identify their inherent limitations. We then highlight a critical gap: the lack of standardized criteria for evaluating *dynamic* benchmarks. Based on this observation, we propose a series of optimal design principles for *dynamic* benchmarking and analyze the limitations of existing *dynamic* benchmarks. This survey provides a concise yet comprehensive overview of recent advancements in data contamination research, offering valuable insights and a clear guide for future research efforts. We maintain a GitHub repository to continuously collect both static and dynamic benchmarking methods for LLMs. The repository can be found at this link.

Topics

Machine Learning > Optimization & Theory > Theory; Machine Learning > Application Areas > Efficient Computing; Natural Language Processing > Resources & Methods > Large Language Models; Artificial Intelligence > Core AI > Large Language Models; Machine Learning > Learning Types > Deep Learning; Machine Learning > Learning Types > Evaluation

Keywords

benchmark evaluation; data contamination; dynamic evaluation; large language model; static evaluation
