Benchmarking LLMs’ Mathematical Reasoning with Unseen Random Variables Questions

Zijin Hong; Hao Wu; Su Dong; Junnan Dong; Yilin Xiao; Yujing Zhang; Zhu Wang; Feiran Huang; Linyi Li; Hongxia Yang; Xiao Huang

AAAI 2026

Abstract

Recent studies have raised significant concerns regarding the reliability of current mathematical benchmarks, highlighting key limitations such as simplistic design and potential data contamination that undermine evaluation accuracy. Consequently, developing a reliable benchmark that effectively evaluates the genuine mathematical reasoning capabilities of large language models (LLMs) remains a critical challenge. To address these concerns, we propose RV-Bench, a novel evaluation methodology for Benchmarking LLMs with Random Variables in mathematical reasoning. Specifically, we develop question-generating functions to produce random variable questions (RVQs), whose background content mirrors the original benchmark problems, but with randomized variable combinations, rendering them "unseen" to LLMs. Models must completely understand the inherent question pattern to correctly answer RVQs with diverse variable combinations. Thus, an LLM's genuine reasoning capability is reflected through its accuracy and robustness on RV-Bench. We conducted extensive experiments on over 30 representative LLMs across more than 1,000 RVQs. Our findings reveal that LLMs exhibit a proficiency imbalance between encountered and "unseen" data distributions. Furthermore, RV-Bench reveals that proficiency generalization across similar mathematical reasoning tasks is limited, but we verified that it can still be effectively elicited through test-time scaling.
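The idea of a question-generating function can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy word problem, the names `RVQ`, `generate_rvq`, `reference_solver`, and `accuracy`, and the simple exact-match scoring are hypothetical and are not taken from RV-Bench itself. The sketch fixes the background text of one problem, samples the variables at random, and computes the ground-truth answer from the sampled values.

```python
import random
import re
from dataclasses import dataclass


@dataclass
class RVQ:
    """One random variable question with its ground-truth answer."""
    question: str
    answer: int


def generate_rvq(rng: random.Random) -> RVQ:
    """Hypothetical question-generating function for a toy word problem.

    The background text is fixed; only the variables are resampled.
    """
    price = rng.randint(2, 20)
    count = rng.randint(3, 12)
    # Keep the budget at least as large as the total cost so the answer is >= 0.
    budget = price * count + rng.randint(0, 50)
    question = (
        f"A notebook costs {price} dollars. Sam buys {count} notebooks "
        f"and pays with {budget} dollars. How much change does Sam receive?"
    )
    return RVQ(question=question, answer=budget - price * count)


def reference_solver(question: str) -> int:
    """Stand-in for a model: reads the three numbers and applies the pattern."""
    price, count, budget = (int(x) for x in re.findall(r"\d+", question))
    return budget - price * count


def accuracy(solver, n_instances: int = 5, seed: int = 0) -> float:
    """Exact-match accuracy of `solver` over freshly sampled instances."""
    rng = random.Random(seed)
    rvqs = [generate_rvq(rng) for _ in range(n_instances)]
    correct = sum(solver(q.question) == q.answer for q in rvqs)
    return correct / n_instances
```

In this sketch, a solver that has only memorized the answer to one fixed instance would fail on most sampled variable combinations, whereas one that applies the underlying pattern (such as `reference_solver`) answers every instance correctly. Seeding the generator keeps the sampled instances reproducible across evaluated models.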

Topics

Artificial Intelligence > Core AI > Foundation Models
Machine Learning > Optimization & Theory > Learning Theory

Keywords

benchmark evaluation, mathematical reasoning, data contamination, test-time scaling, random variable question, unseen data distribution
