2025 ACL ACL 2025

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models

Abstract

AbstractCurrent evaluations of mathematical skills in Large Language Models are constrained by benchmarks lacking scope, particularly for multi-modal problems — frequently relying on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets.To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, with **20%** incorporating visual elements. Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release 𝜇**-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks enabling precise assessment of these judges.Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging, requiring the most advanced models to achieve meaningfully high performance, even still peaking at an imperfect F1-score of 90.1%.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Natural Language Processing
🧭 Keyword Pioneer — university-level problem
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio