Evaluation

515 directly classified papers

Papers

Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization ACL 2025

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models ACL 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models ACL 2025

Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset ACL 2025

HalluLens: LLM Hallucination Benchmark ACL 2025

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models ACL 2025

On Many-Shot In-Context Learning for Long-Context Evaluation ACL 2025

BIG-Bench Extra Hard ACL 2025

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability ACL 2025

Language Model Probabilities are Not Calibrated in Numeric Contexts ACL 2025

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation ACL 2025

Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments ACL 2025

An Analysis of Datasets, Metrics and Models in Keyphrase Generation ACL 2025

Theory of Mind in Large Language Models: Assessment and Enhancement ACL 2025

EXECUTE: A Multilingual Benchmark for LLM Token Understanding ACL 2025

Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm ACL 2025

ARC ‘Challenge’ Is Not That Challenging ACL 2025

Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization ACL 2025

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4 ACL 2025

VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation ACL 2025

Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance ACL 2025

LSC-Eval: A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data ACL 2025

SEA-HELM: Southeast Asian Holistic Evaluation of Language Models ACL 2025

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation EMNLP 2025