Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
ACL 2025
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
ACL 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
ACL 2025
Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
ACL 2025
HalluLens: LLM Hallucination Benchmark
ACL 2025
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
ACL 2025
On Many-Shot In-Context Learning for Long-Context Evaluation
ACL 2025
BIG-Bench Extra Hard
ACL 2025
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
ACL 2025
Language Model Probabilities are Not Calibrated in Numeric Contexts
ACL 2025
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
ACL 2025
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
ACL 2025
An Analysis of Datasets, Metrics and Models in Keyphrase Generation
ACL 2025
Theory of Mind in Large Language Models: Assessment and Enhancement
ACL 2025
EXECUTE: A Multilingual Benchmark for LLM Token Understanding
ACL 2025
Do Language Models Understand the Cognitive Tasks Given to Them? Investigations with the N-Back Paradigm
ACL 2025
ARC ‘Challenge’ Is Not That Challenging
ACL 2025
Model Performance-Guided Evaluation Data Selection for Effective Prompt Optimization
ACL 2025
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
ACL 2025
VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation
ACL 2025
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance
ACL 2025
LSC-Eval: A General Framework to Evaluate Methods for Assessing Dimensions of Lexical Semantic Change Using LLM-Generated Synthetic Data
ACL 2025
SEA-HELM: Southeast Asian Holistic Evaluation of Language Models
ACL 2025
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
EMNLP 2025