Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
Identifying Open Challenges in Language Identification
ACL 2025
Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs
NAACL 2025
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
ACL 2025
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
NAACL 2025
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
ACL 2025
LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs
NAACL 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
Advocating Character Error Rate for Multilingual ASR Evaluation
NAACL 2025
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
ACL 2025
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
NAACL 2025
Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
ACL 2025
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
EMNLP 2025
HalluLens: LLM Hallucination Benchmark
ACL 2025
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
ACL 2025
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
ACL 2025
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
EMNLP 2025
On Many-Shot In-Context Learning for Long-Context Evaluation
ACL 2025
RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks
EMNLP 2025
BIG-Bench Extra Hard
ACL 2025
Trade-offs in Image Generation: How Do Different Dimensions Interact?
ICCV 2025
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
ACL 2025
Accurate Estimation of Feature Importance Faithfulness for Tree Models
AAAI 2025
Language Model Probabilities are Not Calibrated in Numeric Contexts
ACL 2025
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
EMNLP 2025
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
COLING 2025