Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
EMNLP 2025
Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy
NAACL 2025
Investigating Value-Reasoning Reliability in Small Large Language Models
EMNLP 2025
Are Checklists Really Useful for Automatic Evaluation of Generative Tasks?
EMNLP 2025
Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks
EMNLP 2025
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
EMNLP 2025
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
EMNLP 2025
Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models
EMNLP 2025
EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models
EMNLP 2025
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
EMNLP 2025
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
EMNLP 2025
In Benchmarks We Trust ... Or Not?
EMNLP 2025
Evaluating the Evaluators: Are readability metrics good measures of readability?
EMNLP 2025
TFDP: Token-Efficient Disparity Audits for Autoregressive LLMs via Single-Token Masked Evaluation
EMNLP 2025
From Parameters to Performance: A Data-Driven Study on LLM Structure and Development
EMNLP 2025
Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores
EMNLP 2025
Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
EMNLP 2025
Revealing and Mitigating the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing
EMNLP 2025
Benchmarking LLMs on Semantic Overlap Summarization
EMNLP 2025
KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts
EMNLP 2025
TounsiBench: Benchmarking Large Language Models for Tunisian Arabic
EMNLP 2025
Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards
EMNLP 2025
GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
EMNLP 2025
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
EMNLP 2025
Feeding Two Birds or Favoring One? Adequacy–Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation
EMNLP 2025