Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
Making a Long Story Short in Conversation Modeling
EACL 2024
Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels
EACL 2024
Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers
AISTATS 2024
Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora
EMNLP 2024
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
ACL 2024
Marathon: A Race Through the Realm of Long Context with Large Language Models
ACL 2024
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
NIPS 2024
Benchmark Data Repositories for Better Benchmarking
NIPS 2024
Is Cross-validation the Gold Standard to Estimate Out-of-sample Model Performance?
NIPS 2024
ANAH: Analytical Annotation of Hallucinations in Large Language Models
ACL 2024
TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases
NIPS 2024
AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents
NIPS 2024
Adaptive Labeling for Efficient Out-of-distribution Model Evaluation
NIPS 2024
MSLC24 Submissions to the General Machine Translation Task
EMNLP 2024
MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
SEMEVAL 2024
On the Worst Prompt Performance of Large Language Models
NIPS 2024
AMLB: an AutoML Benchmark
JMLR 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
NIPS 2024
Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics
NIPS 2024
CriticEval: Evaluating Large-scale Language Model as Critic
NIPS 2024
Paloma: A Benchmark for Evaluating Language Model Fit
NIPS 2024
Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning
AISTATS 2024
POSIX: A Prompt Sensitivity Index For Large Language Models
EMNLP 2024
BenchIE^FL: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark
ACL 2024
A Cross-Domain Benchmark for Active Learning
NIPS 2024