Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
RepMatch: Quantifying Cross-Instance Similarities in Representation Space
EMNLP 2024
Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models
EMNLP 2024
DataTales: A Benchmark for Real-World Intelligent Data Narration
EMNLP 2024
Navigating the Metrics Maze: Reconciling Score Magnitudes and Accuracies
ACL 2024
Assessing “Implicit” Retrieval Robustness of Large Language Models
EMNLP 2024
Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
EMNLP 2024
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
EMNLP 2024
Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-Context Models
EMNLP 2024
Open Ko-LLM Leaderboard: Evaluating Large Language Models in Korean with Ko-H5 Benchmark
ACL 2024
A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models
EMNLP 2024
CUTE: Measuring LLMs’ Understanding of Their Tokens
EMNLP 2024
Calibrating the Confidence of Large Language Models by Eliciting Fidelity
EMNLP 2024
Evaluating Large Language Models via Linguistic Profiling
EMNLP 2024
Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning
AISTATS 2024
In Search of the Long-Tail: Systematic Generation of Long-Tail Inferential Knowledge via Logical Rule Guided Search
EMNLP 2024
Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
EMNLP 2024
Uncertainty in Language Models: Assessment through Rank-Calibration
EMNLP 2024
MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
NIPS 2024
AMLB: an AutoML Benchmark
JMLR 2024
Probing Language Models for Pre-training Data Detection
ACL 2024
NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes
ACL 2024
CriticEval: Evaluating Large-scale Language Model as Critic
NIPS 2024
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
NIPS 2024
Full Bayesian Significance Testing for Neural Networks
AAAI 2024
On the Worst Prompt Performance of Large Language Models
NIPS 2024