Evaluation
345 directly classified papers
Papers per year
2014: 1
2016: 3
2017: 1
2018: 9
2019: 21
2020: 34
2021: 32
2022: 50
2023: 28
2024: 90
2025: 76
Papers
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
EMNLP 2024
Assessing and Verifying Task Utility in LLM-Powered Applications
EMNLP 2024
Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies
EMNLP 2024
Re-Evaluating Evaluation for Multilingual Summarization
EMNLP 2024
GuardBench: A Large-Scale Benchmark for Guardrail Models
EMNLP 2024
MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration
EMNLP 2024
Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
EMNLP 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
EMNLP 2024
POSIX: A Prompt Sensitivity Index For Large Language Models
EMNLP 2024
Downstream Trade-offs of a Family of Text Watermarks
EMNLP 2024
TOWER: Tree Organized Weighting for Evaluating Complex Instructions
EMNLP 2024
MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans?
EMNLP 2024
On Leakage of Code Generation Evaluation Datasets
EMNLP 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
EMNLP 2024
TuringQ: Benchmarking AI Comprehension in Theory of Computation
EMNLP 2024
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
EMNLP 2024
Easy to Decide, Hard to Agree: Reducing Disagreements Between Saliency Methods
ACL 2023
MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types
ACL 2023
A Better Way to Do Masked Language Model Scoring
ACL 2023
ReCode: Robustness Evaluation of Code Generation Models
ACL 2023
What’s the Meaning of Superhuman Performance in Today’s NLU?
ACL 2023
On the Blind Spots of Model-Based Evaluation Metrics for Text Generation
ACL 2023
Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale
ACL 2023
On “Scientific Debt” in NLP: A Case for More Rigour in Language Model Pre-Training Research
ACL 2023
On the Evaluation of Neural Selective Prediction Methods for Natural Language Processing
ACL 2023