Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
Paloma: A Benchmark for Evaluating Language Model Fit
NIPS 2024
A Cross-Domain Benchmark for Active Learning
NIPS 2024
ReMI: A Dataset for Reasoning with Multiple Images
NIPS 2024
ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models
NIPS 2024
PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining
NIPS 2024
SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models
NIPS 2024
RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts
NIPS 2024
MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models
NIPS 2024
A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets
NIPS 2024
Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora
EMNLP 2024
I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing
NIPS 2024
Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning
NIPS 2024
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
NIPS 2024
Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark
ACL 2024
Efficient multi-prompt evaluation of LLMs
NIPS 2024
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
NIPS 2024
Overcoming Common Flaws in the Evaluation of Selective Classification Systems
NIPS 2024
DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency
CVPR 2024
Meta-Evaluation of Sentence Simplification Metrics
COLING 2024
Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
EMNLP 2024
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
CVPR 2024
On the Content Bias in Fréchet Video Distance
CVPR 2024
VBench: Comprehensive Benchmark Suite for Video Generative Models
CVPR 2024
Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It
CVPR 2024
Towards a Perceptual Evaluation Framework for Lighting Estimation
CVPR 2024