Evaluation

515 directly classified papers

Papers

Making a Long Story Short in Conversation Modeling EACL 2024

Fréchet Distance for Offline Evaluation of Information Retrieval Systems with Sparse Labels EACL 2024

Approximate Leave-one-out Cross Validation for Regression with $\ell_1$ Regularizers AISTATS 2024

Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora EMNLP 2024

NPHardEval: Dynamic Benchmark on Reasoning Ability of Large Language Models via Complexity Classes ACL 2024

Marathon: A Race Through the Realm of Long Context with Large Language Models ACL 2024

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark NIPS 2024

Benchmark Data Repositories for Better Benchmarking NIPS 2024

Is Cross-validation the Gold Standard to Estimate Out-of-sample Model Performance? NIPS 2024

ANAH: Analytical Annotation of Hallucinations in Large Language Models ACL 2024

TabularBench: Benchmarking Adversarial Robustness for Tabular Deep Learning in Real-world Use-cases NIPS 2024

AgentBoard: An Analytical Evaluation Board of Multi-turn LLM Agents NIPS 2024

Adaptive Labeling for Efficient Out-of-distribution Model Evaluation NIPS 2024

MSLC24 Submissions to the General Machine Translation Task EMNLP 2024

MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity SEMEVAL 2024

On the Worst Prompt Performance of Large Language Models NIPS 2024

AMLB: an AutoML Benchmark JMLR 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress? NIPS 2024

Navigating the Maze of Explainable AI: A Systematic Approach to Evaluating Methods and Metrics NIPS 2024

CriticEval: Evaluating Large-scale Language Model as Critic NIPS 2024

Paloma: A Benchmark for Evaluating Language Model Fit NIPS 2024

Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning AISTATS 2024

POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP 2024

BenchIE^FL: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark ACL 2024

A Cross-Domain Benchmark for Active Learning NIPS 2024