Evaluation

515 directly classified papers

Papers

Paloma: A Benchmark for Evaluating Language Model Fit NIPS 2024

A Cross-Domain Benchmark for Active Learning NIPS 2024

ReMI: A Dataset for Reasoning with Multiple Images NIPS 2024

ANAH-v2: Scaling Analytical Hallucination Annotation of Large Language Models NIPS 2024

PANORAMIA: Privacy Auditing of Machine Learning Models without Retraining NIPS 2024

SETLEXSEM CHALLENGE: Using Set Operations to Evaluate the Lexical and Semantic Robustness of Language Models NIPS 2024

RoleAgent: Building, Interacting, and Benchmarking High-quality Role-Playing Agents from Scripts NIPS 2024

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models NIPS 2024

A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets NIPS 2024

Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora EMNLP 2024

I2EBench: A Comprehensive Benchmark for Instruction-based Image Editing NIPS 2024

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning NIPS 2024

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content NIPS 2024

Understanding the Effects of Noise in Text-to-SQL: An Examination of the BIRD-Bench Benchmark ACL 2024

Efficient multi-prompt evaluation of LLMs NIPS 2024

BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices NIPS 2024

Overcoming Common Flaws in the Evaluation of Selective Classification Systems NIPS 2024

DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency CVPR 2024

Meta-Evaluation of Sentence Simplification Metrics COLING 2024

Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards EMNLP 2024

A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions CVPR 2024

On the Content Bias in Fréchet Video Distance CVPR 2024

VBench: Comprehensive Benchmark Suite for Video Generative Models CVPR 2024

Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It CVPR 2024

Towards a Perceptual Evaluation Framework for Lighting Estimation CVPR 2024