Evaluation

345 directly classified papers

Papers

MIBench: Evaluating Multimodal Large Language Models over Multiple Images EMNLP 2024

Assessing and Verifying Task Utility in LLM-Powered Applications EMNLP 2024

Reasoning in Token Economies: Budget-Aware Evaluation of LLM Reasoning Strategies EMNLP 2024

Re-Evaluating Evaluation for Multilingual Summarization EMNLP 2024

GuardBench: A Large-Scale Benchmark for Guardrail Models EMNLP 2024

MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration EMNLP 2024

Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards EMNLP 2024

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation EMNLP 2024

POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP 2024

Downstream Trade-offs of a Family of Text Watermarks EMNLP 2024

TOWER: Tree Organized Weighting for Evaluating Complex Instructions EMNLP 2024

MVP-Bench: Can Large Vision-Language Models Conduct Multi-level Visual Perception Like Humans? EMNLP 2024

On Leakage of Code Generation Evaluation Datasets EMNLP 2024

Compare without Despair: Reliable Preference Evaluation with Generation Separability EMNLP 2024

TuringQ: Benchmarking AI Comprehension in Theory of Computation EMNLP 2024

Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation EMNLP 2024

Easy to Decide, Hard to Agree: Reducing Disagreements Between Saliency Methods ACL 2023

MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types ACL 2023

A Better Way to Do Masked Language Model Scoring ACL 2023

ReCode: Robustness Evaluation of Code Generation Models ACL 2023

What’s the Meaning of Superhuman Performance in Today’s NLU? ACL 2023

On the Blind Spots of Model-Based Evaluation Metrics for Text Generation ACL 2023

Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale ACL 2023

On “Scientific Debt” in NLP: A Case for More Rigour in Language Model Pre-Training Research ACL 2023

On the Evaluation of Neural Selective Prediction Methods for Natural Language Processing ACL 2023