Evaluation

345 directly classified papers

Papers

Benchmarking Segmentation Models with Mask-Preserved Attribute Editing CVPR 2024

LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection CVPR 2024

Rethinking FID: Towards a Better Evaluation Metric for Image Generation CVPR 2024

FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models CVPR 2024

Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models EMNLP 2024

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object CVPR 2024

VBench: Comprehensive Benchmark Suite for Video Generative Models CVPR 2024

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP 2024

Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models EMNLP 2024

VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation EMNLP 2024

On Training Data Influence of GPT Models EMNLP 2024

MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration EMNLP 2024

Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-Context Models EMNLP 2024

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations EMNLP 2024

GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation CVPR 2024

Downstream Trade-offs of a Family of Text Watermarks EMNLP 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024

POSIX: A Prompt Sensitivity Index For Large Language Models EMNLP 2024

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation CVPR 2024

TOWER: Tree Organized Weighting for Evaluating Complex Instructions EMNLP 2024

The Instinctive Bias: Spurious Images lead to Illusion in MLLMs EMNLP 2024

L-Eval: Instituting Standardized Evaluation for Long Context Language Models ACL 2024

Scaling Laws of Synthetic Images for Model Training ... for Now CVPR 2024

Greed is All You Need: An Evaluation of Tokenizer Inference Methods ACL 2024

VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation EMNLP 2024