Evaluation

345 directly classified papers

Papers

Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation ACL 2025

Redundancy Principles for MLLMs Benchmarks ACL 2025

JuStRank: Benchmarking LLM Judges for System Ranking ACL 2025

When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models ACL 2025

Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria ACL 2025

GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents ACL 2025

CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction ACL 2025

Exploring Activation Patterns of Parameters in Language Models AAAI 2025

Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models ACL 2025

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models ACL 2025

TripleFact: Defending Data Contamination in the Evaluation of LLM-driven Fake News Detection ACL 2025

A Unifying Information-theoretic Perspective on Evaluating Generative Models AAAI 2025

Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching CVPR 2025

RESQUE: Quantifying Estimator to Task and Distribution Shift for Sustainable Model Reusability AAAI 2025

How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching AAAI 2025

FROC: Building Fair ROC from a Trained Classifier AAAI 2025

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses AAAI 2025

Training on the Benchmark Is Not All You Need AAAI 2025

Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges AAAI 2025

TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos ACL 2025

Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above ACL 2025

L-CiteEval: A Suite for Evaluating Fidelity of Long-context Models ACL 2025

Probing the Mid-level Vision Capabilities of Self-Supervised Learning CVPR 2025

Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy AAAI 2025

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025