Evaluation

345 directly classified papers

Papers

U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models ACL 2025

Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges AAAI 2025

How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching AAAI 2025

A Unifying Information-theoretic Perspective on Evaluating Generative Models AAAI 2025

RESQUE: Quantifying Estimator to Task and Distribution Shift for Sustainable Model Reusability AAAI 2025

Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy AAAI 2025

SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses AAAI 2025

Training on the Benchmark Is Not All You Need AAAI 2025

Exploring Activation Patterns of Parameters in Language Models AAAI 2025

FROC: Building Fair ROC from a Trained Classifier AAAI 2025

Redundancy Principles for MLLMs Benchmarks ACL 2025

Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration ACL 2025

Towards Harmonized Uncertainty Estimation for Large Language Models ACL 2025

JuStRank: Benchmarking LLM Judges for System Ranking ACL 2025

Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models ACL 2025

ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities ACL 2025

Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading ACL 2025

TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos ACL 2025

ARC ‘Challenge’ Is Not That Challenging ACL 2025

Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability ACL 2025

LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems ACL 2025

Probing the Mid-level Vision Capabilities of Self-Supervised Learning CVPR 2025

Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models ACL 2025

L-CiteEval: A Suite for Evaluating Fidelity of Long-context Models ACL 2025

IDEA-Bench: How Far are Generative Models from Professional Designing? CVPR 2025