Evaluation
345 directly classified papers
Papers per year
2014: 1
2016: 3
2017: 1
2018: 9
2019: 21
2020: 34
2021: 32
2022: 50
2023: 28
2024: 90
2025: 76
Papers
U-MATH: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models
ACL 2025
Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges
AAAI 2025
How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching
AAAI 2025
A Unifying Information-theoretic Perspective on Evaluating Generative Models
AAAI 2025
RESQUE: Quantifying Estimator to Task and Distribution Shift for Sustainable Model Reusability
AAAI 2025
Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy
AAAI 2025
SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
AAAI 2025
Training on the Benchmark Is Not All You Need
AAAI 2025
Exploring Activation Patterns of Parameters in Language Models
AAAI 2025
FROC: Building Fair ROC from a Trained Classifier
AAAI 2025
Redundancy Principles for MLLMs Benchmarks
ACL 2025
Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
ACL 2025
Towards Harmonized Uncertainty Estimation for Large Language Models
ACL 2025
JuStRank: Benchmarking LLM Judges for System Ranking
ACL 2025
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
ACL 2025
ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities
ACL 2025
Optimization before Evaluation: Evaluation with Unoptimized Prompts Can be Misleading
ACL 2025
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
ACL 2025
ARC ‘Challenge’ Is Not That Challenging
ACL 2025
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability
ACL 2025
LR²Bench: Evaluating Long-chain Reflective Reasoning Capabilities of Large Language Models via Constraint Satisfaction Problems
ACL 2025
Probing the Mid-level Vision Capabilities of Self-Supervised Learning
CVPR 2025
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
ACL 2025
L-CiteEval: A Suite for Evaluating Fidelity of Long-context Models
ACL 2025
IDEA-Bench: How Far are Generative Models from Professional Designing?
CVPR 2025