Research Explorer
Evaluation
345 directly classified papers
Papers per year
2014: 1
2016: 3
2017: 1
2018: 9
2019: 21
2020: 34
2021: 32
2022: 50
2023: 28
2024: 90
2025: 76
Papers
Data Laundering: Artificially Boosting Benchmark Results through Knowledge Distillation
ACL 2025
Redundancy Principles for MLLMs Benchmarks
ACL 2025
JuStRank: Benchmarking LLM Judges for System Ranking
ACL 2025
When the LM misunderstood the human chuckled: Analyzing garden path effects in humans and language models
ACL 2025
Praetor: A Fine-Grained Generative LLM Evaluator with Instance-Level Customizable Evaluation Criteria
ACL 2025
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
ACL 2025
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
ACL 2025
Exploring Activation Patterns of Parameters in Language Models
AAAI 2025
Gender Inclusivity Fairness Index (GIFI): A Multilevel Framework for Evaluating Gender Diversity in Large Language Models
ACL 2025
Unsolvable Problem Detection: Robust Understanding Evaluation for Large Multimodal Models
ACL 2025
TripleFact: Defending Data Contamination in the Evaluation of LLM-driven Fake News Detection
ACL 2025
A Unifying Information-theoretic Perspective on Evaluating Generative Models
AAAI 2025
Image Over Text: Transforming Formula Recognition Evaluation with Character Detection Matching
CVPR 2025
RESQUE: Quantifying Estimator to Task and Distribution Shift for Sustainable Model Reusability
AAAI 2025
How Not to Stitch Representations to Measure Similarity: Task Loss Matching Versus Direct Matching
AAAI 2025
FROC: Building Fair ROC from a Trained Classifier
AAAI 2025
SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses
AAAI 2025
Training on the Benchmark Is Not All You Need
AAAI 2025
Hypergraph Attacks via Injecting Homogeneous Nodes into Elite Hyperedges
AAAI 2025
TUNA: Comprehensive Fine-grained Temporal Understanding Evaluation on Dense Dynamic Videos
ACL 2025
Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
ACL 2025
L-CiteEval: A Suite for Evaluating Fidelity of Long-context Models
ACL 2025
Probing the Mid-level Vision Capabilities of Self-Supervised Learning
CVPR 2025
Towards Precise Prediction Uncertainty in GNNs: Refining GNNs with Topology-grouping Strategy
AAAI 2025
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
ACL 2025