Evaluation
150 directly classified papers
Papers per year
2016: 1
2019: 4
2020: 3
2021: 9
2022: 11
2023: 19
2024: 40
2025: 63
Papers
A Unified Agentic Framework for Evaluating Conditional Image Generation
ACL 2025
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
ACL 2025
Mind the Gap: Static and Interactive Evaluations of Large Audio Models
ACL 2025
SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation
ACL 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025
The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation
ACL 2025
The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project
ACL 2025
Unanswerability Evaluation for Retrieval Augmented Generation
ACL 2025
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models
ACL 2025
BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian
ACL 2025
SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View
ACL 2025
Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents
ACL 2025
Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
CVPR 2025
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
ACL 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
EMNLP 2025
Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge
ACL 2025
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
ICCV 2025
Improving Model Factuality with Fine-grained Critique-based Evaluator
ACL 2025
LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems
NAACL 2025
M-RewardBench: Evaluating Reward Models in Multilingual Settings
ACL 2025
A Comprehensive Evaluation on Event Reasoning of Large Language Models
AAAI 2025
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
ACL 2025
SEAL: Systematic Error Analysis for Value ALignment
AAAI 2025
EvolveBench: A Comprehensive Benchmark for Assessing Temporal Awareness in LLMs on Evolving Knowledge
ACL 2025
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
ACL 2025