Evaluation

150 directly classified papers

Papers

A Unified Agentic Framework for Evaluating Conditional Image Generation ACL 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL 2025

Mind the Gap: Static and Interactive Evaluations of Large Audio Models ACL 2025

SPHERE: Unveiling Spatial Blind Spots in Vision-Language Models Through Hierarchical Evaluation ACL 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation ACL 2025

The UD-NewsCrawl Treebank: Reflections and Challenges from a Large-scale Tagalog Syntactic Annotation Project ACL 2025

Unanswerability Evaluation for Retrieval Augmented Generation ACL 2025

Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models ACL 2025

BelarusianGLUE: Towards a Natural Language Understanding Benchmark for Belarusian ACL 2025

SCOP: Evaluating the Comprehension Process of Large Language Models from a Cognitive View ACL 2025

Agent-RewardBench: Towards a Unified Benchmark for Reward Modeling across Perception, Planning, and Safety in Real-World Multimodal Agents ACL 2025

Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects CVPR 2025

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ACL 2025

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation EMNLP 2025

Crowd Comparative Reasoning: Unlocking Comprehensive Evaluations for LLM-as-a-Judge ACL 2025

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach ICCV 2025

Improving Model Factuality with Fine-grained Critique-based Evaluator ACL 2025

LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems NAACL 2025

M-RewardBench: Evaluating Reward Models in Multilingual Settings ACL 2025

A Comprehensive Evaluation on Event Reasoning of Large Language Models AAAI 2025

FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models ACL 2025

SEAL: Systematic Error Analysis for Value ALignment AAAI 2025

EvolveBench: A Comprehensive Benchmark for Assessing Temporal Awareness in LLMs on Evolving Knowledge ACL 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025