Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
ACL 2025
Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition?
EMNLP 2025
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs
ACL 2025
Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities
EMNLP 2025
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models
ACL 2025
The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT
EMNLP 2025
MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
ACL 2025
From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models
EMNLP 2025
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance
ACL 2025
Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
COLING 2025
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
ACL 2025
Agent-as-Judge for Factual Summarization of Long Narratives
EMNLP 2025
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
ACL 2025
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
ACL 2025
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events
ACL 2025
Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss
EMNLP 2025
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
ACL 2025
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
ACL 2025
LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers
ACL 2025
AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
EMNLP 2025
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
ACL 2025
Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
ACL 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
EMNLP 2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
EMNLP 2025