Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs
ACL 2025
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models
ACL 2025
MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
ACL 2025
Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance
ACL 2025
A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates
ACL 2025
A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation
ACL 2025
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events
ACL 2025
Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
ACL 2025
LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers
ACL 2025
REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research?
ACL 2025
Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated
ACL 2025
The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship Verification
ACL 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
ACL 2025
Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models
ACL 2025
LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation
ACL 2025
skLEP: A Slovak General Language Understanding Benchmark
ACL 2025
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
ACL 2025
Practical Solutions to Practical Problems in Developing Argument Mining Systems
ACL 2025
LLM-based post-editing as reference-free GEC evaluation
ACL 2025
Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?
ACL 2025
Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays
ACL 2025
Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset
ACL 2025
Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
ACL 2025
Few-Shot Prompting, Full-Scale Confusion: Evaluating Large Language Models for Humor Detection in Croatian Tweets
ACL 2025
Confounding Factors in Relating Model Performance to Morphology
EMNLP 2025