Evaluation

1654 directly classified papers

Papers

SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs ACL 2025

D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models ACL 2025

MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset ACL 2025

Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance ACL 2025

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates ACL 2025

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation ACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA ACL 2025

LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers ACL 2025

REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? ACL 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

The Million Authors Corpus: A Cross-Lingual and Cross-Domain Wikipedia Dataset for Authorship Verification ACL 2025

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation ACL 2025

Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models ACL 2025

LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation ACL 2025

skLEP: A Slovak General Language Understanding Benchmark ACL 2025

HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models ACL 2025

Practical Solutions to Practical Problems in Developing Argument Mining Systems ACL 2025

LLM-based post-editing as reference-free GEC evaluation ACL 2025

Do LLMs Give Psychometrically Plausible Responses in Educational Assessments? ACL 2025

Don’t Score too Early! Evaluating Argument Mining Models on Incomplete Essays ACL 2025

Comparing Behavioral Patterns of LLM and Human Tutors: A Population-level Analysis with the CIMA Dataset ACL 2025

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors ACL 2025

Few-Shot Prompting, Full-Scale Confusion: Evaluating Large Language Models for Humor Detection in Croatian Tweets ACL 2025

Confounding Factors in Relating Model Performance to Morphology EMNLP 2025