Evaluation

1654 directly classified papers

Papers

FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation ACL 2025

Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition? EMNLP 2025

SHARP: Unlocking Interactive Hallucination via Stance Transfer in Role-Playing LLMs ACL 2025

Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities EMNLP 2025

D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Models ACL 2025

The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT EMNLP 2025

MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset ACL 2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models EMNLP 2025

Random Splitting Negatively Impacts NER Evaluation: Quantifying and Eliminating the Overestimation of NER Performance ACL 2025

Predicting Fine-tuned Performance on Larger Datasets Before Creating Them COLING 2025

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates ACL 2025

Agent-as-Judge for Factual Summarization of Long Narratives EMNLP 2025

A Tale of Evaluating Factual Consistency: Case Study on Long Document Summarization Evaluation ACL 2025

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? ACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025

Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss EMNLP 2025

Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA ACL 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

LADDER: Language-Driven Slice Discovery and Error Rectification in Vision Classifiers ACL 2025

AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models EMNLP 2025

REPRO-Bench: Can Agentic AI Systems Assess the Reproducibility of Social Science Research? ACL 2025

Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models ACL 2025

Human Bias in the Face of AI: Examining Human Judgment Against Text Labeled as AI Generated ACL 2025

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs EMNLP 2025

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes EMNLP 2025