Evaluation

1654 directly classified papers

Papers

Measuring scalar constructs in social science with LLMs EMNLP 2025

Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss EMNLP 2025

Africa Health Check: Probing Cultural Bias in Medical LLMs EMNLP 2025

ThinkSLM: Towards Reasoning in Small Language Models EMNLP 2025

Batched Self-Consistency Improves LLM Relevance Assessment and Ranking EMNLP 2025

AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models EMNLP 2025

Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models EMNLP 2025

SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities EMNLP 2025

LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions EMNLP 2025

The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs EMNLP 2025

SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants? EMNLP 2025

Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist EMNLP 2025

AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories EMNLP 2025

PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation EMNLP 2025

o-MEGA: Optimized Methods for Explanation Generation and Analysis EMNLP 2025

TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs EMNLP 2025

SAGE: A Generic Framework for LLM Safety Evaluation EMNLP 2025

Truth, Trust, and Trouble: Medical AI on the Edge EMNLP 2025

InstaJudge: Aligning Judgment Bias of LLM-as-Judge with Humans in Industry Applications EMNLP 2025

From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes EMNLP 2025

Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation EMNLP 2025

Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution ACL 2025

SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation CVPR 2025

AveniBench: Accessible and Versatile Evaluation of Finance Intelligence COLING 2025

A Unified Interpretation of Training-Time Out-of-Distribution Detection ICCV 2025