Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Measuring scalar constructs in social science with LLMs
EMNLP 2025
Not What the Doctor Ordered: Surveying LLM-based De-identification and Quantifying Clinical Information Loss
EMNLP 2025
Africa Health Check: Probing Cultural Bias in Medical LLMs
EMNLP 2025
ThinkSLM: Towards Reasoning in Small Language Models
EMNLP 2025
Batched Self-Consistency Improves LLM Relevance Assessment and Ranking
EMNLP 2025
AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
EMNLP 2025
Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models
EMNLP 2025
SYNC: A Synthetic Long-Context Understanding Benchmark for Controlled Comparisons of Model Capabilities
EMNLP 2025
LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions
EMNLP 2025
The Illusion of Progress: Re-evaluating Hallucination Detection in LLMs
EMNLP 2025
SimulatorArena: Are User Simulators Reliable Proxies for Multi-Turn Evaluation of AI Assistants?
EMNLP 2025
Synthetic Data for Evaluation: Supporting LLM-as-a-Judge Workflows with EvalAssist
EMNLP 2025
AgentDiagnose: An Open Toolkit for Diagnosing LLM Agent Trajectories
EMNLP 2025
PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation
EMNLP 2025
o-MEGA: Optimized Methods for Explanation Generation and Analysis
EMNLP 2025
TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs
EMNLP 2025
SAGE: A Generic Framework for LLM Safety Evaluation
EMNLP 2025
Truth, Trust, and Trouble: Medical AI on the Edge
EMNLP 2025
InstaJudge: Aligning Judgment Bias of LLM-as-Judge with Humans in Industry Applications
EMNLP 2025
From Feedback to Checklists: Grounded Evaluation of AI-Generated Clinical Notes
EMNLP 2025
Time Is Effort: Estimating Human Post-Editing Time for Grammar Error Correction Tool Evaluation
EMNLP 2025
Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution
ACL 2025
SVLTA: Benchmarking Vision-Language Temporal Alignment via Synthetic Video Situation
CVPR 2025
AveniBench: Accessible and Versatile Evaluation of Finance Intelligence
COLING 2025
A Unified Interpretation of Training-Time Out-of-Distribution Detection
ICCV 2025