Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
ACL 2025
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
ACL 2025
RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models
EMNLP 2025
Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models
ACL 2025
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
ACL 2025
LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation
ACL 2025
Noise-Aware Evaluation of Object Detectors
WACV 2025
skLEP: A Slovak General Language Understanding Benchmark
ACL 2025
Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
ACL 2025
HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
ACL 2025
BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language
COLING 2025
Practical Solutions to Practical Problems in Developing Argument Mining Systems
ACL 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
LLM-based post-editing as reference-free GEC evaluation
ACL 2025
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
EMNLP 2025
RCScore: Quantifying Response Consistency in Large Language Models
EMNLP 2025
OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain
EMNLP 2025
Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons
EMNLP 2025
NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines
EMNLP 2025
Confounding Factors in Relating Model Performance to Morphology
EMNLP 2025
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
EMNLP 2025
PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization
EMNLP 2025
We Need to Measure Data Diversity in NLP — Better and Broader
EMNLP 2025
FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback
EMNLP 2025
FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task
COLING 2025