Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
ACL 2025
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
ACL 2025
Disentangling Language and Culture for Evaluating Multilingual Large Language Models
ACL 2025
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
ACL 2025
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension
ACL 2025
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ACL 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
ACL 2025
Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
ACL 2025
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
ACL 2025
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models
ACL 2025
A Reality Check on Context Utilisation for Retrieval-Augmented Generation
ACL 2025
M-IFEval: Multilingual Instruction-Following Evaluation
NAACL 2025
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
ACL 2025
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
NAACL 2025
Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
ACL 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
IJCAI 2025
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
ACL 2025
Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
COLING 2025
“Stupid robot, I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems
COLING 2025
LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models
COLING 2025
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
ACL 2025
Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities
COLING 2025
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
ACL 2025
“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor
ACL 2025