Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
ACL 2025
M-IFEval: Multilingual Instruction-Following Evaluation
NAACL 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
ACL 2025
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
ACL 2025
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
ACL 2025
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
NAACL 2025
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
ACL 2025
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
ACL 2025
“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor
ACL 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
IJCAI 2025
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension
ACL 2025
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities
ACL 2025
HumT DumT: Measuring and controlling human-like language in LLMs
ACL 2025
Predicting Fine-tuned Performance on Larger Datasets Before Creating Them
COLING 2025
Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge?
ACL 2025
The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs
ACL 2025
Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models
ACL 2025
LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
ACL 2025
Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling
ACL 2025
Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items
ACL 2025
Do not Abstain! Identify and Solve the Uncertainty
ACL 2025
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
ACL 2025
BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models
ACL 2025
A Reality Check on Context Utilisation for Retrieval-Augmented Generation
ACL 2025
Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events
ACL 2025