Evaluation

1654 directly classified papers

Papers

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation ACL 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

Disentangling Language and Culture for Evaluating Multilingual Large Language Models ACL 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL 2025

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items ACL 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models ACL 2025

A Reality Check on Context Utilisation for Retrieval-Augmented Generation ACL 2025

M-IFEval: Multilingual Instruction-Following Evaluation NAACL 2025

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models ACL 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives IJCAI 2025

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? ACL 2025

Predicting Fine-tuned Performance on Larger Datasets Before Creating Them COLING 2025

“Stupid robot, I want to speak to a human!” User Frustration Detection in Task-Oriented Dialog Systems COLING 2025

LLM Evaluate: An Industry-Focused Evaluation Tool for Large Language Models COLING 2025

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025

Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities COLING 2025

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation ACL 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025