Evaluation

1654 directly classified papers

Papers

ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords ACL 2025

M-IFEval: Multilingual Instruction-Following Evaluation NAACL 2025

CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models ACL 2025

SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science ACL 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL 2025

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation ACL 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives IJCAI 2025

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities ACL 2025

HumT DumT: Measuring and controlling human-like language in LLMs ACL 2025

Predicting Fine-tuned Performance on Larger Datasets Before Creating Them COLING 2025

Can External Validation Tools Improve Annotation Quality for LLM-as-a-Judge? ACL 2025

The Alternative Annotator Test for LLM-as-a-Judge: How to Statistically Justify Replacing Human Annotators with LLMs ACL 2025

Com²: A Causal-Guided Benchmark for Exploring Complex Commonsense Reasoning in Large Language Models ACL 2025

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation ACL 2025

BehaviorBox: Automated Discovery of Fine-Grained Performance Differences Between Language Models ACL 2025

A Reality Check on Context Utilisation for Retrieval-Augmented Generation ACL 2025

Not quite Sherlock Holmes: Language model predictions do not reliably differentiate impossible from improbable events ACL 2025