Evaluation

1654 directly classified papers

Papers

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? ACL 2025

Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension ACL 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL 2025

Disentangling Language and Culture for Evaluating Multilingual Large Language Models ACL 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation ACL 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025

HalluLens: LLM Hallucination Benchmark ACL 2025

Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception ACL 2025

Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ ACL 2025

Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities ACL 2025

HumT DumT: Measuring and controlling human-like language in LLMs ACL 2025

ChatBench: From Static Benchmarks to Human-AI Evaluation ACL 2025

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs ACL 2025

QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation ACL 2025

InductionBench: LLMs Fail in the Simplest Complexity Class ACL 2025

Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution ACL 2025

ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords ACL 2025

SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation ACL 2025

ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries ACL 2025

CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models ACL 2025

Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories ACL 2025

SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science ACL 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization ACL 2025