Evaluation

1654 directly classified papers

Papers

Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet ACL 2025

ChatBench: From Static Benchmarks to Human-AI Evaluation ACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs ACL 2025

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? EMNLP 2025

QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation ACL 2025

Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations EMNLP 2025

InductionBench: LLMs Fail in the Simplest Complexity Class ACL 2025

RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models EMNLP 2025

Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution ACL 2025

NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines EMNLP 2025

ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords ACL 2025

MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions EMNLP 2025

SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation ACL 2025

InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles EMNLP 2025

ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries ACL 2025

RCScore: Quantifying Response Consistency in Large Language Models EMNLP 2025

CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models ACL 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives IJCAI 2025

Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories ACL 2025

UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models EMNLP 2025

SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science ACL 2025

We Need to Measure Data Diversity in NLP — Better and Broader EMNLP 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions EMNLP 2025