Evaluation

1654 directly classified papers

Papers

Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation EMNLP 2025

Calibrating LLM Confidence by Probing Perturbed Representation Stability EMNLP 2025

LLMs cannot spot math errors, even when allowed to peek into the solution EMNLP 2025

Long-Form Information Alignment Evaluation Beyond Atomic Facts EMNLP 2025

SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages? EMNLP 2025

Towards Optimal Evaluation Efficiency for Large Language Models EMNLP 2025

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation EMNLP 2025

SSA: Semantic Contamination of LLM-Driven Fake News Detection EMNLP 2025

DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors EMNLP 2025

CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists EMNLP 2025

Can LLMs simulate the same correct solutions to free-response math problems as real students? EMNLP 2025

TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs EMNLP 2025

Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations EMNLP 2025

Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation EMNLP 2025

RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models EMNLP 2025

SciEvent: Benchmarking Multi-domain Scientific Event Extraction EMNLP 2025

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios EMNLP 2025

Evaluating and Aligning Human Economic Risk Preferences in LLMs EMNLP 2025

STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models EMNLP 2025

MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions EMNLP 2025

Women, Infamous, and Exotic Beings: A Comparative Study of Honorific Usages in Wikipedia and LLMs for Bengali and Hindi EMNLP 2025

Can LLMs Generate and Solve Linguistic Olympiad Puzzles? EMNLP 2025

MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Correction EMNLP 2025

Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities EMNLP 2025

Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference NAACL 2025