Evaluation

515 directly classified papers

Papers

BOSE: A Systematic Evaluation Method Optimized for Base Models ACL 2025

100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability? ACL 2025

Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models ACL 2025

Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses EMNLP 2025

Reproducing the Argument Quality Prediction of Project Debater ACL 2025

Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension? ACL 2025

LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA ACL 2025

Noise-Aware Evaluation of Object Detectors WACV 2025

KazBench-KK: A Cultural-Knowledge Benchmark for Kazakh ACL 2025

Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons ACL 2025

Evaluating LLMs with Multiple Problems at once ACL 2025

(Towards) Scalable Reliable Automated Evaluation with Large Language Models ACL 2025

ReproHum #0744-02: A Reproduction of the Human Evaluation of Meaning Preservation in “Factorising Meaning and Form for Intent-Preserving Paraphrasing” ACL 2025

Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics ACL 2025

An Analysis of Datasets, Metrics and Models in Keyphrase Generation ACL 2025

Findings of the IWSLT 2025 Evaluation Campaign ACL 2025

Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations ACL 2025

The UNLP 2025 Shared Task on Detecting Social Media Manipulation ACL 2025

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon EMNLP 2025

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning ACL 2025

M-IFEval: Multilingual Instruction-Following Evaluation NAACL 2025

Logical forms complement probability in understanding language model (and human) performance ACL 2025

From Parameters to Performance: A Data-Driven Study on LLM Structure and Development EMNLP 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation NAACL 2025