Evaluation

1654 directly classified papers

Papers

REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities ACL 2025

Bridging AI and Carbon Capture: A Dataset for LLMs in Ionic Liquids and CBE Research ACL 2025

ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs ACL 2025

Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs ACL 2025

Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? ACL 2025

Are Bias Evaluation Methods Biased? ACL 2025

CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization ACL 2025

HuGME: A benchmark system for evaluating Hungarian generative LLMs ACL 2025

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges ACL 2025

Investigating the Robustness of Retrieval-Augmented Generation at the Query Level ACL 2025

Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish ACL 2025

PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory ACL 2025

ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question” ACL 2025

ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective ACL 2025

ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation ACL 2025

Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework ACL 2025

ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving ACL 2025

Do Large Language Models Learn Human-Like Strategic Preferences? ACL 2025

FrontierScience Bench: Evaluating AI Research Capabilities in LLMs ACL 2025

DecepBench: Benchmarking Multimodal Deception Detection ACL 2025

PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust ACL 2025

Analyzing the Linguistic Priors of Language Models with Synthetic Languages ACL 2025

Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks ACL 2025

Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges ACL 2025

MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions EMNLP 2025