Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
REPA: Russian Error Types Annotation for Evaluating Text Generation and Judgment Capabilities
ACL 2025
Bridging AI and Carbon Capture: A Dataset for LLMs in Ionic Liquids and CBE Research
ACL 2025
ARGENT: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs
ACL 2025
Are LLMs (Really) Ideological? An IRT-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in LLMs
ACL 2025
Can LLMs Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation?
ACL 2025
Are Bias Evaluation Methods Biased ?
ACL 2025
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
ACL 2025
HuGME: A benchmark system for evaluating Hungarian generative LLMs
ACL 2025
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
ACL 2025
Investigating the Robustness of Retrieval-Augmented Generation at the Query Level
ACL 2025
Evaluating the Quality of Benchmark Datasets for Low-Resource Languages: A Case Study on Turkish
ACL 2025
PapersPlease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on ERG Theory
ACL 2025
ReproHum #0031-01: Reproducing the Human Evaluation of Readability from “It is AI’s Turn to Ask Humans a Question”
ACL 2025
ReproHum #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective
ACL 2025
ReproHum #0669-08: Reproducing Sentiment Transfer Evaluation
ACL 2025
Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework
ACL 2025
ArithmAttack: Evaluating Robustness of LLMs to Noisy Context in Math Problem Solving
ACL 2025
Do Large Language Models Learn Human-Like Strategic Preferences?
ACL 2025
FrontierScience Bench: Evaluating AI Research Capabilities in LLMs
ACL 2025
DecepBench: Benchmarking Multimodal Deception Detection
ACL 2025
PROTECT: Policy-Related Organizational Value Taxonomy for Ethical Compliance and Trust
ACL 2025
Analyzing the Linguistic Priors of Language Models with Synthetic Languages
ACL 2025
Something’s Fishy in the Data Lake: A Critical Re-evaluation of Table Union Search Benchmarks
ACL 2025
Ask Me Like I’m Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges
ACL 2025
MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions
EMNLP 2025