Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Benchmarking Large Language Models Under Data Contamination: A Survey from Static to Dynamic Evaluation
EMNLP 2025
Calibrating LLM Confidence by Probing Perturbed Representation Stability
EMNLP 2025
LLMs cannot spot math errors, even when allowed to peek into the solution
EMNLP 2025
Long-Form Information Alignment Evaluation Beyond Atomic Facts
EMNLP 2025
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
EMNLP 2025
Towards Optimal Evaluation Efficiency for Large Language Models
EMNLP 2025
Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation
EMNLP 2025
SSA: Semantic Contamination of LLM-Driven Fake News Detection
EMNLP 2025
DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors
EMNLP 2025
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
EMNLP 2025
Can LLMs simulate the same correct solutions to free-response math problems as real students?
EMNLP 2025
TurBLiMP: A Turkish Benchmark of Linguistic Minimal Pairs
EMNLP 2025
Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations
EMNLP 2025
Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation
EMNLP 2025
RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models
EMNLP 2025
SciEvent: Benchmarking Multi-domain Scientific Event Extraction
EMNLP 2025
Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
EMNLP 2025
Evaluating and Aligning Human Economic Risk Preferences in LLMs
EMNLP 2025
STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models
EMNLP 2025
MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions
EMNLP 2025
Women, Infamous, and Exotic Beings: A Comparative Study of Honorific Usages in Wikipedia and LLMs for Bengali and Hindi
EMNLP 2025
Can LLMs Generate and Solve Linguistic Olympiad Puzzles?
EMNLP 2025
MolErr2Fix: Benchmarking LLM Trustworthiness in Chemistry via Modular Error Detection, Localization, Explanation, and Correction
EMNLP 2025
Shared Path: Unraveling Memorization in Multilingual LLMs through Language Similarities
EMNLP 2025
Re-evaluating Automatic LLM System Ranking for Alignment with Human Preference
NAACL 2025