Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
Can LLMs Rank the Harmfulness of Smaller LLMs? We are Not There Yet
ACL 2025
ChatBench: From Static Benchmarks to Human-AI Evaluation
ACL 2025
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
NAACL 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
ACL 2025
SSA-COMET: Do LLMs Outperform Learned Metrics in Evaluating MT for Under-Resourced African Languages?
EMNLP 2025
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
ACL 2025
Graders Should Cheat: Privileged Information Enables Expert-Level Automated Evaluations
EMNLP 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
ACL 2025
RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models
EMNLP 2025
Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution
ACL 2025
NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines
EMNLP 2025
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
ACL 2025
MultiLogicNMR(er): A Benchmark and Neural-Symbolic Framework for Non-monotonic Reasoning with Multiple Extensions
EMNLP 2025
SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation
ACL 2025
InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles
EMNLP 2025
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
ACL 2025
RCScore: Quantifying Response Consistency in Large Language Models
EMNLP 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
ACL 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
IJCAI 2025
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
ACL 2025
UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models
EMNLP 2025
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
ACL 2025
We Need to Measure Data Diversity in NLP — Better and Broader
EMNLP 2025
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
ACL 2025
Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions
EMNLP 2025