Research Explorer
Evaluation
1654 directly classified papers
Papers per year
2005: 1
2006: 1
2007: 1
2008: 2
2009: 1
2010: 3
2011: 2
2012: 3
2013: 5
2014: 4
2015: 4
2016: 11
2017: 19
2018: 32
2019: 39
2020: 72
2021: 110
2022: 202
2023: 222
2024: 351
2025: 569
Papers
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
ACL 2025
Unveiling Cultural Blind Spots: Analyzing the Limitations of mLLMs in Procedural Text Comprehension
ACL 2025
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
ACL 2025
Disentangling Language and Culture for Evaluating Multilingual Large Language Models
ACL 2025
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
ACL 2025
MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation
ACL 2025
“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor
ACL 2025
HalluLens: LLM Hallucination Benchmark
ACL 2025
Towards Fully Exploiting LLM Internal States to Enhance Knowledge Boundary Perception
ACL 2025
Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
ACL 2025
Response Wide Shut? Surprising Observations in Basic Vision Language Model Capabilities
ACL 2025
HumT DumT: Measuring and controlling human-like language in LLMs
ACL 2025
ChatBench: From Static Benchmarks to Human-AI Evaluation
ACL 2025
Anything Goes? A Crosslinguistic Study of (Im)possible Language Learning in LMs
ACL 2025
QG-SMS: Enhancing Test Item Analysis via Student Modeling and Simulation
ACL 2025
InductionBench: LLMs Fail in the Simplest Complexity Class
ACL 2025
Towards Robust Universal Information Extraction: Dataset, Evaluation, and Solution
ACL 2025
ConLoan: A Contrastive Multilingual Dataset for Evaluating Loanwords
ACL 2025
SubLIME: Subset Selection via Rank Correlation Prediction for Data-Efficient LLM Evaluation
ACL 2025
ETF: An Entity Tracing Framework for Hallucination Detection in Code Summaries
ACL 2025
CCHall: A Novel Benchmark for Joint Cross-Lingual and Cross-Modal Hallucinations Detection in Large Language Models
ACL 2025
Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories
ACL 2025
SeedBench: A Multi-task Benchmark for Evaluating Large Language Models in Seed Science
ACL 2025
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
ACL 2025
CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization
ACL 2025