Evaluation
150 directly classified papers
Papers per year
2016: 1
2019: 4
2020: 3
2021: 9
2022: 11
2023: 19
2024: 40
2025: 63
Papers
LLMs can be easily Confused by Instructional Distractions
ACL 2025
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
ACL 2025
The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation
ACL 2025
CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models
ACL 2025
CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation
ACL 2025
“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor
ACL 2025
Benchmarking Long-Context Language Models on Long Code Understanding
ACL 2025
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
ACL 2025
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
ACL 2025
EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits
ACL 2025
Beyond Text Compression: Evaluating Tokenizers Across Scales
ACL 2025
Where Are We? Evaluating LLM Performance on African Languages
ACL 2025
WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
ACL 2025
ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty
NeurIPS 2024
LooGLE: Can Long-Context Language Models Understand Long Contexts?
ACL 2024
SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark
ACL 2024
EconNLI: Evaluating Large Language Models on Economics Reasoning
ACL 2024
Realistic Evaluation of Toxicity in Large Language Models
ACL 2024
GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving
ACL 2024
Uncovering Limitations of Large Language Models in Information Seeking from Tables
ACL 2024
Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations
ACL 2024
Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future
ACL 2024
Bias in News Summarization: Measures, Pitfalls and Corpora
ACL 2024
Exploring Defeasibility in Causal Reasoning
ACL 2024
GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation
ACL 2024