Evaluation

150 directly classified papers

Papers

LLMs can be easily Confused by Instructional Distractions ACL 2025

AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark ACL 2025

The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation ACL 2025

CogniBench: A Legal-inspired Framework and Dataset for Assessing Cognitive Faithfulness of Large Language Models ACL 2025

CoreEval: Automatically Building Contamination-Resilient Datasets with Real-World Knowledge toward Reliable LLM Evaluation ACL 2025

“What do you call a dog that is incontrovertibly true? Dogma”: Testing LLM Generalization through Humor ACL 2025

Benchmarking Long-Context Language Models on Long Code Understanding ACL 2025

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability ACL 2025

FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation ACL 2025

EditInspector: A Benchmark for Evaluation of Text-Guided Image Edits ACL 2025

Beyond Text Compression: Evaluating Tokenizers Across Scales ACL 2025

Where Are We? Evaluating LLM Performance on African Languages ACL 2025

WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging ACL 2025

ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty NeurIPS 2024

LooGLE: Can Long-Context Language Models Understand Long Contexts? ACL 2024

SceMQA: A Scientific College Entrance Level Multimodal Question Answering Benchmark ACL 2024

EconNLI: Evaluating Large Language Models on Economics Reasoning ACL 2024

Realistic Evaluation of Toxicity in Large Language Models ACL 2024

GeoEval: Benchmark for Evaluating LLMs and Multi-Modal Models on Geometry Problem-Solving ACL 2024

Uncovering Limitations of Large Language Models in Information Seeking from Tables ACL 2024

Merging Facts, Crafting Fallacies: Evaluating the Contradictory Nature of Aggregated Factual Claims in Long-Form Generations ACL 2024

Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future ACL 2024

Bias in News Summarization: Measures, Pitfalls and Corpora ACL 2024

Exploring Defeasibility in Causal Reasoning ACL 2024

GAOKAO-MM: A Chinese Human-Level Benchmark for Multimodal Models Evaluation ACL 2024