Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
Tracking the evolution of LLM capabilities for Belarusian with OpenAI Evals
EACL 2026
Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Frechet Distance
WACV 2026
CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores
WACV 2026
BIG-Bench Extra Hard
ACL 2025
Identifying Open Challenges in Language Identification
ACL 2025
Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models
ACL 2025
Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
ACL 2025
HalluLens: LLM Hallucination Benchmark
ACL 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
ACL 2025
A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models
ACL 2025
Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
ACL 2025
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
NAACL 2025
Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
ACL 2025
Trade-offs in Image Generation: How Do Different Dimensions Interact?
ICCV 2025
Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives
IJCAI 2025
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
ACL 2025
A Pipeline to Assess Merging Methods via Behavior and Internals
EMNLP 2025
Logical forms complement probability in understanding language model (and human) performance
ACL 2025
Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering
EMNLP 2025
Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics
EMNLP 2025
Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems
COLING 2025
Using Next Sentence Prediction to Test ChatGPT’s Text Comprehension (Student Abstract)
AAAI 2025
M-IFEval: Multilingual Instruction-Following Evaluation
NAACL 2025
Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities
COLING 2025
LLMs can be easily Confused by Instructional Distractions
ACL 2025