Evaluation

515 directly classified papers

Papers

Identifying Open Challenges in Language Identification ACL 2025

Open Ko-LLM Leaderboard2: Bridging Foundational and Practical Evaluation for Korean LLMs NAACL 2025

Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization ACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models ACL 2025

LLMs Are Not Intelligent Thinkers: Introducing Mathematical Topic Tree Benchmark for Comprehensive Evaluation of LLMs NAACL 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025

Advocating Character Error Rate for Multilingual ASR Evaluation NAACL 2025

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models ACL 2025

Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions NAACL 2025

Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset ACL 2025

GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems EMNLP 2025

HalluLens: LLM Hallucination Benchmark ACL 2025

A Conformal Risk Control Framework for Granular Word Assessment and Uncertainty Calibration of CLIPScore Quality Estimates ACL 2025

PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models ACL 2025

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards EMNLP 2025

On Many-Shot In-Context Learning for Long-Context Evaluation ACL 2025

RCI: A Score for Evaluating Global and Local Reasoning in Multimodal Benchmarks EMNLP 2025

BIG-Bench Extra Hard ACL 2025

Trade-offs in Image Generation: How Do Different Dimensions Interact? ICCV 2025

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability ACL 2025

Accurate Estimation of Feature Importance Faithfulness for Tree Models AAAI 2025

Language Model Probabilities are Not Calibrated in Numeric Contexts ACL 2025

Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses EMNLP 2025

Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models COLING 2025