Evaluation

515 directly classified papers

Papers

QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation EMNLP 2025

Contextual Metric Meta-Evaluation by Measuring Local Metric Accuracy NAACL 2025

Investigating Value-Reasoning Reliability in Small Large Language Models EMNLP 2025

Are Checklists Really Useful for Automatic Evaluation of Generative Tasks? EMNLP 2025

Benchmark Profiling: Mechanistic Diagnosis of LLM Benchmarks EMNLP 2025

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions EMNLP 2025

Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs EMNLP 2025

Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models EMNLP 2025

EIFBENCH: Extremely Complex Instruction Following Benchmark for Large Language Models EMNLP 2025

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon EMNLP 2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models EMNLP 2025

In Benchmarks We Trust ... Or Not? EMNLP 2025

Evaluating the Evaluators: Are readability metrics good measures of readability? EMNLP 2025

TFDP: Token-Efficient Disparity Audits for Autoregressive LLMs via Single-Token Masked Evaluation EMNLP 2025

From Parameters to Performance: A Data-Driven Study on LLM Structure and Development EMNLP 2025

Model Consistency as a Cheap yet Predictive Proxy for LLM Elo Scores EMNLP 2025

Estimating LLM Consistency: A User Baseline vs Surrogate Metrics EMNLP 2025

Revealing and Mitigating the Challenge of Detecting Character Knowledge Errors in LLM Role-Playing EMNLP 2025

Benchmarking LLMs on Semantic Overlap Summarization EMNLP 2025

KRETA: A Benchmark for Korean Reading and Reasoning in Text-Rich VQA Attuned to Diverse Visual Contexts EMNLP 2025

TounsiBench: Benchmarking Large Language Models for Tunisian Arabic EMNLP 2025

Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards EMNLP 2025

GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems EMNLP 2025

Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses EMNLP 2025

Feeding Two Birds or Favoring One? Adequacy–Fluency Tradeoffs in Evaluation and Meta-Evaluation of Machine Translation EMNLP 2025