Evaluation

515 directly classified papers

Papers

Tracking the evolution of LLM capabilities for Belarusian with OpenAI Evals EACL 2026

Evaluating Text-to-Image and Text-to-Video Synthesis with a Conditional Frechet Distance WACV 2026

CAST: Evaluating Multi-Object Trackers with Context-Aware Switch and Transfer Scores WACV 2026

BIG-Bench Extra Hard ACL 2025

Identifying Open Challenges in Language Identification ACL 2025

Chinese SimpleQA: A Chinese Factuality Evaluation for Large Language Models ACL 2025

Rubrik’s Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset ACL 2025

HalluLens: LLM Hallucination Benchmark ACL 2025

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning ACL 2025

A Statistical and Multi-Perspective Revisiting of the Membership Inference Attack in Large Language Models ACL 2025

Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs ACL 2025

Are Small Language Models Ready to Compete with Large Language Models for Practical Applications? NAACL 2025

Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization ACL 2025

Trade-offs in Image Generation: How Do Different Dimensions Interact? ICCV 2025

Image Captioning Evaluation in the Age of Multimodal LLMs: Challenges and Future Perspectives IJCAI 2025

Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments ACL 2025

A Pipeline to Assess Merging Methods via Behavior and Internals EMNLP 2025

Logical forms complement probability in understanding language model (and human) performance ACL 2025

Towards Evaluation of Language Models with Skill Dimensions: A Case Study on Narrative Question Answering EMNLP 2025

Readability Reconsidered: A Cross-Dataset Analysis of Reference-Free Metrics EMNLP 2025

Know Your RAG: Dataset Taxonomy and Generation Strategies for Evaluating RAG Systems COLING 2025

Using Next Sentence Prediction to Test ChatGPT’s Text Comprehension (Student Abstract) AAAI 2025

M-IFEval: Multilingual Instruction-Following Evaluation NAACL 2025

Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities COLING 2025

LLMs can be easily Confused by Instructional Distractions ACL 2025