Evaluation

1654 directly classified papers

Papers

LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios ACL 2025

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation ACL 2025

RoDEval: A Robust Word Sense Disambiguation Evaluation Framework for Large Language Models EMNLP 2025

Beyond Context to Cognitive Appraisal: Emotion Reasoning as a Theory of Mind Benchmark for Large Language Models ACL 2025

Mis-prompt: Benchmarking Large Language Models for Proactive Error Handling ACL 2025

LLMs are Biased Evaluators But Not Biased for Fact-Centric Retrieval Augmented Generation ACL 2025

Noise-Aware Evaluation of Object Detectors WACV 2025

skLEP: A Slovak General Language Understanding Benchmark ACL 2025

Value Portrait: Assessing Language Models’ Values through Psychometrically and Ecologically Valid Items ACL 2025

HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models ACL 2025

BasqBBQ: A QA Benchmark for Assessing Social Biases in LLMs for Basque, a Low-Resource Language COLING 2025

Practical Solutions to Practical Problems in Developing Argument Mining Systems ACL 2025

Do not Abstain! Identify and Solve the Uncertainty ACL 2025

LLM-based post-editing as reference-free GEC evaluation ACL 2025

InMind: Evaluating LLMs in Capturing and Applying Individual Human Reasoning Styles EMNLP 2025

RCScore: Quantifying Response Consistency in Large Language Models EMNLP 2025

OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain EMNLP 2025

Arena-lite: Efficient and Reliable Large Language Model Evaluation via Tournament-Based Direct Comparisons EMNLP 2025

NOVA-63: Native Omni-lingual Versatile Assessments of 63 Disciplines EMNLP 2025

Confounding Factors in Relating Model Performance to Morphology EMNLP 2025

UNCERTAINTY-LINE: Length-Invariant Estimation of Uncertainty for Large Language Models EMNLP 2025

PoSum-Bench: Benchmarking Position Bias in LLM-based Conversational Summarization EMNLP 2025

We Need to Measure Data Diversity in NLP — Better and Broader EMNLP 2025

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs’ Responsiveness to Human Feedback EMNLP 2025

FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task COLING 2025