Evaluation

1654 directly classified papers

Papers

Memorization ≠ Understanding: Do Large Language Models Have the Ability of Scenario Cognition? EMNLP 2025

The Emperor’s New Reasoning: Format Imitation Overshadows Genuine Mathematical Understanding in SFT EMNLP 2025

Memorization or Reasoning? Exploring the Idiom Understanding of LLMs EMNLP 2025

From Understanding to Generation: An Efficient Shortcut for Evaluating Language Models EMNLP 2025

Transitive self-consistency evaluation of NLI models without gold labels EMNLP 2025

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles EMNLP 2025

DCR: Quantifying Data Contamination in LLMs Evaluation EMNLP 2025

Drivel-ology: Challenging LLMs with Interpreting Nonsense with Depth EMNLP 2025

Agent-as-Judge for Factual Summarization of Long Narratives EMNLP 2025

Scalable and Culturally Specific Stereotype Dataset Construction via Human-LLM Collaboration EMNLP 2025

Towards a Holistic and Automated Evaluation Framework for Multi-Level Comprehension of LLMs in Book-Length Contexts EMNLP 2025

Adaptively profiling models with task elicitation EMNLP 2025

Co-Eval: Augmenting LLM-based Evaluation with Machine Metrics EMNLP 2025

OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature EMNLP 2025

Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance EMNLP 2025

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements EMNLP 2025

BOUQuET: dataset, Benchmark and Open initiative for Universal Quality Evaluation in Translation EMNLP 2025

EGOILLUSION: Benchmarking Hallucinations in Egocentric Video Understanding EMNLP 2025

How to Protect Yourself from 5G Radiation? Investigating LLM Responses to Implicit Misinformation EMNLP 2025

Do LLMs Behave as Claimed? Investigating How LLMs Follow Their Own Claims using Counterfactual Questions EMNLP 2025

Can LLMs Extract Frame-Semantic Arguments? EMNLP 2025

Are Language Models Consequentialist or Deontological Moral Reasoners? EMNLP 2025

PatentScore: Multi-dimensional Evaluation of LLM-Generated Patent Claims EMNLP 2025

Can Large Language Models Outperform Non-Experts in Poetry Evaluation? A Comparative Study Using the Consensual Assessment Technique EMNLP 2025

UTER: Capturing the Human Touch in Evaluating Morphologically Rich and Low-Resource Languages NAACL 2025