Evaluation

167 directly classified papers

Papers

BI-Bench: A Comprehensive Benchmark Dataset and Unsupervised Evaluation for BI Systems ACL 2025

Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing CVPR 2025

CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction ACL 2025

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability ACL 2025

An Analysis of Datasets, Metrics and Models in Keyphrase Generation ACL 2025

Towards a Principled Evaluation of Knowledge Editors ACL 2025

Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects CVPR 2025

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation AAAI 2025

StrucText-Eval: Evaluating Large Language Model’s Reasoning Ability in Structure-Rich Text ACL 2025

ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering ACL 2025

Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments ACL 2025

(Towards) Scalable Reliable Automated Evaluation with Large Language Models ACL 2025

EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models AAAI 2025

Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution AAAI 2025

How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach ICCV 2025

EXECUTE: A Multilingual Benchmark for LLM Token Understanding ACL 2025

A Simple and Comprehensive Benchmark for Single-Cell Transcriptomics AAAI 2025

LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems NAACL 2025

RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation EMNLP 2025

Parametric ρ-Norm Scaling Calibration AAAI 2025

Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks AAAI 2025

Semantic-Eval: A Semantic Comprehension Evaluation Framework for Large Language Models Generation without Training ACL 2025

Using Next Sentence Prediction to Test ChatGPT’s Text Comprehension (Student Abstract) AAAI 2025

Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine AAAI 2025

Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25 EMNLP 2025