Research Explorer
Evaluation
167 directly classified papers
Papers per year
2007: 1
2009: 1
2010: 1
2011: 2
2012: 1
2013: 2
2014: 1
2015: 1
2017: 1
2018: 7
2019: 15
2020: 14
2021: 11
2022: 25
2023: 31
2024: 24
2025: 29
Papers
BI-Bench: A Comprehensive Benchmark Dataset and Unsupervised Evaluation for BI Systems
ACL 2025
Preserve or Modify? Context-Aware Evaluation for Balancing Preservation and Modification in Text-Guided Image Editing
CVPR 2025
CLEME2.0: Towards Interpretable Evaluation by Disentangling Edits for Grammatical Error Correction
ACL 2025
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
ACL 2025
An Analysis of Datasets, Metrics and Models in Keyphrase Generation
ACL 2025
Towards a Principled Evaluation of Knowledge Editors
ACL 2025
Gen3DEval: Using vLLMs for Automatic Evaluation of Generated 3D Objects
CVPR 2025
DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation
AAAI 2025
StrucText-Eval: Evaluating Large Language Model’s Reasoning Ability in Structure-Rich Text
ACL 2025
ProxAnn: Use-Oriented Evaluations of Topic Models and Document Clustering
ACL 2025
Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments
ACL 2025
(Towards) Scalable Reliable Automated Evaluation with Large Language Models
ACL 2025
EditBoard: Towards a Comprehensive Evaluation Benchmark for Text-Based Video Editing Models
AAAI 2025
Evaluate with the Inverse: Efficient Approximation of Latent Explanation Quality Distribution
AAAI 2025
How Far are AI-generated Videos from Simulating the 3D Visual World: A Learned 3D Evaluation Approach
ICCV 2025
EXECUTE: A Multilingual Benchmark for LLM Token Understanding
ACL 2025
A Simple and Comprehensive Benchmark for Single-Cell Transcriptomics
AAAI 2025
LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems
NAACL 2025
RefVNLI: Towards Scalable Evaluation of Subject-driven Text-to-image Generation
EMNLP 2025
Parametric ρ-Norm Scaling Calibration
AAAI 2025
Temporal-Aware Evaluation and Learning for Temporal Graph Neural Networks
AAAI 2025
Semantic-Eval: A Semantic Comprehension Evaluation Framework for Large Language Models Generation without Training
ACL 2025
Using Next Sentence Prediction to Test ChatGPT’s Text Comprehension (Student Abstract)
AAAI 2025
Cherry-Picking in Time Series Forecasting: How to Select Datasets to Make Your Model Shine
AAAI 2025
Pathway to Relevance: How Cross-Encoders Implement a Semantic Variant of BM25
EMNLP 2025