Research Explorer
Evaluation
167 directly classified papers
Papers per year
2007: 1
2009: 1
2010: 1
2011: 2
2012: 1
2013: 2
2014: 1
2015: 1
2017: 1
2018: 7
2019: 15
2020: 14
2021: 11
2022: 25
2023: 31
2024: 24
2025: 29
Papers
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
ACL 2024
MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
SEMEVAL 2024
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
ACL 2024
Why Is the Winner the Best?
CVPR 2023
Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training
EMNLP 2023
Class Adaptive Network Calibration
CVPR 2023
Exploring Context-Aware Evaluation Metrics for Machine Translation
EMNLP 2023
Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency
ACL 2023
GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
ACL 2023
Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation
ACL 2023
Optimizing ROC Curves with a Sort-Based Surrogate Loss for Binary Classification and Changepoint Detection
JMLR 2023
DPAUC: Differentially Private AUC Computation in Federated Learning
AAAI 2023
DeltaScore: Fine-Grained Story Evaluation with Perturbations
EMNLP 2023
BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric
ACL 2023
A Closer Look into Using Large Language Models for Automatic Evaluation
EMNLP 2023
FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation
EMNLP 2023
Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction
ACL 2023
BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training
ACL 2023
A Data-Based Perspective on Transfer Learning
CVPR 2023
WikiHowQA: A Comprehensive Benchmark for Multi-Document Non-Factoid Question Answering
ACL 2023
ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision
EMNLP 2023
Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations
ACL 2023
HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation
ACL 2023
BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics
ACL 2023
The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation
ACL 2023