Evaluation

167 directly classified papers

Papers

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation ACL 2024

MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity SEMEVAL 2024

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024

Why Is the Winner the Best? CVPR 2023

Zero-Shot Data Maps. Efficient Dataset Cartography Without Model Training EMNLP 2023

Class Adaptive Network Calibration CVPR 2023

Exploring Context-Aware Evaluation Metrics for Machine Translation EMNLP 2023

Lenient Evaluation of Japanese Speech Recognition: Modeling Naturally Occurring Spelling Inconsistency ACL 2023

GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation ACL 2023

Ranger: A Toolkit for Effect-Size Based Multi-Task Evaluation ACL 2023

Optimizing ROC Curves with a Sort-Based Surrogate Loss for Binary Classification and Changepoint Detection JMLR 2023

DPAUC: Differentially Private AUC Computation in Federated Learning AAAI 2023

DeltaScore: Fine-Grained Story Evaluation with Perturbations EMNLP 2023

BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric ACL 2023

A Closer Look into Using Large Language Models for Automatic Evaluation EMNLP 2023

FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation EMNLP 2023

Did the Models Understand Documents? Benchmarking Models for Language Understanding in Document-Level Relation Extraction ACL 2023

BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training ACL 2023

A Data-Based Perspective on Transfer Learning CVPR 2023

WikiHowQA: A Comprehensive Benchmark for Multi-Document Non-Factoid Question Answering ACL 2023

ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision EMNLP 2023

Automated Metrics for Medical Multi-Document Summarization Disagree with Human Evaluations ACL 2023

HAUSER: Towards Holistic and Automatic Evaluation of Simile Generation ACL 2023

BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics ACL 2023

The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation ACL 2023