Evaluation

167 directly classified papers

Papers

LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems NAACL 2025

Semantic-Eval: A Semantic Comprehension Evaluation Framework for Large Language Models Generation without Training ACL 2025

A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability ACL 2025

StrucText-Eval: Evaluating Large Language Model’s Reasoning Ability in Structure-Rich Text ACL 2025

ALIGN-SIM: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings through Semantic Similarity Alignment EMNLP 2024

A linguistically-motivated evaluation methodology for unraveling model’s abilities in reading comprehension tasks EMNLP 2024

Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics EMNLP 2024

In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models EMNLP 2024

CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024

CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation ACL 2024

MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity SEMEVAL 2024

Estimating Agreement by Chance for Sequence Annotation ACL 2024

A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets NeurIPS 2024

Assessing “Implicit” Retrieval Robustness of Large Language Models EMNLP 2024

Rationale-Aware Answer Verification by Pairwise Self-Evaluation EMNLP 2024

Greed is All You Need: An Evaluation of Tokenizer Inference Methods ACL 2024

BenchIE^FL: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark ACL 2024

On the Content Bias in Fréchet Video Distance CVPR 2024

Language Models can Evaluate Themselves via Probability Discrepancy ACL 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024

ToMBench: Benchmarking Theory of Mind in Large Language Models ACL 2024

Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge EMNLP 2024

Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood AAAI 2024

Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning AISTATS 2024

Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability CVPR 2024