Evaluation
167 directly classified papers
Papers per year
2007: 1
2009: 1
2010: 1
2011: 2
2012: 1
2013: 2
2014: 1
2015: 1
2017: 1
2018: 7
2019: 15
2020: 14
2021: 11
2022: 25
2023: 31
2024: 24
2025: 29
Papers
LLM The Genius Paradox: A Linguistic and Math Expert’s Struggle with Simple Word-based Counting Problems
NAACL 2025
Semantic-Eval: A Semantic Comprehension Evaluation Framework for Large Language Models Generation without Training
ACL 2025
A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability
ACL 2025
StrucText-Eval: Evaluating Large Language Model’s Reasoning Ability in Structure-Rich Text
ACL 2025
ALIGN-SIM: A Task-Free Test Bed for Evaluating and Interpreting Sentence Embeddings through Semantic Similarity Alignment
EMNLP 2024
A linguistically-motivated evaluation methodology for unraveling model’s abilities in reading comprehension tasks
EMNLP 2024
Beyond Correlation: Interpretable Evaluation of Machine Translation Metrics
EMNLP 2024
In2Core: Leveraging Influence Functions for Coreset Selection in Instruction Finetuning of Large Language Models
EMNLP 2024
CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation
ACL 2024
CharacterEval: A Chinese Benchmark for Role-Playing Conversational Agent Evaluation
ACL 2024
MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
SEMEVAL 2024
Estimating Agreement by Chance for Sequence Annotation
ACL 2024
A Benchmark Suite for Evaluating Neural Mutual Information Estimators on Unstructured Datasets
NeurIPS 2024
Assessing “Implicit” Retrieval Robustness of Large Language Models
EMNLP 2024
Rationale-Aware Answer Verification by Pairwise Self-Evaluation
EMNLP 2024
Greed is All You Need: An Evaluation of Tokenizer Inference Methods
ACL 2024
BenchIE^FL: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark
ACL 2024
On the Content Bias in Fréchet Video Distance
CVPR 2024
Language Models can Evaluate Themselves via Probability Discrepancy
ACL 2024
StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code
ACL 2024
ToMBench: Benchmarking Theory of Mind in Large Language Models
ACL 2024
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
EMNLP 2024
Hypothesis Testing for Class-Conditional Noise Using Local Maximum Likelihood
AAAI 2024
Cousins Of The Vendi Score: A Family Of Similarity-Based Diversity Metrics For Science And Machine Learning
AISTATS 2024
Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability
CVPR 2024