Evaluation
150 directly classified papers
Papers per year
2016: 1
2019: 4
2020: 3
2021: 9
2022: 11
2023: 19
2024: 40
2025: 63
Papers
Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models
NIPS 2024
Evaluating Numerical Reasoning in Text-to-Image Models
NIPS 2024
Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles
NIPS 2024
RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering
EMNLP 2023
VIPHY: Probing “Visible” Physical Commonsense Knowledge
EMNLP 2023
CReTIHC: Designing Causal Reasoning Tasks about Temporal Interventions and Hallucinated Confoundings
EMNLP 2023
Exploring Context-Aware Evaluation Metrics for Machine Translation
EMNLP 2023
Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic
EMNLP 2023
CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care
NIPS 2023
Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning
ACL 2023
Toward Human-Like Evaluation for Natural Language Generation with Error Analysis
ACL 2023
Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation
ACL 2023
Beyond mAP: Towards Better Evaluation of Instance Segmentation
CVPR 2023
A Large-Scale Homography Benchmark
CVPR 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
NIPS 2023
Common Law Annotations: Investigating the Stability of Dialog System Output Annotations
ACL 2023
UINAUIL: A Unified Benchmark for Italian Natural Language Understanding
ACL 2023
HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation
EMNLP 2023
“Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text
EMNLP 2023
MEGA: Multilingual Evaluation of Generative AI
EMNLP 2023
Prompting is not a substitute for probability measurements in large language models
EMNLP 2023
INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations
EMNLP 2023
Feeding What You Need by Understanding What You Learned
ACL 2022
How General-Purpose Is a Language Model? Usefulness and Safety with Human Prompters in the Wild
AAAI 2022
MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation
EMNLP 2022