Evaluation

150 directly classified papers

Papers

Unveiling the Bias Impact on Symmetric Moral Consistency of Large Language Models NIPS 2024

Evaluating Numerical Reasoning in Text-to-Image Models NIPS 2024

Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles NIPS 2024

RoMQA: A Benchmark for Robust, Multi-evidence, Multi-answer Question Answering EMNLP 2023

VIPHY: Probing “Visible” Physical Commonsense Knowledge EMNLP 2023

CReTIHC: Designing Causal Reasoning Tasks about Temporal Interventions and Hallucinated Confoundings EMNLP 2023

Exploring Context-Aware Evaluation Metrics for Machine Translation EMNLP 2023

Analyzing Multilingual Competency of LLMs in Multi-Turn Instruction Following: A Case Study of Arabic EMNLP 2023

CARE-MI: Chinese Benchmark for Misinformation Evaluation in Maternity and Infant Care NIPS 2023

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning ACL 2023

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis ACL 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation ACL 2023

Beyond mAP: Towards Better Evaluation of Instance Segmentation CVPR 2023

A Large-Scale Homography Benchmark CVPR 2023

SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality NIPS 2023

Common Law Annotations: Investigating the Stability of Dialog System Output Annotations ACL 2023

UINAUIL: A Unified Benchmark for Italian Natural Language Understanding ACL 2023

HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation EMNLP 2023

“Fifty Shades of Bias”: Normative Ratings of Gender Bias in GPT Generated English Text EMNLP 2023

MEGA: Multilingual Evaluation of Generative AI EMNLP 2023

Prompting is not a substitute for probability measurements in large language models EMNLP 2023

INVITE: a Testbed of Automatically Generated Invalid Questions to Evaluate Large Language Models for Hallucinations EMNLP 2023

Feeding What You Need by Understanding What You Learned ACL 2022

How General-Purpose Is a Language Model? Usefulness and Safety with Human Prompters in the Wild AAAI 2022

MT-GenEval: A Counterfactual and Contextual Dataset for Evaluating Gender Accuracy in Machine Translation EMNLP 2022