Evaluation

150 directly classified papers

Papers

Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems EMNLP 2022

BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation EMNLP 2022

Enabling Detailed Action Recognition Evaluation Through Video Dataset Augmentation NeurIPS 2022

Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature EMNLP 2022

ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding ACL 2022

Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis EMNLP 2022

Assessing the Linguistic Knowledge in Arabic Pre-trained Language Models Using Minimal Pairs EMNLP 2022

IMPLI: Investigating NLI Models’ Performance on Figurative Language ACL 2022

Testing Cross-Database Semantic Parsers With Canonical Utterances EMNLP 2021

Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing CVPR 2021

Shortcutted Commonsense: Data Spuriousness in Deep Learning of Commonsense Reasoning EMNLP 2021

Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator EMNLP 2021

Perception Score: A Learned Metric for Open-ended Text Generation Evaluation AAAI 2021

ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings EMNLP 2021

Just Ask! Evaluating Machine Translation by Asking and Answering Questions EMNLP 2021

TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing ACL 2021

LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing ACL 2021

MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics EMNLP 2020

TRENDNERT: A Benchmark for Trend and Downtrend Detection in a Scientific Domain AAAI 2020

Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation ACL 2020

Analyzing Compositionality-Sensitivity of NLI Models AAAI 2019

Towards Actual (Not Operational) Textual Style Transfer Auto-Evaluation EMNLP 2019

Do You Know That Florence Is Packed with Visitors? Evaluating State-of-the-art Models of Speaker Commitment ACL 2019

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference ACL 2019

Improving the Robustness of Deep Neural Networks via Stability Training CVPR 2016