Research Explorer
Evaluation
150 directly classified papers
Papers per year
2016: 1
2019: 4
2020: 3
2021: 9
2022: 11
2023: 19
2024: 40
2025: 63
Papers
Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems
EMNLP 2022
BERTScore is Unfair: On Social Bias in Language Model-Based Metrics for Text Generation
EMNLP 2022
Enabling Detailed Action Recognition Evaluation Through Video Dataset Augmentation
NeurIPS 2022
Making Science Simple: Corpora for the Lay Summarisation of Scientific Literature
EMNLP 2022
ePiC: Employing Proverbs in Context as a Benchmark for Abstract Language Understanding
ACL 2022
Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis
EMNLP 2022
Assessing the Linguistic Knowledge in Arabic Pre-trained Language Models Using Minimal Pairs
EMNLP 2022
IMPLI: Investigating NLI Models’ Performance on Figurative Language
ACL 2022
Testing Cross-Database Semantic Parsers With Canonical Utterances
EMNLP 2021
Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing
CVPR 2021
Shortcutted Commonsense: Data Spuriousness in Deep Learning of Commonsense Reasoning
EMNLP 2021
Trainable Ranking Models to Evaluate the Semantic Accuracy of Data-to-Text Neural Generator
EMNLP 2021
Perception Score: A Learned Metric for Open-ended Text Generation Evaluation
AAAI 2021
ESTIME: Estimation of Summary-to-Text Inconsistency by Mismatched Embeddings
EMNLP 2021
Just Ask! Evaluating Machine Translation by Asking and Answering Questions
EMNLP 2021
TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
ACL 2021
LEGOEval: An Open-Source Toolkit for Dialogue System Evaluation via Crowdsourcing
ACL 2021
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
EMNLP 2020
TRENDNERT: A Benchmark for Trend and Downtrend Detection in a Scientific Domain
AAAI 2020
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation
ACL 2020
Analyzing Compositionality-Sensitivity of NLI Models
AAAI 2019
Towards Actual (Not Operational) Textual Style Transfer Auto-Evaluation
EMNLP 2019
Do You Know That Florence Is Packed with Visitors? Evaluating State-of-the-art Models of Speaker Commitment
ACL 2019
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
ACL 2019
Improving the Robustness of Deep Neural Networks via Stability Training
CVPR 2016