Evaluation

22 directly classified papers

Papers

Towards a Principled Evaluation of Knowledge Editors ACL 2025

Video-Bench: Human-Aligned Video Generation Benchmark CVPR 2025

EEE-Bench: A Comprehensive Multimodal Electrical And Electronics Engineering Benchmark CVPR 2025

DOMAINEVAL: An Auto-Constructed Benchmark for Multi-Domain Code Generation AAAI 2025

Standard Quality Criteria Derived from Current NLP Evaluations for Guiding Evaluation Design and Grounding Comparability and AI Compliance Assessments ACL 2025

(Towards) Scalable Reliable Automated Evaluation with Large Language Models ACL 2025

Benchmark Data Repositories for Better Benchmarking NIPS 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code ACL 2024

chrF-S: Semantics Is All You Need EMNLP 2024

MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality EMNLP 2024

MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task EMNLP 2024

Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian EMNLP 2024

Adaptive Labeling for Efficient Out-of-distribution Model Evaluation NIPS 2024

Who Evaluates the Evaluations? Objectively Scoring Text-to-Image Prompt Coherence Metrics with T2IScoreScore (TS2) NIPS 2024

CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion NIPS 2023

BLASER: A Text-Free Speech-to-Speech Translation Evaluation Metric ACL 2023

Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks EMNLP 2023

Evaluating the Knowledge Dependency of Questions EMNLP 2022

Automated Evaluation Metric for Terminology Consistency in MT EMNLP 2022

ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics EMNLP 2022

SeqScore: Addressing Barriers to Reproducible Named Entity Recognition Evaluation EMNLP 2021

Dscorer: A Fast Evaluation Metric for Discourse Representation Structure Parsing ACL 2020