Research Explorer
Evaluation
515 directly classified papers
Papers per year
2003: 1
2004: 1
2005: 1
2006: 1
2008: 2
2009: 1
2010: 1
2013: 5
2016: 3
2017: 8
2018: 11
2019: 24
2020: 25
2021: 34
2022: 68
2023: 74
2024: 105
2025: 147
2026: 3
Papers
BOSE: A Systematic Evaluation Method Optimized for Base Models
ACL 2025
100-LongBench: Are de facto Long-Context Benchmarks Literally Evaluating Long-Context Ability?
ACL 2025
Lost in Transcription, Found in Distribution Shift: Demystifying Hallucination in Speech Foundation Models
ACL 2025
Beyond Pointwise Scores: Decomposed Criteria-Based Evaluation of LLM Responses
EMNLP 2025
Reproducing the Argument Quality Prediction of Project Debater
ACL 2025
Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension?
ACL 2025
LLMs as Medical Safety Judges: Evaluating Alignment with Human Annotation in Patient-Facing QA
ACL 2025
Noise-Aware Evaluation of Object Detectors
WACV 2025
KazBench-KK: A Cultural-Knowledge Benchmark for Kazakh
ACL 2025
Knockout LLM Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons
ACL 2025
Evaluating LLMs with Multiple Problems at once
ACL 2025
(Towards) Scalable Reliable Automated Evaluation with Large Language Models
ACL 2025
ReproHum #0744-02: A Reproduction of the Human Evaluation of Meaning Preservation in “Factorising Meaning and Form for Intent-Preserving Paraphrasing”
ACL 2025
Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics
ACL 2025
An Analysis of Datasets, Metrics and Models in Keyphrase Generation
ACL 2025
Findings of the IWSLT 2025 Evaluation Campaign
ACL 2025
Factual Knowledge in Language Models: Robustness and Anomalies under Simple Temporal Context Variations
ACL 2025
The UNLP 2025 Shared Task on Detecting Social Media Manipulation
ACL 2025
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
EMNLP 2025
PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
ACL 2025
M-IFEval: Multilingual Instruction-Following Evaluation
NAACL 2025
Logical forms complement probability in understanding language model (and human) performance
ACL 2025
From Parameters to Performance: A Data-Driven Study on LLM Structure and Development
EMNLP 2025
Can’t See the Forest for the Trees: Benchmarking Multimodal Safety Awareness for Multimodal LLMs
ACL 2025
Single Ground Truth Is Not Enough: Adding Flexibility to Aspect-Based Sentiment Analysis Evaluation
NAACL 2025