Evaluation
150 directly classified papers
Papers per year
2016: 1
2019: 4
2020: 3
2021: 9
2022: 11
2023: 19
2024: 40
2025: 63
Papers
Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction
ACL 2024
Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models
ACL 2024
FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions
CVPR 2024
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
CVPR 2024
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
EMNLP 2024
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners
EMNLP 2024
Beyond Reference: Evaluating High Quality Translations Better than Human References
EMNLP 2024
Annotation alignment: Comparing LLM and human annotations of conversational safety
EMNLP 2024
Split and Merge: Aligning Position Biases in LLM-based Evaluators
EMNLP 2024
ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
EMNLP 2024
Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs
EMNLP 2024
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
EMNLP 2024
Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts
EMNLP 2024
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
EMNLP 2024
CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
EMNLP 2024
FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
EMNLP 2024
Evalverse: Unified and Accessible Library for Large Language Model Evaluation
EMNLP 2024
M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks
EMNLP 2024
CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models
EMNLP 2024
Plot Twist: Multimodal Models Don’t Comprehend Simple Chart Details
EMNLP 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
EMNLP 2024
Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers
EMNLP 2024
AXCEL: Automated eXplainable Consistency Evaluation using LLMs
EMNLP 2024
Pitfalls and Outlooks in Using COMET
EMNLP 2024
Micro-Bench: A Microscopy Benchmark for Vision-Language Understanding
NeurIPS 2024