Evaluation

150 directly classified papers

Papers

Evaluating Mathematical Reasoning of Large Language Models: A Focus on Error Identification and Correction ACL 2024

Beyond Probabilities: Unveiling the Misalignment in Evaluating Large Language Models ACL 2024

FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions CVPR 2024

Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding CVPR 2024

Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models? EMNLP 2024

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners EMNLP 2024

Beyond Reference: Evaluating High Quality Translations Better than Human References EMNLP 2024

Annotation alignment: Comparing LLM and human annotations of conversational safety EMNLP 2024

Split and Merge: Aligning Position Biases in LLM-based Evaluators EMNLP 2024

ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models EMNLP 2024

Reasoning or a Semblance of it? A Diagnostic Study of Transitive Reasoning in LLMs EMNLP 2024

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories EMNLP 2024

Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts EMNLP 2024

BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs EMNLP 2024

CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation EMNLP 2024

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models EMNLP 2024

Evalverse: Unified and Accessible Library for Large Language Model Evaluation EMNLP 2024

M5 – A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks EMNLP 2024

CONSTRUCTURE: Benchmarking CONcept STRUCTUre REasoning for Multimodal Large Language Models EMNLP 2024

Plot Twist: Multimodal Models Don’t Comprehend Simple Chart Details EMNLP 2024

AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models EMNLP 2024

Factcheck-Bench: Fine-Grained Evaluation Benchmark for Automatic Fact-checkers EMNLP 2024

AXCEL: Automated eXplainable Consistency Evaluation using LLMs EMNLP 2024

Pitfalls and Outlooks in Using COMET EMNLP 2024

Micro-Bench: A Microscopy Benchmark for Vision-Language Understanding NeurIPS 2024