Evaluation

345 directly classified papers

Papers

Compare without Despair: Reliable Preference Evaluation with Generation Separability EMNLP 2024

Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision AAAI 2024

Impact of Decoding Methods on Human Alignment of Conversational LLMs ACL 2024

A Systematic Analysis on the Temporal Generalization of Language Models in Social Media ACL 2024

CORES: Convolutional Response-based Score for Out-of-distribution Detection CVPR 2024

Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians ACL 2024

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark ACL 2024

Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations AAAI 2024

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation CVPR 2024

Evaluating Automatic Metrics with Incremental Machine Translation Systems EMNLP 2024

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores ACL 2024

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL 2024

CAVA: A Tool for Cultural Alignment Visualization & Analysis EMNLP 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024

“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models ACL 2024

Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks AAAI 2024

RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation EMNLP 2024

The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse ACL 2024

StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation ACL 2024

Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ ACL 2024

Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs ACL 2024

MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity SEMEVAL 2024

FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models CVPR 2024

The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance EMNLP 2024

Revisiting Query Variation Robustness of Transformer Models EMNLP 2024