Evaluation

345 directly classified papers

Papers

Improving Accuracy and Calibration via Differentiated Deep Mutual Learning CVPR 2025

FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models CVPR 2024

ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object CVPR 2024

CORES: Convolutional Response-based Score for Out-of-distribution Detection CVPR 2024

BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation CVPR 2024

VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models EMNLP 2024

Scaling Laws of Synthetic Images for Model Training ... for Now CVPR 2024

Forgetting Curve: A Reliable Method for Evaluating Memorization Capability for Long-Context Models EMNLP 2024

LawBench: Benchmarking Legal Knowledge of Large Language Models EMNLP 2024

Do LLMs Overcome Shortcut Learning? An Evaluation of Shortcut Challenges in Large Language Models EMNLP 2024

Towards Reproducible, Automated, and Scalable Anomaly Detection AAAI 2024

Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing AAAI 2024

Can Large Language Models Understand Real-World Complex Instructions? AAAI 2024

Discretization-Induced Dirichlet Posterior for Robust Uncertainty Quantification on Regression AAAI 2024

Benchmarking Segmentation Models with Mask-Preserved Attribute Editing CVPR 2024

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations EMNLP 2024

Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision AAAI 2024

Impact of Decoding Methods on Human Alignment of Conversational LLMs ACL 2024

A Systematic Analysis on the Temporal Generalization of Language Models in Social Media ACL 2024

Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians ACL 2024

Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark ACL 2024

Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations AAAI 2024

LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores ACL 2024

Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models ACL 2024

Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness ACL 2024