Research Explorer
Evaluation
345 directly classified papers
Papers per year
2014: 1
2016: 3
2017: 1
2018: 9
2019: 21
2020: 34
2021: 32
2022: 50
2023: 28
2024: 90
2025: 76
Papers
Compare without Despair: Reliable Preference Evaluation with Generation Separability
EMNLP 2024
Integrated Decision Gradients: Compute Your Attributions Where the Model Makes Its Decision
AAAI 2024
Impact of Decoding Methods on Human Alignment of Conversational LLMs
ACL 2024
A Systematic Analysis on the Temporal Generalization of Language Models in Social Media
ACL 2024
CORES: Convolutional Response-based Score for Out-of-distribution Detection
CVPR 2024
Knowledge Acquisition through Continued Pretraining is Difficult: A Case Study on r/AskHistorians
ACL 2024
Beyond Text: Unveiling Multimodal Proficiency of Large Language Models with MultiAPI Benchmark
ACL 2024
Empowering CAM-Based Methods with Capability to Generate Fine-Grained and High-Faithfulness Explanations
AAAI 2024
BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation
CVPR 2024
Evaluating Automatic Metrics with Incremental Machine Translation Systems
EMNLP 2024
LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores
ACL 2024
Generalization or Memorization: Data Contamination and Trustworthy Evaluation for Large Language Models
ACL 2024
CAVA: A Tool for Cultural Alignment Visualization & Analysis
EMNLP 2024
Challenging Large Language Models with New Tasks: A Study on their Adaptability and Robustness
ACL 2024
“My Answer is C”: First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
ACL 2024
Comparing the Robustness of Modern No-Reference Image- and Video-Quality Metrics to Adversarial Attacks
AAAI 2024
RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation
EMNLP 2024
The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse
ACL 2024
StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
ACL 2024
Evaluating the Elementary Multilingual Capabilities of Large Language Models with MultiQ
ACL 2024
Evaluating the Smooth Control of Attribute Intensity in Text Generation with LLMs
ACL 2024
MARiA at SemEval 2024 Task-6: Hallucination Detection Through LLMs, MNLI, and Cosine similarity
SEMEVAL 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models
CVPR 2024
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
EMNLP 2024
Revisiting Query Variation Robustness of Transformer Models
EMNLP 2024