SELF-[IN]CORRECT: LLMs Struggle with Discriminating Self-Generated Responses

Dongwei Jiang; Jingyu Zhang; Orion Weller; Nathaniel Weir; Benjamin Van Durme; Daniel Khashabi

AAAI 2025

Abstract

Can LLMs consistently improve their previous outputs for better results? For this to be true, LLMs would need to be better at discriminating among previously generated alternatives than at generating initial responses. We explore the validity of this hypothesis in practice. We first formulate a unified framework that allows us to compare the generative and discriminative capability of any model on any task. In our resulting experimental analysis of several open-source and industrial LLMs, we observe that models are not reliably better at discriminating among previously generated alternatives than at generating initial responses. This finding challenges the notion that LLMs may be able to enhance their performance only through their own judgment.
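The comparison described in the abstract can be illustrated with a small sketch. The abstract does not spell out the framework, so the definitions here are assumptions made for illustration: generation accuracy is taken as the average correctness of a model's sampled responses, and discrimination accuracy as the correctness of the single response the same model picks from those samples. The function names and the toy data are hypothetical.

```python
def generation_accuracy(correct_flags):
    """Average, over prompts, of the fraction of sampled responses that are correct.

    correct_flags: one list of booleans per prompt, one boolean per sampled response.
    (Assumed definition, not taken from the paper.)
    """
    per_prompt = [sum(1.0 for f in flags if f) / len(flags) for flags in correct_flags]
    return sum(per_prompt) / len(per_prompt)


def discrimination_accuracy(correct_flags, chosen):
    """Fraction of prompts where the response the model selected is correct.

    chosen: for each prompt, the index of the response the model picked
    from its own samples. (Assumed definition, not taken from the paper.)
    """
    hits = [1.0 if flags[i] else 0.0 for flags, i in zip(correct_flags, chosen)]
    return sum(hits) / len(hits)


def discrimination_minus_generation(correct_flags, chosen):
    """Positive values mean selecting among samples beats sampling once at random."""
    return discrimination_accuracy(correct_flags, chosen) - generation_accuracy(correct_flags)


# Toy data: two prompts, four sampled responses each (hypothetical).
flags = [[True, False, False, True], [False, False, True, False]]
picks = [0, 1]  # the model picks a correct answer for prompt 0, a wrong one for prompt 1

gen = generation_accuracy(flags)                     # 0.375
disc = discrimination_accuracy(flags, picks)         # 0.5
gap = discrimination_minus_generation(flags, picks)  # 0.125
```

Under these assumed definitions, a gap that is not reliably positive across tasks corresponds to the abstract's observation that models are not reliably better at discriminating than at generating.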

Topics

Artificial Intelligence > Core AI > Interpretability; Machine Learning > Optimization & Theory > Learning Theory; Natural Language Processing > Resources & Methods > Large Language Models; Artificial Intelligence > Core AI > Reasoning; Deep Learning > Models > Large Language Models; Deep Learning > Optimization & Theory > Evaluation

Keywords

model evaluation; generative capability; large language model; discriminative capability; response discrimination
