More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation

Yangtian Zi; Harshitha Menon; Arjun Guha

2025 AACL AACL 2025

More Than a Score: Probing the Impact of Prompt Specificity on LLM Code Generation

Abstract

AbstractState-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval (Chen et al., 2021) but underperform on specialized suites such as ParEval (Nichols et al., 2024). Is this due to LLMs missing domain knowledge or insufficient prompt detail is given? To answer this, we introduce PartialOrderEval, which augments any code generation benchmark with a partial order of prompts from minimal to maximally detailed. Applying it to HumanEval and both serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across different tasks, and a qualitative analysis highlights explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of prompt detail improvement.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — prompt specificity

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Natural Language Processing, Reinforcement Learning, Security & Privacy

Authors

Yangtian Zi , Harshitha Menon , Arjun Guha

Topics

Artificial Intelligence > Core AI > Foundation Models Machine Learning > Application Areas > Efficient Computing

Keywords

code generation domain-specific evaluation prompt sensitivity prompt specificity pass@1 metric

Download PDF

Related papers

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge 2025

Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems 2025

Enhancing Training Data Quality through Influence Scores for Generalizable Classification: A Case Study on Sexism Detection 2025

CtrlShift: Steering Language Models for Dense Quotation Retrieval with Dynamic Prompts 2025

A Diagnostic Framework for Auditing Reference-Free Vision-Language Metrics 2025