StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Hannah McLean Babe; Sydney Nguyen; Yangtian Zi; Arjun Guha; Molly Q Feldman; Carolyn Jane Anderson

2024 ACL ACL 2024

StudentEval: A Benchmark of Student-Written Prompts for Large Language Models of Code

Abstract

AbstractCode LLMs have the potential to make it easier for non-experts to understand and write code. However, current CodeLLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have only completed one introductory Python course. StudentEval contains numerous non-expert prompts describing the same problem, enabling exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Hannah McLean Babe , Sydney Nguyen , Yangtian Zi , Arjun Guha , Molly Q Feldman , Carolyn Jane Anderson

Topics

Artificial Intelligence > Core AI > Foundation Models Machine Learning > Application Areas > Efficient Computing Natural Language Processing > Generation > Text Generation Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Large Language Models Deep Learning > Models > Large Language Models Natural Language Processing > Applications > Text Generation Machine Learning > Learning Types > Evaluation Machine Learning > Core Methods > Evaluation Machine Learning > Application Areas > Evaluation

Keywords

benchmark evaluation prompt engineering code generation code understanding natural language to code large language model

Download PDF

Related papers

Reinforcement Learning-Driven LLM Agent for Automated Attacks on LLMs 2024

EtymoLink: A Structured English Etymology Dataset 2024

Turkish Delights: A Dataset on Turkish Euphemisms 2024

Subjectivity Detection in English News using Large Language Models 2024

Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better 2024