Evaluating GPT-4 Code Generation as a Grading Mechanism for “Explain-in-Plain-English” Questions
The ability to “Explain in Plain English” (EiPE) is a critical skill for students in introductory programming courses to develop. However, evaluating this skill has been challenging, as manual grading is time consuming and not easily automated. Constructing an effective prompt for a language model, such as OpenAI’s GPT-4, to generate code that achieves a specified goal bears a striking resemblance to the skill of EiPE. In this paper, we explore the potential of using test cases run on code generated by GPT-4 from students’ EiPE responses as a grading mechanism for EiPE questions. We applied this proposed grading method to a corpus of EiPE responses collected from past exams, then measured agreement between the results of this grading method and human graders. Overall, we find moderate agreement between the human raters and the results of the unit tests run on the generated code. This appears to be attributable to GPT-4’s code generation being more lenient than human graders on low-level descriptions of code.