The Future of the Error Message: Comparing Large Language Model and Novice Programmer Effectiveness in Fixing Errors
Research on error messages is of great interest to teachers and developers alike because improving Integrated Development Environments (IDEs) not only increases beginner student retention but also improves efficiency at all levels through more effective development tools. This study compares GPT-4’s effectiveness at fixing errors with that of novice programmers in order to assess the viability of Large Language Models (LLMs) as an enhancement tool. To do so, this study took a random sample of 100,000 sessions from all users of BlueJ 5, an IDE designed primarily for novice programmers, and measured how long programmers took to resolve coding errors. GPT-4 was then given erroneous code from the same dataset, along with the compiler error messages that code had produced, and was prompted to explain and fix the errors. This study replicated prior research proposing a Zipf-Mandelbrot distribution for the frequency of error messages, finding that the five most common errors accounted for 45% of all error messages. Comparing the fix rates of humans and GPT-4 showed that humans still fixed code at higher rates, but GPT-4 provided completely correct explanations of error messages 96% of the time. This study concludes that GPT-4 functions best as a tool for explaining the causes of error messages in an interactive format, rather than as a tool for producing correct code on its own.
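For context, the Zipf-Mandelbrot distribution referenced above models the relative frequency of the k-th most common error message. The fitted parameter values from the prior work are not restated here; the form below is only the standard definition of the distribution, in which the frequency of an error message falls off as a power of its rank:

f(k; N, q, s) = \frac{(k + q)^{-s}}{\sum_{i=1}^{N} (i + q)^{-s}}, \qquad k = 1, \dots, N

where N is the number of distinct error messages, q \geq 0 flattens the head of the distribution, and s > 0 controls how steeply frequency decays with rank (setting q = 0 recovers the ordinary Zipf distribution).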