Does GitHub’s Copilot AI really “improve code quality”? Err, no, argues developer

The data only says so when you torture it beyond recognition

GitHub’s claim that its Copilot AI helps improve code quality has been challenged by a developer who analysed the research and data behind it.

Romania-based developer Dan Cîmpianu argued that the assignment given to the 243 developers in the November study was simplistic and probably already well covered by the material used to train the AI.

He also said the code reviews were not as rigorous as they should have been in assessing the quality of the generated code.

As part of the study, released last month, GitHub had asked developers to put together a website for handling restaurant reviews – a common create, read, update, delete (CRUD) kind of application. Each submission was then supposed to be reviewed by ten other developers.

Such a basic application is scarcely a rigorous test of the AI’s ability, said Cîmpianu; and of the 243 developers commissioned, only 202 submissions were considered valid. Of those, 104 used Copilot and 98 were coded without it, so that the two groups could be compared. However, just 1,293 code reviews were conducted instead of the 2,020 expected (202 submissions × 10 reviews each).

On top of that, Cîmpianu questioned some of the statistical assertions GitHub made to support its claims. For example, the company says the AI-generated code contained 13.6% fewer code errors – 16 lines of code per error without AI, against 18.2 lines of code per error with AI.

But Cîmpianu pointed out that such an extrapolation is misleading on such a small and limited sample. He also suggested the errors counted were largely style issues rather than functional bugs.
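For readers checking the arithmetic, here is a minimal sketch of how the published lines-per-error figures convert into a percentage. The input numbers are the study’s; the conversion itself is illustrative, not a reconstruction of GitHub’s published method:

```python
# Quick arithmetic check on the study's headline figure.
# Inputs are the published lines-per-error numbers; the conversion
# below is illustrative, not GitHub's own methodology.
lines_per_error_without_ai = 16.0
lines_per_error_with_ai = 18.2

# Error density: errors per line of code in each group.
errors_per_line_without_ai = 1 / lines_per_error_without_ai  # 0.0625
errors_per_line_with_ai = 1 / lines_per_error_with_ai        # ~0.0549

# Relative reduction in error density with Copilot:
reduction = 1 - errors_per_line_with_ai / errors_per_line_without_ai
print(f"fewer errors per line: {reduction:.1%}")   # ~12.1%

# Taking the ratio the other way round (more lines between errors)
# produces a bigger-looking number, closer to the claimed 13.6%:
improvement = lines_per_error_with_ai / lines_per_error_without_ai - 1
print(f"more lines per error: {improvement:.1%}")  # ~13.7%
```

Which percentage you quote depends on the direction of the ratio – exactly the kind of framing choice Cîmpianu objects to on a sample this small.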

“I make it no secret that I am against the AI hype train. So as much as I’d like to stay impartial for this, I must acknowledge my internal bias against the findings of this study,” he wrote.

“I’m not a scientist either, and my opinions aren’t facts, but as a software developer I’ve accrued a fantastic BS detector towards tech marketing, and this article maxed out its gauges.”

He described the GitHub task as one of the most “boring, repetitive, uninspired, and cognitively unchallenging aspects of development”, though admitted that such tasks might be ripe for automation.

“[But] if you really want to test Copilot, give it complex tasks, diverse tasks that involve huge SQL queries, regular expressions, shell-script deployments, anything more impressive than defining some REST stubs and type hints in Python.”
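For context, a Python REST stub of the kind he describes might look like the sketch below – a hypothetical FastAPI example with invented endpoint and model names, not code from the study – which shows how little of a CRUD review app goes beyond boilerplate:

```python
# Illustrative only: a minimal FastAPI sketch of the sort of CRUD
# "REST stubs and type hints in Python" Cîmpianu mentions. Model and
# endpoint names are hypothetical, not taken from the GitHub study.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class Review(BaseModel):
    restaurant: str
    rating: int
    comment: str

# In-memory store standing in for a real database.
reviews: dict[int, Review] = {}
next_id = 0

@app.post("/reviews")
def create_review(review: Review) -> dict[str, int]:
    global next_id
    next_id += 1
    reviews[next_id] = review
    return {"id": next_id}

@app.get("/reviews/{review_id}")
def read_review(review_id: int) -> Review:
    if review_id not in reviews:
        raise HTTPException(status_code=404, detail="Review not found")
    return reviews[review_id]

@app.put("/reviews/{review_id}")
def update_review(review_id: int, review: Review) -> Review:
    if review_id not in reviews:
        raise HTTPException(status_code=404, detail="Review not found")
    reviews[review_id] = review
    return review

@app.delete("/reviews/{review_id}")
def delete_review(review_id: int) -> dict[str, bool]:
    if reviews.pop(review_id, None) is None:
        raise HTTPException(status_code=404, detail="Review not found")
    return {"deleted": True}
```

Run under a server such as uvicorn, those four handlers cover the entire create/read/update/delete cycle – precisely the “uninspired” pattern Cîmpianu argues is already saturated in the training data.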

Nevertheless, even on a relatively simple task, the study’s numbers are all over the place – both in the number of participants who completed it and in the percentages used to give the impression of Copilot’s superiority. And having the study’s own participants review each other’s code will inevitably give rise to biases, Cîmpianu pointed out.

Ultimately, he concluded that the 5% improvement in overall productivity the study claimed had “the perfume of marketing” all over it, given the biases in the methodology. It was, he suggested, just an exercise to persuade C-level business leaders to pay for GitHub subscriptions instead of investing the money in their employees.

“And as for you, developers, if you can’t write good code without an AI, then you shouldn’t use one in the first place. No matter what space-age technology the AI over-fitters come at you with, nothing can substitute personal experience, and pride in your craft.”