We tried ChatGPT for vulnerability fixes. Most flaws are too complex for generative AI alone
An experiment with ChatGPT 3.5 found that 80% of code fixes were unusable or introduced new vulnerabilities
AI can be used in reducing vulnerability backlogs - but treat the results with caution.
The success of ChatGPT and other generative AI tools in the workplace is a testament to the power of the tech to change how we all work.
Organisations have tried many different applications, from writing blogs to writing code, with varying degrees of success.
Those of us in the DevSecOps space have also been busy experimenting. So far, generative AI has proven to add value, but relying on AI alone is risky.
Our team experimented with publicly available generative AI tools. We aimed to observe how these tools handle fixing SAST-reported findings and how their answers measure when compared to an AppSec expert.
Here's what we found:
Experimenting with generative AI to fix vulnerabilities
For this process, our team gathered a set of 105 SAST findings reported by a couple of SAST tools available in the marketplace on two different known vulnerable OWASP applications. These apps are often used in training benchmark tools, so we used JuiceShop and WebGoat for our observation.
The remedial work to fix the 105 issues identified in the SAST reports was automated with OpenAI's API using GPT 3.5. Furthermore, the team pre-processed and cleaned up the data to enable ChatGPT to process the information efficiently.
It is important to note that most developers may not have the luxury of cleaning up the data they feed into their AI models.
How did ChatGPT do?
As expected, it was a mixed result. Of the 105 issues identified, ChatGPT provided suggestions that would have successfully resolved the problem 30% of the time. However, many of the recommended fixes did not follow industry best practices that properly maintain code security and quality.
Furthermore, in another 19% of cases ChatGPT suggested coding in the area of the reported vulnerabilities but did not actually fix the underlying issue - or, in some cases, introduced new vulnerabilities to the threat environment.
Additionally, more than half (51%) of the suggested fixes were simply unusable. Our experiment found that the fixes provided would have negatively impacted other, non-relevant parts of the application, or it only provided templates that would then require further developer work to write the code to fix the reported vulnerability.
In some cases, the code was syntactically wrong and caused the application not to compile or run, breaking the application.
More worryingly, the AI tool generated references, functions and code libraries that do not exist. This phenomenon, generally called AI ‘hallucinations,' stems from AI's training data limitations, inherent biases and inability to understand real-world information.
Limitations
ChatGPT converts each word into a legible token whenever you ask a question. It had a limit of 4,096 tokens at the time of the experiment, which puts a ceiling on the input and output length per task.
For example, if a user enters code in the length of 2,000 tokens and asks ChatGPT 3.5 to fix it, the tool does not have enough room to answer.
Considering that specific tasks can run into tens of thousands of coding lines, the concern is that ChatGPT's response may become truncated or stop abruptly. When you also consider that the free version's limit of the tool is even smaller, teams may have fewer tokens to work with.
The practicality of using ChatGPT for large files can be frustrating, because developers will need time to learn and understand exactly what parts of the code they must provide to ChatGPT. This process is only sometimes straightforward and may involve combining codes from multiple files to feed into the AI tool. This is a result of the input limitation, which can prolong the remediation work immeasurably.
Generative AI is an excellent tool with the potential to create positive change in application security. It might not yet be the magical answer DevSecOps teams hoped for, but in skilled hands, combined with thorough research and other technologies, AI can cautiously be used in reducing vulnerability backlogs.
However, our findings make clear that the majority of generative AI vulnerability fixes require expert oversight, particularly for complex vulnerabilities.
Eitan Worcel is CEO and co-founder at Mobb