This essay is a chapter from my book, The Intelligence Illusion: a practical guide to the business risks of Generative AI.
Language models should lend themselves to creating a variety of useful tools for software development and have the potential to revolutionise programming. Those tools don’t exist yet. What we have today might be the first step in that direction, but these tools need further improvement before they can be safely adopted by the field in general.
Code has a number of qualities that make it uniquely suited for language models. The structure and syntax of a programming language is much more consistent than any natural language. Its meaning usually doesn’t change based on context or culture. Hallucinations should be immediately discovered as errors. Train the model exclusively on permissively licensed open source code and you both make memorisation and overfitting non-issues, and minimise potential antagonism with the Free Software and Open Source communities.
Programming is the one field where the potential of language models is obvious to even the most ardent critic.
Potential, because current tools aren’t there yet, even though they feel ‘there’ to many programmers who use them. They still have a few issues.
These tools come in a few different varieties:
- Fancy autocompletes—sometimes labelled “code copilots” or “code assistants”, but they’re really just fancy autocompletes. You prompt one with a phrase, a query, or code, and it completes your input with its best guess.
- Code modification and conversion tools. Tools that generate tests for your code fall into this category.
- Chat-based interfaces that are used to generate or explain code in an interactive manner.
Code modification and conversion tools build on the core strengths of large language models and are as close to a risk-free application of this technology as you can get. The other two, however, have issues.
Autocompletes were the first to gain serious adoption in the form of GitHub Copilot. Unfortunately, accidentally copying GPL-licensed code into your project is a real possibility with GitHub Copilot, which has serious consequences for projects whose licence is either proprietary or incompatible with the GPL.
That doesn’t mean other “code copilots” or chat-based interfaces such as ChatGPT are free from risk. Autocompletes and chat-based interfaces create a dynamic that interferes with the programmer’s ability to assess the code they generate. Their design triggers two biases that make the coder more likely to accept code they would not find acceptable in other contexts.
1. Automation bias: People have a tendency to trust what the computer says over their own judgement. Or, to be more specific, when presented with a cognitive aid intended to help them think less, that’s exactly what they do: they think less. This leads them to be much less critical of the output of the cognitive aid than they would have been of that same output in other circumstances. This has led to serious issues in other industries, such as aerospace, where pilots’ trust in a plane’s autopilot over their own judgement has caused catastrophes. This is a known problem with how we, as humans, interact with machines or any other form of cognitive automation.
2. Anchoring bias: Our minds fixate on the first piece of information we encounter in a given context. When shopping, for example, the first price you see becomes the anchor against which you judge every other price. The first result from an autocomplete tool is likely to trigger our anchoring bias in the same way.
These two biases combined mean that users of code assistants are extremely likely to accept the first suggestion the tool makes that doesn’t cause errors.
That means autocompletes need to be substantially better than an average coder to avoid having a detrimental effect on overall code quality. Obviously, if programmers are going to be picking the first suggestion that works, that suggestion needs to be at least as good as what they would have written themselves unaided. What’s less obvious is that the lack of familiarity—not having written the generated code by hand themselves—is likely to lead them to miss bugs and integration issues that would have been trivially obvious if they had typed it out themselves. To balance that out, the generated code needs to be better than average, which is a tricky thing to ask of a system that’s specifically trained on mountains of average code.
Unfortunately, that mediocrity seems to be reflected in the output. GitHub Copilot, for example, seems to regularly generate code with security vulnerabilities.
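To make the failure mode concrete, here is a hypothetical sketch of the kind of vulnerability the security studies keep finding in assistant-generated code: an SQL query assembled by string interpolation. The function names and table are invented for illustration; only the second version, with a parameterised query, is safe.

```python
import sqlite3

# Hypothetical example: the kind of code an assistant might suggest.
def find_user_unsafe(conn, name):
    # Vulnerable: attacker-controlled `name` is spliced into the SQL.
    return conn.execute(
        f"SELECT id FROM users WHERE name = '{name}'"
    ).fetchall()

def find_user_safe(conn, name):
    # Parameterised query: the driver handles the value safely.
    return conn.execute(
        "SELECT id FROM users WHERE name = ?", (name,)
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "alice"), (2, "bob")])

payload = "x' OR '1'='1"
print(len(find_user_unsafe(conn, payload)))  # every row leaks: 2
print(len(find_user_safe(conn, payload)))    # no such user: 0
```

Both versions pass a casual test with ordinary input, which is exactly why an anchored, trusting reviewer is likely to wave the first one through.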
Another issue is that AI coding tools will never push back against bad ideas. They will always try to generate a solution, even when they have no hope of generating safe, working code. Many of the first ideas we have, as coders, are ill-advised. It comes with the territory. You have a problem. Think of a solution. Do some research to see how it would be done. Discover that the solution was a bad idea. Type “how to implement your own cryptography” into Stack Overflow and you will get answers that tell you, unequivocally, that you absolutely should not implement your own cryptography. Programming is full of these pitfalls and AI tools, with their hallucinations and a tendency to generate confident answers no matter what, are almost purpose-designed to trip you into them.
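As a hypothetical illustration of that crypto pitfall, here is the kind of repeating-key XOR “cipher” a confidently generated answer might produce, and why it fails: a single known plaintext leaks the entire key. This is a deliberately broken sketch, not any tool’s actual output.

```python
# Naive "encryption" of the sort a confident answer might produce.
# Do not use this for anything: it is trivially breakable.
def xor_encrypt(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = b"s3cr3t"
message = b"attack at dawn"
ciphertext = xor_encrypt(message, key)

# Known-plaintext attack: XORing ciphertext with the plaintext
# recovers the key stream, and therefore the key itself.
leaked = bytes(c ^ m for c, m in zip(ciphertext, message))
print(leaked[:len(key)])  # -> b's3cr3t'
```

A search engine surfaces the warnings alongside the snippet; a code assistant hands you only the snippet.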
Some of these pitfalls have to do with age. Platforms, libraries, and APIs become obsolete, insecure, and outdated, but language models are, because of their design, stuck in the past. Newer methods are going to be underrepresented in their training data set if they appear in it at all. You are likely, for example, to run into instances where they would recommend you use outdated open source libraries that have since been replaced with faster and more reliable platform implementations. Or, it might generate code that uses APIs that have been deprecated or even removed in current versions of the platform. If the AI copilot isn’t ‘fresh’ enough, it’s going to cause problems.
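A concrete example of that staleness, in Python: `datetime.utcnow()` is deprecated as of Python 3.12 in favour of timezone-aware timestamps, yet it is exactly what a model trained on years of older code is likely to suggest.

```python
from datetime import datetime, timezone

# The call a model trained on older code is likely to suggest.
# Deprecated since Python 3.12; returns a naive datetime.
stale = datetime.utcnow()

# The currently recommended, timezone-aware equivalent.
current = datetime.now(timezone.utc)

print(stale.tzinfo)    # None -- a common source of comparison bugs
print(current.tzinfo)  # UTC
```

Both lines run without complaint in most configurations, so nothing in the edit-compile-test loop flags the outdated suggestion.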
Code assistant language models will need to be updated much more frequently than natural language models if they are to remain useful in the long term.
A related issue is cost: the biggest expense for most software projects is maintenance.
Software rots. Platforms change from under them. Software libraries switch up their APIs, requiring changes in your project. The understanding a programmer has of the problem changes, revealing flaws in the initial implementation. Requirements change. Most of the work of programming is in updating and changing code, not in writing code from scratch.
Autocompletion tools are of no help in this context. Chat-based tools are only helpful if you need your own code explained back to you, which is now more likely to happen because you might not have written the code yourself in the first place.
Code generation tools can make this, the most expensive part of programming, even harder and more expensive. By making it incredibly easy to generate a lot of bug-filled, insecure, but broadly functional code, they are likely to lead to code base inflation.
Projects will be bigger while doing less, making them harder to fix, update, and maintain. By going faster, we will have slowed ourselves down. The cost-saving automation of generative AI is likely to be very expensive in the long run.
To a point. Remember “Don’t Believe ChatGPT - We Do NOT Offer a “Phone Lookup” Service,” February 2023, https://blog.opencagedata.com/post/dont-believe-chatgpt.
Ryan Fleury, “The Gullible Software Altruist,” July 2022, https://www.rfleury.com/p/the-gullible-software-altruist.
It’s concerning that instead of delivering GitHub Copilot as an unqualified positive, GitHub decided to snatch defeat from the jaws of victory and included copyleft licences such as the GPL in the training data set. It hints at a worrying lack of internal discussion and criticism. See “Questions Around Bias, Legalities in GitHub’s Copilot,” PWV Consultants, July 2021, https://www.pwvconsultants.com/blog/questions-around-bias-legalities-in-githubs-copilot/, for one example.
That’d be me.
I remain sceptical of this idea. You usually get better results in programming by starting with the test and then implementing code that passes the test. If these code generation tools end up promoting poor practices in writing tests, then that would more than nullify the benefit they provide in other ways.
Elaine Atwell, “GitHub Copilot Isn’t Worth the Risk,” Kolide, February 2023, https://www.kolide.com/blog/github-copilot-isn-t-worth-the-risk; Bradley M. Kuhn, “If Software Is My Copilot, Who Programmed My Software?” Software Freedom Conservancy, February 2022, https://sfconservancy.org/blog/2022/feb/03/github-copilot-copyleft-gpl/; “Analyzing the Legal Implications of GitHub Copilot - FOSSA,” Dependency Heaven, July 2021, https://fossa.com/blog/analyzing-legal-implications-github-copilot/.
Jeremy Howard, “fast.ai - Is GitHub Copilot a Blessing, or a Curse?” July 2021, https://www.fast.ai/posts/2021-07-19-copilot.html.
K. Mosier and L. Skitka, “Human Decision Makers and Automated Decision Aids: Made for Each Other?” 1996, https://www.semanticscholar.org/paper/Human-Decision-Makers-and-Automated-Decision-Aids%3A-Mosier-Skitka/ffb65e76ac46fd42d595ed9272296f0cbe8ca7aa.
Kathleen L. Mosier et al., “Automation Bias: Decision Making and Performance in High-Tech Cockpits,” The International Journal of Aviation Psychology 8, no. 1 (January 1998): 47–63, https://doi.org/10.1207/s15327108ijap0801_3; Kathleen L. Mosier et al., “Automation Bias, Accountability, and Verification Behaviors,” Proceedings of the Human Factors and Ergonomics Society Annual Meeting 40, no. 4 (October 1996): 204–8, https://doi.org/10.1177/154193129604000413.
Raja Parasuraman and Victor Riley, “Humans and Automation: Use, Misuse, Disuse, Abuse,” Human Factors: The Journal of the Human Factors and Ergonomics Society 39, no. 2 (June 1997): 230–53, https://doi.org/10.1518/001872097778543886.
“Anchoring Bias,” The Decision Lab, accessed April 3, 2023, https://thedecisionlab.com/biases/anchoring-bias.
Hammond Pearce et al., “Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions” (arXiv, December 2021), https://doi.org/10.48550/arXiv.2108.09293; Neil Perry et al., “Do Users Write More Insecure Code with AI Assistants?” (arXiv, December 2022), https://doi.org/10.48550/arXiv.2211.03622.
Tyler Glaiel, “Can GPT-4 *Actually* Write Code?” Substack newsletter, Tyler’s Substack, March 2023, https://tylerglaiel.substack.com/p/can-gpt-4-actually-write-code.
Luca Rossi, “On AI Freshness, the Pyramid Principle, and Hierarchies 💡,” April 2023, https://refactoring.fm/p/on-ai-freshness-the-pyramid-principle.