Modern software quality, or why I think using language models for programming is a bad idea

This essay is based on a talk I gave at Hakkavélin, a hackerspace in Reykjavík. I had a wonderful time presenting to a lovely crowd, full of inquisitive and critically-minded people. Their questions and the discussion afterwards led to a number of improvements and clarifications as I turned my notes into this letter. This resulted in a substantial expansion of this essay. Many of the expanded points, such as the ones surrounding language model security, come directly from these discussions.

Many thanks to all of those who attended. The references for the presentation are also the references for this essay, which you can find all the way down in the footnotes section.

The best way to support this newsletter or my blog is to buy one of my books, The Intelligence Illusion: a practical guide to the business risks of Generative AI or Out of the Software Crisis. Or, you can buy them both as a bundle.

The software industry is very bad at software

Here’s a true story. Names withheld to protect the innocent.

A chain of stores here in Iceland recently upgraded their point-of-sale terminals to use new software.

Disaster, obviously, ensued. The barcode scanner stopped working properly, leading customer to be either overcharged or undercharged. Everything was extremely slow. The terminals started to lock up regularly. The new invoice printer sucked. A process that had been working smoothly was now harder and took more time.

The store, where my “informant” is a manager, deals with a lot of businesses, many of them stores. When they explain to their customers why everything is taking so long, their answer is generally the same:

“Ah, software upgrade. The same happened to us when we upgraded our terminals.”

This is the norm.

The new software is worse in every way than what it’s replacing. Despite having a more cluttered UI, it seems to have omitted a bunch of important features. Despite being new and “optimised”, it’s considerably slower than what it’s replacing.

This is also the norm.

Switching costs are, more often than not, massive for business software, and purchases are not decided by anybody who actually uses it. The quality of the software disconnects from sales performance very quickly in a growing software company. The company ends up “owning” the customer and no longer has any incentive to improve the software. In fact, because adding features is a key marketing and sales tactic, the software development cycle becomes an act of intentional, controlled deterioration.

Enormous engineering resources go into finding new ways to minimise the deterioration—witness Microsoft’s “ribbon menu”, a widget invented entirely to manage the feature escalation mandated by marketing.

This is the norm.

This has always been the norm, from the early days of software.

The software industry is bad at software. Great at shipping features and selling software. Bad at the software itself.

Why I started researching “AI” for programming

In most sectors of the software industry, sales performance and product quality are disconnected.

By its nature software has enormous margins which further cushion it from the effect of delivering bad products.

The objective impact of poor software quality on the bottom lines of companies like Microsoft, Google, Apple, Facebook, or the retail side of Amazon is a rounding error. The rest only need to deliver usable early versions, but once you have an established customer base and an experienced sales team, you can coast for a long, long time without improving your product in any meaningful way.

You only need to show change. Improvements don’t sell, it’s freshness that moves product. It’s like store tomatoes. Needs to look good and be fresh. They’re only going to taste it after they’ve paid, so who cares about the actual quality.

Uptime reliability is the only quality measurement with a real impact on ad revenue or the success of enterprise contracts, so that’s the only quality measurement that ultimately matters to them.

Bugs, shoddy UX, poor accessibility—even when accessibility is required by law—are non-factors in modern software management, especially at larger software companies.

The rest of us in the industry then copy their practices, and we mostly get away with it. Our margins may not be as enormous as Google’s, but they are still quite good compared to non-software industries.

We have an industry that’s largely disconnected from the consequences of making bad products, which means that we have a lot of successful but bad products.

The software crisis

Research bears this out. I pointed out in my 2021 essay Software Crisis 2.0 that very few non-trivial software projects are successful, even when your benchmarks are fundamentally conservative and short term.

For example, the following table is from a 2015 report by the Standish Group on their long term study in software project success:

	SUCCESSFUL	CHALLENGED	FAILED	TOTAL
Grand	6%	51%	43%	100%
Large	11%	59%	30%	100%
Medium	12%	62%	26%	100%
Moderate	24%	64%	12%	100%
Small	61%	32%	7%	100%

The Chaos Report 2015 resolution by project size

This is based on data that’s collected and anonymised from a number of organisations in a variety of industries. You’ll note that very few projects outright succeed. Most of them go over budget or don’t deliver the functionality they were supposed to. A frightening number of large projects outright fail to ship anything usable.

In my book Out of the Software Crisis, I expanded on this by pointing out that there are many classes and types of bugs and defects that we don’t measure at all, many of them catastrophic, which means that these estimates are conservative. Software project failure is substantially higher than commonly estimated, and success if much rarer than the numbers would indicate.

The true percentage of large software projects that are genuinely successful in the long term—that don’t have any catastrophic bugs, don’t suffer from UX deterioration, don’t end up having core issues that degrade their business value—is probably closer to 1–3%.

The management crisis

We also have a management crisis.

The methods of top-down-control taught to managers are counterproductive for software development.

Managers think design is about decoration when it’s the key to making software that generates value.
Trying to prevent projects that are likely to fail is harmful for your career, even if the potential failure is wide-ranging and potentially catastrophic.
When projects fail, it’s the critics who tried to prevent disaster who are blamed, not the people who ran it into the ground.
Supporting a project that is guaranteed to fail is likely to benefit your career, establish you as a “team player”, and protects you from harmful consequences when the project crashes.
Teams and staff management in the software industry commonly ignores every innovation and discovery in organisational psychology, management, and systems-thinking since the early sixties and operate mostly on management ideas that Henry Ford considered outdated in the 1920s.

We are a mismanaged industry that habitually fails to deliver usable software that actually solves the problems it’s supposed to.

Thus, Weinberg’s Law:

If builders built buildings the way programmers wrote programs, then the first woodpecker that came along would destroy civilization.

It’s into this environment that “AI” software development tools appear.

The punditry presented it as a revolutionary improvement in how we make software. It’s supposed to fix everything.

—This time the silver bullet will work!

Because, of course, we have had such a great track record with silver bullets.

So, I had to dive into it, research it, and figure out how it really worked. I needed to understand how generative AI works, as a system. I haven’t researched any single topic to this degree since I finished my PhD in 2006.

This research led me to write my book The Intelligence Illusion: a practical guide to the business risks of Generative AI. In it, I take a broader view and go over the risks I discovered that come with business use of generative AI.

But, ultimately, all that work was to answer the one question that I was ultimately interested in:

Is generative AI good or bad for software development?

To even have a hope of answering this, we first need to define our terms, because the conclusion is likely to vary a lot depending on how you define “AI” or even "software development.

A theory of software development as an inclusive system

Software development is the entire system of creating, delivering, and using a software project, from idea to end-user.

That includes the entire process on the development side—the idea, planning, management, design, collaboration, programming, testing, prototyping—as well as the value created by the system when it has been shipped and is being used.

My model is that of theory-building. From my essay on theory-building, which itself is an excerpt from Out of the Software Crisis:

Beyond that, software is a theory. It’s a theory about a particular solution to a problem. Like the proverbial garden, it is composed of a microscopic ecosystem of artefacts, each of whom has to be treated like a living thing. The gardener develops a sense of how the parts connect and affect each other, what makes them thrive, what kills them off, and how you prompt them to grow. The software project and its programmers are an indivisible and organic entity that our industry treats like a toy model made of easily replaceable lego blocks. They believe a software project and its developers can be broken apart and reassembled without dying.

What keeps the software alive are the programmers who have an accurate mental model (theory) of how it is built and works. That mental model can only be learned by having worked on the project while it grew or by working alongside somebody who did, who can help you absorb the theory. Replace enough of the programmers, and their mental models become disconnected from the reality of the code, and the code dies. That dead code can only be replaced by new code that has been ‘grown’ by the current programmers.

Design and user research is an integral part of the mental model the programmer needs to build, because none of the software components ultimately make sense without the end-user.

But, design is also vital because it is, to reuse Donald G. Reinertsen’s definition from Managing the Design Factory (p. 11), design is economically useful information that generally only becomes useful information through validation of some sort. Otherwise it’s just a guess.

The economic part usually comes from the end-user in some way.

This systemic view is inclusive by design as you can’t accurately measure the productivity or quality of a software project unless you look at it end to end, from idea to end-user.

If it doesn’t work for the end-user, then it’s a failure.
If the management is dysfunctional, then the entire system is dysfunctional.
If you keep starting projects based on unworkable ideas, then your programmer productivity doesn’t matter.

Lines of code isn’t software development. Working software, productively used, understood by the developers, is software development.

A high-level crash course in language models

Language models, small or large, are today either used as autocomplete copilots or as chatbots. Some of these language model tools would be used by the developer, some by the manager or other staff.

I’m treating generative media and image models as a separate topic, even when they’re used by people in the software industry to generate icons, graphics, or even UIs. They matter as well, but don’t have the same direct impact on software quality.

To understand the role these systems could play in software development, we need a little bit more detail on what language models are, how they are made, and how they work.

Most modern machine learning models are layered networks of parameters, each representing its connection to its neighbouring parameters. In a modern transformer-based language model most of these parameters are floating point numbers—weights—that describe the connection. Positive numbers are an excitatory connection. Negative numbers are inhibitory.

These models are built by feeding data through a tokeniser that breaks text into tokens—often one word per token—that are ultimately fed into an algorithm. That algorithm constructs the network, node by node, layer by layer, based on the relationships it calculates between the tokens/words. This is done in several runs and, usually, the developer of the model will evaluate after each run that the model is progressing in the right direction, with some doing more thorough evaluation at specific checkpoints.

The network is, in a very fundamental way, a mathematical derivation of the language in the data.

A language model is constructed from the data. The transformer code regulates and guides the process, but the distributions within the data set are what defines the network.

This process takes time—both collecting and managing the data set and the build process itself—which inevitably introduces a cut-off point for the data set. For OpenAI and Anthropic, that cut-off point is in 2021. For Google’s PaLM2 it’s early 2023.

Aside: not a brain

This is very, very different from how a biological neural network interacts with data. A biological brain is modified by input and data—its environment—but its construction is derived from nutrition, its chemical environment, and genetics.

The data set, conversely, is a deep and fundamental part of the language model. The algorithm’s code provides the process while the weights themselves are derived from the data, and the model itself is dead and static during input and output.

The construction process of a neural network is called “training”, which is yet another incredibly inaccurate term used by the industry.

A pregnant mother isn’t “training” the fetus.
A language model isn’t “trained” from the data, but constructed.

This is nonsense.

But this is the term that the AI industry uses, so we’re stuck with it.

A language model is a mathematical model built as a derivation of its training data. There is no actual training, only construction.

This is also why it’s inaccurate to say that these systems are inspired by their training data. Even though genes and nutrition make an artist’s mind they are not in what any reasonable person would call “their inspiration”. Even when they are sought out for study and genuine inspiration, it’s our representations of our understanding of the genes that are the true source of inspiration. Nobody sticks their hand in a gelatinous puddle of DNA and spontaneously gets inspired by the data it encodes.

Training data are construction materials for a language models. A language model can never be inspired. It is itself a cultural artefact derived from other cultural artefacts.

The machine learning process is loosely based on decades-old grossly simplified models of how brains work.

A biological neuron is a complex system in its own right—one of the more complex cells in an animal’s body. In a living brain, a biological neuron will use electricity, multiple different classes of neurotransmitters, and timing to accomplish its function in ways that we still don’t fully understand. It even has its own built-in engine for chemical energy.

The brain as a whole is composed of not just a massive neural network, but also layers of hormonal chemical networks that dynamically modify its function, both granularly and as a whole.

The digital neuron—a single signed floating point number—is to a biological neuron what a flat-head screwdriver is to a Tesla.

They both contain metal and that’s about the extent of their similarity.

The human brain contains roughly 100 billion neuron cells, a layered chemical network, and a cerebrovascular system that all integrate as a whole to create a functioning, self-aware system capable of general reasoning and autonomous behaviour. This system is multiple orders of magnitude more complex than even the largest language model to date, both in terms of individual neuron structure, and taken as a whole.

It’s important to remember this so that we don’t fall for marketing claims that constantly imply that these tools are fully functioning assistants.

The prompt

After all of this, we have a data set which can be used to generate text in response to prompts.

Prompts such as:

Who was the first man on the moon?

The input phrase, or prompt, has no structure beyond the linguistic. It’s just a blob of text. You can’t give the model commands or parameters separately from other input. Because of this, if your model lets a third party enter text, an attacker will always be able to bypass whatever restrictions you put on it. Control prompts or prefixes will be discovered and countermanded. Delimiters don’t work. Fine-tuning the model only limits the harm, but doesn’t prevent it.

This is called a prompt injection and what it means is that model input can’t be secured. You have to assume that anybody that can send text to the model has full access to it.

Language models need to be treated like an unsecured client and only very carefully integrated into other systems.

The response

What you’re likely to get back from that prompt would be something like:

On July 20, 1969, Neil Armstrong became the first human to step on the moon.

This is NASA’s own phrasing. Most answers on the web are likely to be variations on this, so the answer from a language model is likely to be so too.

The moon landing happens to be a fact, but the language model only knows it as a text.

The prompt we provided is strongly associated in the training data set with other sentences that are all variations of NASA’s phrasing of the answer. The model won’t answer with just “Neil Armstrong” because it isn’t actually answering the question, it’s responding with the text that correlates with the question. It doesn’t “know” anything.

The language model is fabricating a mathematically plausible response, based on word distributions in the training data.
There are no facts in a language model or its output. Only memorised text.

It only fabricates. It’s all “hallucinations” all the way down.

Occasionally those fabrications correlate with facts, but that is a mathematical quirk resulting from the fact that, on average, what people write roughly correlates with their understanding of a factual reality, which in turn roughly correlates with a factual reality.

A knowledge system?

To be able to answer that question and pass as a knowledge system, the model needs to memorise the answer, or at least parts of the phrase.

Because “AI” vendors are performing a sleight-of-hand here and presenting statistical language synthesis engines as knowledge retrieval systems, their focus in training and testing is on “facts” and minimising “falsehoods”. The model has no notion of either, as it’s entirely a language model, so the only way to square this circle is for the model to memorise it all.

To be able to answer a question factually, not “hallucinate”, and pass as a knowledge system, the model needs to memorise the answer.
The model doesn’t know facts, only text.
If you want a fact from it, the model will need to memorise text that correlates with that fact.

“Dr. AI”?

Vendors then compound this by using human exams as benchmarks for reasoning performance. The problem is that bar exams, medical exams, and diagnosis tests are specifically designed to mostly test rote memorisation. That’s what they’re for.

The human brain is bad at rote memorisation and generally it only happens with intensive work and practice. If you want to design a test that’s specifically intended to verify that somebody has spent a large amount of time studying a subject, you test for rote memorisation.

Many other benchmarks they use, such as those related to programming languages also require memorisation, otherwise the systems would just constantly make up APIs.

Vendors use human exams as benchmarks.
These are specifically designed to test rote memorisation, because that’s hard for humans.
Programming benchmarks also require memorisation. Otherwise, you’d only get pseudocode.

Between the tailoring of these systems for knowledge retrieval, and the use of rote memorisation exams and code generation as benchmarks, the tech industry has created systems where memorisation is a core part of how they function. In all research to date, memorisation has been key to language model performance in a range of benchmarks.^[1]

If you’re familiar with storytelling devices, this here would be a Chekhov’s gun. Observe! The gun is above the mantelpiece:

👉🏻👉🏻 memorisation!

Make a note of it, because those finger guns are going to be fired later.

Biases

Beyond question and answer, these systems are great at generating the averagely plausible text for a given prompt. In prose, current system output smells vaguely of sweaty-but-quiet LinkedIn desperation and over-enthusiastic social media. The general style will vary, but it’s always going to be the most plausible style and response based on the training data.

One consequence of how these systems are made is that they are constantly backwards-facing. Where brains are focused on the present, often to their detriment, “AI” models are built using historical data.

The training data encompasses thousands of diverse voices, styles, structures, and tones, but some word distributions will be more common in the set than others and those will end up dominating the output. As a result, language models tend to lean towards the “racist grandpa who has learned to speak fluent LinkedIn” end of the spectrum.^[2]

This has implications for a whole host of use cases:

Generated text is going to skew conservative in content and marketing copy in structure and vocabulary. (Bigoted, prejudiced, but polite and inoffensively phrased.)
Even when the cut-off date for the data set is recent, it’s still going to skew historical because what’s new is also comparatively smaller than the old.
Language models will always skew towards the more common, middling, mediocre, and predictable.
Because most of these models are trained on the web, much of which is unhinged, violent, pornographic, and abusive, some of that language will be represented in the output.

Modify, summarise, and “reason”

The superpower that these systems provide is conversion or modification. They can, generally, take text and convert it to another style or structure. Take this note and turn it into a formal prose, and it will! That’s amazing. I don’t think that’s a trillion-dollar industry, but it’s a neat feature that will definitely be useful.

They can summarise text too, but that’s much less reliable than you’d expect. It unsurprisingly works best with text that already provides its own summary, such as a newspaper article (first paragraphs always summarise the story), academic paper (the abstract), or corporate writing (executive summary). Anything that’s a mix of styles, voices, or has an unusual structure won’t work as well.

What little reasoning they do is entirely based on finding through correlation and re-enacting prior textual descriptions of reasoning. They fail utterly when confronted with adversarial or novel examples. They also fail if you rephrase the question so that it no longer correlates with the phrasing in the data set.^[3]

So, not actual reasoning. “Reasoning”, if you will. In other “AI” model genres these correlations are often called “shortcuts”, which feels apt.

To summarise:

Language models are a mathematical expression of the training data set.
Have very little in common with human brains.
Rely on inputs that can’t be secured.
Lie. Everything they output is a fabrication.
Memorise heavily.
Great for modifying text. No sarcasm. Genuinely good at this.
Occasionally useful for summarisation if you don’t mind being lied to regularly.
Don’t actually reason.

Why I believe “AI” for programming is a bad idea

If you recall from the start of this essay, I began my research into machine learning and language models because I was curious to see if they could help fix or improve the mess that is modern software development.

There was reason to be hopeful. Programming languages are more uniform and structured than prose, so it’s not too unreasonable to expect that they might lend themselves to language models. Programming language output can often be tested directly, which might help with the evaluation of each training run.

Training a language model on code also seems to benefit the model. Models that include substantial code in their data set tend to be better at correlative “reasoning” (to a point, still not actual reasoning), which makes sense since code is all about representing structured logic in text.

But, there is an inherent Catch 22 to any attempt at fixing software industry dysfunction with more software. The structure of the industry depends entirely on variables that everybody pretends are proxies for end user value, but generally aren’t. This will always tend to sabotage our efforts at industrial self-improvement.

The more I studied language models as a technology the more flaws I found until it became clear to me that odds are that the overall effect on software development will be harmful. The problem starts with the models themselves.

1. Language models can’t be secured

This first issue has less to do with the use of language models for software development and more to do with their use in software products, which is likely to be a priority for many software companies over the next few years.

Prompt injections are not a solved problem. OpenAI has come up with a few “solutions” in the past, but none of them actually worked. Everybody expects this to be fixed, but nobody has a clue how.

Language models are fundamentally based on the idea that you give it text as input and get text as output. It’s entirely possible that the only way to completely fix this is to invent a completely new kind of language model and spend a few years training it from scratch.

A language model needs to be treated like an unsecured client. It’s about as secure as a web page form. It’s vulnerable to a new generation of injection vulnerabilities, both direct and indirect, that we still don’t quite understand.^[4]

The training data set itself is also a security hazard. I’ve gone into this in more detail elsewhere^[5], but the short version is that training data set is vulnerable to keyword manipulation, both in terms of altering sentiment and censorship.

Again, fully defending against this kind of attack would seem to require inventing a completely new kind of language model.

Neither of these issues affect the use of language models for software development, but it does affect our work because we’re the ones who will be expected to integrate these systems into existing websites and products.

2. It encourages the worst of our management and development practices

A language model will never question, push back, doubt, hesitate, or waver.

Your managers are going to use it to flesh out and describe unworkable ideas, and it won’t complain. The resulting spec won’t have any bearing with reality.

People on your team will do “user research” by asking a language model, which it will do even though the resulting research will be fiction and entirely useless.

It’ll let you implement the worst ideas ever in your code without protest. Ask a copilot “how can I roll my own cryptography?” and it’ll regurgitate a half-baked expression of sha1 in PHP for you.

Think of all the times you’ve had an idea for an approach, looked up how to do it on the web, and found out that, no, this was a really bad idea? I have a couple of those every week when I’m in the middle of a project.

Language models don’t deliver productivity improvements. They increase the volume, unchecked by reason.

A core aspect of the theory-building model of software development is code that developers don’t understand is a liability. It means your mental model of the software is inaccurate which will lead you to create bugs as you modify it or add other components that interact with pieces you don’t understand.

Language model tools for software development are specifically designed to create large volumes of code that the programmer doesn’t understand. They are liability engines for all but the most experienced developer. You can’t solve this problem by having the “AI” understand the codebase and how its various components interact with each other because a language model isn’t a mind. It can’t have a mental model of anything. It only works through correlation.

These tools will indeed make you go faster, but it’s going to be accelerating in the wrong direction. That is objectively worse than just standing still.

3. Its User Interfaces do not work, and we haven’t found interfaces that do work

Human factors studies, the field responsible for designing cockpits and the like, discovered that humans suffer from an automation bias.

What it means is that when you have cognitive automation—something that helps you think less—you inevitably think less. That means that you are less critical of the output than if you were doing it yourself. That’s potentially catastrophic when the output is code, especially since the quality of the generated code is, understandably considering how the system works, broadly on the level of a novice developer.^[6]

Copilots and chatbots—exacerbated by anthropomorphism—seem to trigger our automation biases.

Microsoft themselves have said that 40% of GitHub Copilot’s output is committed unchanged.^[7]

Let’s not get into the question of how we, as an industry, put ourselves in the position where Microsoft can follow a line of code from their language model, through your text editor, and into your supposedly decentralised version control system.

People overwhelmingly seem to trust the output of a language model.

If it runs without errors, it must be fine.

But that’s never the case. We all know this. We’ve all seen running code turn out to be buggy as hell. But something in our mind switches off when we use tools for cognitive automation.

4. It’s biased towards the stale and popular

The biases inherent in these language models are bad enough when it comes to prose, but they become a functional problem in code.

Its JS code will lean towards React and node, most of it several versions old, and away from the less popular corners of the JS ecosystem.
The code is, inevitably, more likely to be built around CommonJS modules instead of the modern ESM modules.
It won’t know much about Deno or Cloudflare Workers.
It’ll always prefer older APIs over new. Most of these models won’t know about any API or module released after 2021. This is going to be an issue for languages such as Swift.
New platforms and languages don’t exist to it.
Existing data will outweigh deprecations and security issues.
Popular but obsolete or outdated open source projects will always win out over the up-to-date equivalent.

These systems live in the popular past, like the middle-aged man who doesn’t realise he isn’t the popular kid at school any more. Everything he thinks is cool is actually very much not cool. More the other thing.

This is an issue for software because our industry is entirely structured around constant change. Software security hinges on it. All of our practices are based on constant march towards the new and fancy. We go from framework to framework to try and find the magic solution that will solve everything. In some cases language models might help push back against that, but it’ll also push back against all the very many changes that are necessary because the old stuff turned out to be broken.

The software industry is built on change.
Language models are built on a static past.

5. No matter how the lawsuits go, this threatens the existence of free and open source software

Many AI vendors are mired in lawsuits.^[8]

These lawsuits all concentrate on the relationship between the training data set and the model and they do so from a variety of angles. Some are based on contract and licensing law. Others are claiming that the models violate fair use. It’s hard to predict how they will go. They might not all go the same way, as laws will vary across industries and jurisdictions.

No matter the result, we’re likely to be facing a major decline in the free and open source ecosystem.

All of these models are trained on open source code without payment or even acknowledgement, which is a major disincentive for contributors and maintainers. That large corporations might benefit from your code is a fixture of open source, but they do occasionally give back to the community.
Language models—built on open source code—commonly replace that code. Instead of importing a module to do a thing, you prompt your Copilot. The code generated is almost certainly based on the open source module, at least partially, but it has been laundered through the language model, disconnecting the programmer from the community, recognition, and what little reward there was.

Language models demotivate maintainers and drain away both resources and users. What you’re likely to be left with are those who are building core infrastructure or end-user software out of principle. The “free software” side of the community is more likely to survive than the rest. The Linux kernel, Gnome, KDE—that sort of thing.

The “open source” ecosystem, especially that surrounding the web and node, is likely to be hit the hardest. The more driven the open source project was by its proximity to either an employed contributor or actively dependent business, the bigger the impact from a shift to language models will be.

This is a serious problem for the software industry as arguably much of the economic value the industry has provided over the past decade comes from strip-mining open source and free software.

6. Licence contamination

Microsoft and Google don’t train their language models on their own code. GitHub’s Copilot isn’t trained on code from Microsoft’s office suite, even though many of its products are likely to be some of the largest React Native projects in existence. There aren’t many C++ code bases as big as Windows. Google’s repository is probably one of the biggest collection of python and java code you can find.

They don’t seem to use it for training, but instead train on collections of open source code that contain both permissive and copyleft licences.

Copyleft licences, if used, force you to release your own project under their licence. Many of them, even non-copyleft, have patent clauses, which is poison for quite a few employers. Even permissive licences require attribution, and you can absolutely get sued if you’re caught copying open source code without attribution.

Remember our Chekhov’s gun?

👉🏻👉🏻 memorisation!

Well, 👉🏻👉🏻 pewpew!!!

Turns out blindly copying open source code is problematic. Whodathunkit?

These models all memorise a lot, and they tend to copy what they memorise into their output. GitHub’s own numbers peg verbatim copies of code that’s at least 150 characters at 1%^[9], which is roughly the same, in terms of verbatim copying, as what you seem to get in other language models.

For context, that means that if you use a language model for development, a copilot or chatbot, three or four times a day, you’re going to get a verbatim copy of open source code injected into your project about once a month. If every team member uses one, then multiply that by the size of the team.

GitHub’s Copilot has a feature that lets you block verbatim copies. This obviously requires both a check, which slows the result down, and it will throw out a bunch of useful results, making the language model less useful. It’s already not as useful as it’s made out to be and pretty darn slow so many people are going to turn off the “please don’t plagiarise” checkbox.

But even GitHub’s checks are insufficient. The keyword there is “verbatim”, because language models have a tendency to rephrase their output. If GitHub Copilot copies a GPLed implementation of an algorithm into your project but changes all the variable names, Copilot won’t detect it, it’ll still be plagiarism and the copied code is still under the GPL. This isn’t unlikely as this is how language models work. Memorisation and then copying with light rephrasing is what they do.

Training the system only on permissively licensed code doesn’t solve the problem. It won’t force your project to adopt an MIT licence or anything like that, but you can still be sued if it’s discovered.

This would seem to give Microsoft and GitHub a good reason not to train on the Office code base, for example. If they did, there’s a good chance that a prompt to generate DOCX parsing code might “generate” a verbatim copy of the DOCX parsing code from Microsoft Word.

And they can’t have that, can they? This would both undercut their own strategic advantage, and it would break the illusion that these systems are generating novel code from scratch.

This should make it clear that what they’re actually doing is strip-mine the free and open source software ecosystem.

How much of a problem is this?

—It won’t matter. I won’t get caught.

You personally won’t get caught, but your employer might, and Intellectual Property scans or similar code audits tend to come up at the absolute worst moments in the history of any given organisation:

During due diligence for an acquisition. Could cost the company and managers a fortune.
In discovery for an unrelated lawsuit. Again, could cost the company a fortune.
During hacks and other security incidents. Could. Cost. A. Fortune.

“AI” vendors won’t take any responsibility for this risk. I doubt your business insurance covers “automated language model plagiarism” lawsuits.

Language models for software development are a lawsuit waiting to happen.

Unless they are completely reinvented from scratch, language model code generators are, in my opinion, unsuitable for anything except for prototypes and throwaway projects.

So, obviously, everybody’s going to use them

All the potentially bad stuff happens later. Unlikely to affect your bonuses or employment.
It’ll be years before the first licence contamination lawsuits happen.
Most employees will be long gone before anybody realises just how much of a bad idea it was.
But you’ll still get that nice “AI” bump in the stock market.

What all of these problems have in common is that their impact is delayed and most of them will only appear in the form of increased frequency of bugs and other defects and general project chaos.

The biggest issue, licence contamination, will likely take years before it starts to hit the industry, and is likely to be mitigated by virtue of the fact that many of the heaviest users of “AI”-generated code will have folded due to general mismanagement long before anybody cares enough to check their code.

If you were ever wondering if we, as an industry, were capable of coming up with a systemic issue to rival the Y2K bug in scale and stupidity? Well, here you go.

You can start using a language model, get the stock market bump, present the short term increase in volume as productivity, and be long gone before anybody connects the dots between language model use and the jump in defects.

Even if you purposefully tried to come up with a technology that played directly into and magnified the software industry’s dysfunctions you wouldn’t be able to come up with anything as perfectly imperfect as these language models.

It’s nonsense without consequence.

Counterproductive novelty that you can indulge in without harming your career.

It might even do your career some good. Show that you’re embracing the future.

But…

The best is yet to come

In a few years’ time, once the effects of the “AI” bubble finally dissipates…

Somebody’s going to get paid to fix the crap it left behind.

There’s quite a bit of papers that either highlight the tendency to memorise or demonstrate a strong relationship between that tendency and eventual performance.
- An Empirical Study of Memorization in NLP (Zheng & Jiang, ACL 2022)
- Does learning require memorization? a short tale about a long tail. (Feldman, 2020)
- When is memorization of irrelevant training data necessary for high-accuracy learning? (Brown et al. 2021)
- What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation (Feldman & Zhang, 2020)
- Question and Answer Test-Train Overlap in Open-Domain Question Answering Datasets (Lewis et al., EACL 2021)
- Quantifying Memorization Across Neural Language Models (Carlini et al. 2022)
- On Training Sample Memorization: Lessons from Benchmarking Generative Modeling with a Large-scale Competition (Bai et al. 2021)
↩︎
See the Bias & Safety card at needtoknow.fyi for references. ↩︎
See the Shortcut “Reasoning” card at needtoknow.fyi for references. ↩︎
Simon Willison has been covering this issue in a series of blog posts. ↩︎
- The poisoning of ChatGPT
- Google Bard is a glorious reinvention of black-hat SEO spam and keyword-stuffing
↩︎
See, for example:
- Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions (Hammond Pearce et al., December 2021)
- Do Users Write More Insecure Code with AI Assistants? (Neil Perry et al., December 2022)
↩︎
This came out during an investor event and was presented as evidence of the high quality of Copilot’s output. ↩︎
↩︎
Archived link of the GitHub Copilot feature page. ↩︎