Out of the Software Crisis

Google Bard is a glorious reinvention of black-hat SEO spam and keyword-stuffing

By Baldur Bjarnason,

According to former Google researchers, it looks like the Bard chatbot has a glaring keyword manipulation exploit open to any black-hat SEO who wants to try.

In the ancient days, when mammoths prowled, search engines trusted the text in the page

I’m old enough to not only remember what the web was like before Google, I remember what it was like before AltaVista, which was the Google before Google.

For those of you who aren’t internet ancients, in the early days of the web finding things was made easy by the fact that there just weren’t that many websites. You could list most websites in a manageable, human-curated directory. When I made my first website, a collection of essays on comics as literature (what can I say, I’ve always been a nerd), I submitted it to a few online directories and overnight it got traffic. I got my first email from a reader the next morning.

But, the web continued to explode in popularity, and before long the directories weren’t handling the sheer volume. People started looking for search engines—no, demanding search engines. The web wasn’t working without one.

Quite a few companies tried, but the first one to deliver what looked like a solid experience as well as decent quality results was AltaVista. Everybody loved it. Everybody switched to it. For a while.

But it had a fatal flaw: it trusted the text in the page. Not only did it trust the text on the page, it was the primary factor it used in deciding whether the page was relevant to the query or not. A key part of that was the infamous “meta keyword tag”. Developers today know meta tags as a fairly innocuous, if awkward, method for injecting metadata into pages. Services then use it for previews and the like. But back in the day, what you had in the meta keyword tag decided where your page landed in the search engine results.

No, it was all down to the meta tag, so every sleazy marketdroid on the web stuffed theirs. AltaVista’s search results were filled with irrelevant content. Or, no content, when they clamped down on keyword-stuffing, because it turns out that even then, the web was dominated by sleazy marketdroids.

Trusting and prioritising the meta tag was a security vulnerability of sorts, one that Google avoided from the start, and AltaVista only dropped in 2002. As a Search Engine Watch article explained when AltaVista dropped their “support”:

The first major crawler-based search engines to use the meta keywords tag were Infoseek and AltaVista. It’s unclear which one provided support first, but both were offering it in early 1996. When Inktomi launched in mid-1996 through the HotBot search engine, it also provided support for the tag. Lycos did the same in mid-1997, taking support up to four out of the seven major crawlers at the time (Excite, WebCrawler and Northern Light did not provide support).

The ascendancy of the tag did not last after 1997. Experience with the tag has showed it to be a spam magnet. Some web site owners would insert misleading words about their pages or use excessive repetition of words in hopes of tricking the crawlers about relevancy. For this reason, Excite (which also owned WebCrawler) resisted added support. Lycos quietly dropped its support of the tag in 1998, and newer search engines such as Google and FAST never added support at all.

After Infoseek (Go.com) closed in 2000, the meta keywords tag was left with only two major supporters: AltaVista and Inktomi. Now Inktomi remains the only one, with AltaVista having dropped its support in July, the company says.

Why does this matter today? Surely, nobody would be dumb enough to build an information management system that is so utterly, completely open to keyword manipulation?

Well…

Language models are to modern search what the meta tag was to AltaVista

Last week I wrote about The Poisoning of ChatGPT and how researchers had, in recent years, discovered that language models can be poisoned through their training data—both the data used in the initial training and fine-tuning.

The researchers managed to do both keyword manipulation and degrade output with as few as a hundred toxic entries, and they discover that large models are less stable and more vulnerable to poisoning. They also discovered that preventing these attacks is extremely difficult, if not realistically impossible.

Of course, because they are AI researchers and the entire field has fundamental issues with finding accurate names for complex topics, the industry has decided to call these attacks poisonings when most of the poisoning attacks they outline are more properly keyword manipulation exploits.

You know… literally the job description of a black-hat SEO.

Moreover, researchers have also discovered that it’s probably mathematically impossible to secure the training data for a large language model like GPT-4 or PaLM 2. This was outlined in a research paper that Google themselves tried to censor, an act that eventually led the Google-employed author, El Mahdi El Mhamdi, to leave the company. The paper has now been updated to say what the authors wanted it to say all along, and it’s a doozy.

This paper emphasized three characteristics of the data on which LAIMs are trained. Namely, they are mostly user-generated, very high-dimensional and heterogeneous. Unfortunately, the current literature on secure learning, which we reviewed, shows that these features make LAIMs inherently vulnerable to privacy and poisoning attacks. Large AI models are bound to be dangerous. Their rushed deployment, especially at scale, poses a serious threat to justice, public health and to national and international security.

The only realistic way to defend against poisoning is to use stale training data. As soon as you start to include fresh pages in a data set this large, you de facto lose the ability to defend the integrity of the data set and with it the integrity of the language model’s output.

Major language model vendors, such as OpenAI, have decided to sacrifice “freshness” in order to preserve what little integrity their systems have in the first place—remember hallucinations are still an unsolved problem.

Except Google. They have decided that “freshness” is in their corporate DNA. They want to be up-to-date at all costs, so their training data now goes all the way up to February 2023, and I have no doubt they plan on keeping it as “fresh” as possible, with each update likely bring on ever multiplying attempts to manipulate their model.

Google is rushing ahead to “catch up” on AI without paying any attention to the security or integrity of its products, something that its own employees, past and present, have been warning it about.

They are ignoring the acute vulnerability that large language models have with keyword manipulation exploits, making them the modern equivalent of the search engines of the 90s. The only thing that’s different today is that there is now much more money in manipulating search engines than ever before, which makes the vulnerability of large language models a lethal issue for search, research, or information management at scale.

But Google doesn’t care because they want that AI stock price bump. That’s all that matters. They don’t even see how they’re marching down the same road that AltaVista went down twenty-five years ago.

If we’re lucky, Google Bard will flop and none of this will ever become an issue.

But, if we’re really unlucky, then the future of search is LLMs and rampant keyword manipulation.

The best way to support this newsletter or my blog is to buy one of my books, The Intelligence Illusion: a practical guide to the business risks of Generative AI or Out of the Software Crisis.

Join the Newsletter

Subscribe to the Out of the Software Crisis newsletter to get my weekly (at least) essays on how to avoid or get out of software development crises.

Join now and get a free PDF of three bonus essays from Out of the Software Crisis.

    We respect your privacy.

    Unsubscribe at any time.