The truth about the EU AI Act and foundation models, or why you should not rely on ChatGPT summaries for important texts

For those of you who are all “just the facts”

The EU AI Act is still subject to change, although most people don’t expect too many major changes at this point. Any of this could change.

Also, IANAL. I don’t even play one on TV.

Safe-harbour provisions for service provider liability are unaffected by the EU’s AI Act. Hosting rules are unaffected.
Developers (not deployers) of foundation models need to register their models, with documentation, prior to making them available on the market or as a service.
Foundation models need to come with documentation about their training data set and pass a number of to-be-implemented standardised benchmarks that examine the suitability of the data they use in terms of biases and other factors.
The developers of a foundation model are responsible for compliance, not the deployers.
Providers of Generative AI systems are required to document and publish detailed summaries of the copyright-protected training data they used, as a part of the registration process.
The Act is clearly designed to benefit AI research through increased transparency and documentation.
It bans a bunch of things that shouldn’t have been allowed in the first place.
If you take a foundation model, fine-tune it for a specialised purpose, and deploy it as a part of your software, it won’t count as a foundation model, and you’ll probably be fine, as long as the original provider of the foundation model was compliant.
If you’re using a foundation model over an API to add a specialised feature to your software, then you’ll probably be fine, as long as the original developer was compliant.

The AI Act covers a lot. It covers the use of AI for biometric identification, high-risk systems whose intended purpose involves people’s health and safety (or life and liberty), foundation models, generative AI, and your run-of-the-mill AI/ML software. It’s also painfully aware that these are early days and that regulators need to be flexible.

The focus of this essay is just foundation models and generative AI, and even with that narrow focus it’s already much too long.

The AI industry is having a temper tantrum

If you’ve been paying attention to tech social media over the past few days, you’ll have seen the outcry about the EU’s proposed AI Act.

The act isn’t final. It’s still subject to negotiation between various parts of the EU infrastructure and how it gets implemented can also change its effect in substantial ways.

That isn’t preventing the US tech industry from panicking. In a blog post that was later popularised by a noted tech commentator, AI enthusiasts have claimed that the EU is doing several very bad, double-plus ungood things and, with it, we Europeans are dooming ourselves to something or the other:

~~They’re banning open source AI models!~~
~~It’ll be illegal to host AI models or code!~~
~~They’re banning AI models accessed via an API.~~
~~They’re banning fine-tuning of foundation models!~~

I’ve struck out the statements in the list above because, unfortunately for those who like a good panic, none of them seem to be true. With the act and the recent actions by GDPR regulators, the EU has joined AI ethicists such as Emily M. Bender, Timnit Gebru, and others on the tech industry’s Enemies of AI list.

The crimes of the ethicists, according to tech:

A refusal to believe in an unfounded expectation of endless exponential growth.
An insistence that models be evaluated based on genuine, not imagined, functionality.
The clearly irrational belief that AI development should be transparent, sustainable, and avoid harming the societies we live in.

The EU’s crimes:

A hatred of innovation and the future.
An insistence on legislating themselves into the stone age.
A completely irrational disbelief in the wonders provided so generously by the glorious, kind, and all-around awesome people in the tech industry.

Or, something.

It’s hard to keep track of industry and investor consensus now that bubble mania has set in, especially since quite a few of them are so helpfully using ChatGPT to generate fact-free incoherence for them.

(Imagine a meme of a greying scruffy dog turning its head to one side and going “roo?”. That’s me trying to parse some of the social media posts coming from AI fans. Most of it’s just “what?”)

I’m going to ignore the tantrums and instead have a look, for myself, at what the current proposal for the Act says. For this I’m using the consolidated PDF document of the amended act as published by the European Parliament as a reference.

Scope and service provider liability

Right at the outset, in Article 2: Scope, the act makes it clear that it doesn’t intend to override existing safe-harbour laws for service providers:

5. This Regulation shall not affect the application of the provisions on the liability of intermediary service providers set out in Chapter II, Section IV of Directive 2000/31/EC of the European Parliament and of the Council [as to be replaced by the corresponding provisions of the Digital Services Act].

5b. This Regulation is without prejudice to the rules laid down by other Union legal acts related to consumer protection and product safety.

5c. This Regulation shall not preclude Member States or the Union from maintaining or introducing laws, regulations or administrative provisions which are more favourable to workers in terms of protecting their rights in respect of the use of AI systems by employers, or to encourage or allow the application of collective agreements which are more favourable to workers.

5d. This Regulation shall not apply to research, testing and development activities regarding an AI system prior to this system being placed on the market or put into service, provided that these activities are conducted respecting fundamental rights and the applicable Union law. The testing in real world conditions shall not be covered by this exemption. The Commission is empowered to may adopt delegated acts in accordance with Article 73 to specify this exemption to prevent its existing and potential abuse. The AI Office shall provide guidance on the governance of research and development pursuant to Article 56, also aiming at coordinating its application by the national supervisory authorities.

5d. This Regulation shall not apply to AI components provided under free and opensource licences except to the extent they are placed on the market or put into service by a provider as part of a high-risk AI system or of an AI system that falls under Title II or IV. This exemption shall not apply to foundation models as defined in Art 3.

The first and most important part here is clause 5.

“Chapter II, Section IV of Directive 2000/31/EC” is the EU’s version of Section 230 that governs “liability of intermediary service providers”. It covers hosting, “mere conduit” providers, caching, and forbids member states from imposing a general obligation to monitor on service providers. The AI Act specifically says that it does not affect the liability of intermediary service providers.

This means that, yes, GitHub and other code repositories are still allowed to host AI model code. Hosting providers don’t have any additional liability under the AI Act, only the providers of the models themselves and those who deploy them.

Existing rules about hosting still apply. Same as it’s been for the past twenty-three years.

Clauses 5d are probably the source of some of the tech industry’s confusion and anger. I’m guessing they interpret (or ChatGPT interpreted for them) the “this exemption shall not apply to foundation models” as applying to all the clauses from 5 to 5d, so they assume that none of those exceptions apply to foundation models, which would mean that the safe-harbour provision is indeed overridden.

That interpretation makes no sense because it would mean that clauses 5b and 5c would also get dropped.

5c in particular is about the EU reserving the right of member states to introduce further laws to protect labour from employers abusing AI software.

I can guarantee you that the Act isn’t intended to prevent the EU from making further legislation on foundation models.

The EU is also quite fond of its consumer protection laws and wouldn’t give foundation models a pass on those.

This means that interpreting “shall not apply to foundation models” as applying to all the exceptions is almost certainly nonsense.

There’s also a chance that people in the tech industry think that Article 10, which sets out strict data governance rules, applies to foundation models, but that article is in Chapter 2: Requirements for high-risk AI systems.

The act makes it clear that “foundation” and “high-risk” are two distinct categories with separate obligations, and that Articles 8-15 apply to high-risk systems, not foundation models (p. 143).

For high-risk AI systems, the general principles are translated into and complied with by providers or deployers by means of the requirements set out in Articles 8 to 15, and respective obligations laid down in Chapter 3 of Title III of this Regulation. For foundation models, the general principles are translated into and complied with by providers by means of the requirements set out in Articles 28 to 28b.

And from page 29:

These specific requirements and obligations do not amount to considering foundation models as high risk AI systems.

What 5d means is that the pre-release development of foundation models has to follow the rules set out in the regulation on foundation models.

That would seem to mean that there are requirements for foundation models that you need to follow during model training in addition to those that come into effect once you put it into service, which is when the regulation kicks in for other AI models.

That very much isn’t a ban of any kind, but maybe the rules and requirements are onerous? Maybe that’s why the panic?

But first, what do the words mean?

We need to find out what the EU AI Act means by terms like “foundation model”, “provider”, and “deployer”.

From Article 3:

(1c) ‘foundation model’ means an AI model that is trained on broad data at scale, is designed for generality of output, and can be adapted to a wide range of distinctive tasks;

That seems to match the industry’s definition of the term. You could quibble that this is a bad way of describing these models in the first place, but that’s generally not a debate that EU regulators are going to get involved in. As far as I can tell, they usually prefer to reuse industry terms, possibly with a little more specificity, when they can.

Also, from page 28:

Pretrained models developed for a narrower, less general, more limited set of applications that cannot be adapted for a wide range of tasks such as simple multipurpose AI systems should not be considered foundation models for the purposes of this Regulation, because of their greater interpretability which makes their behaviour less unpredictable.

That lets many fine-tuned models off the hook.

Back to Article 3:

(23) ‘substantial modification’ means a modification or a series of modifications of the AI system after its placing on the market or putting into service which is not foreseen or planned in the initial risk assessment by the provider and as a result of which the compliance of the AI system with the requirements set out in Title III, Chapter 2 of this Regulation is affected or results in a modification to the intended purpose for which the AI system has been assessed

This is an important note because model types that need to be registered (high-risk and foundation) also need to be re-registered after every substantial modification, which some have interpreted as a ban on a variety of approaches to ongoing model improvement. The definition makes it clear that methods for ongoing fine-tuning or learning do not force you to re-register the model, as long as those modifications were foreseen in the initial risk assessment. The same thing applies to security updates and modifications geared towards the ongoing mitigation of misuse and bias.

If you’re familiar with semantic versioning, you probably only need to register major versions.
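To make my reading of that definition concrete, here’s a minimal Python sketch. It is only an illustration of how I parse clause (23), not legal guidance: the `Modification` class, its field names, and `is_substantial` are all hypothetical names I made up, and the logic assumes the clause reads as “not foreseen AND (compliance affected OR intended purpose changed)”.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Modification:
    """One change made to a model after it was placed on the market.

    All fields are hypothetical labels for the conditions in clause (23).
    """
    description: str
    foreseen_in_risk_assessment: bool
    affects_compliance: bool
    changes_intended_purpose: bool


def is_substantial(mod: Modification) -> bool:
    """Assumed reading: a change is 'substantial' only if it was not foreseen
    AND it either affects compliance or changes the intended purpose."""
    if mod.foreseen_in_risk_assessment:
        return False
    return mod.affects_compliance or mod.changes_intended_purpose


# Planned, ongoing fine-tuning: foreseen, so not substantial under this reading.
planned_tuning = Modification("scheduled fine-tuning run", True, True, False)

# Unplanned repurposing: not foreseen and changes the intended purpose.
repurposing = Modification("repurposed for a new use case", False, False, True)

for mod in (planned_tuning, repurposing):
    print(mod.description, "->", is_substantial(mod))
```

Under this reading, the “major version” in the semantic versioning analogy is any change for which `is_substantial` comes out true.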

(2) ‘provider’ means a natural or legal person, public authority, agency or other body that develops an AI system or that has an AI system developed with a view to placing it on the market or putting it into service under its own name or trademark, whether for payment or free of charge;

“Provider” seems to mean whichever legal entity is developing an AI system, which doesn’t necessarily have to be the same entity as the one who deploys it. “Placing on the market” in this context means the EU market. You can alpha- or beta-test non-foundation models on US customers as much as you like and the EU won’t care.

(4) ‘deployer’ means any natural or legal person, public authority, agency or other body using an AI system under its authority, except where the AI system is used in the course of a personal non-professional activity.

Most of the requirements the EU sets are on providers, not deployers. If the foundation model is compliant and registered, then the organisations who deploy and use it should be fine.

The rules everybody has to follow

The act sets out general principles all AI models should follow—that providers should “make their best efforts” to follow.

They all seem innocuous (from p. 143):

“AI systems shall be developed and used as a tool that serves people, respects human dignity and personal autonomy, and that is functioning in a way that can be appropriately controlled and overseen by humans.”
“AI systems shall be developed and used in a way to minimize unintended and unexpected harm as well as being robust in case of unintended problems and being resilient against attempts to alter the use or performance of the AI system so as to allow unlawful use by malicious third parties.”
“AI systems shall be developed and used in compliance with existing privacy and data protection rules, while processing data that meets high standards in terms of quality and integrity.”
“AI systems shall be developed and used in a way that allows appropriate traceability and explainability, while making humans aware that they communicate or interact with an AI system as well as duly informing users of the capabilities and limitations of that AI system and affected persons about their rights.”
“AI systems shall be developed and used in a way that includes diverse actors and promotes equal access, gender equality and cultural diversity, while avoiding discriminatory impacts and unfair biases that are prohibited by Union or national law.”
“AI systems shall be developed and used in a sustainable and environmentally friendly manner as well as in a way to benefit all human beings, while monitoring and assessing the long-term impacts on the individual, society and democracy.”

Maybe I’m wrong, but none of this looks like world-ending stuff, and most of it is fairly close to what you’re seeing regulators in other territories talk about. At least the ones that haven’t fallen for the AGI sci-fi nonsense the industry is peddling.

The notable bit is the requirement that users should be properly informed when they’re interacting with an AI system. This comes up again in Article 52: Transparency obligations for certain AI systems and is repeated in the foundation model requirements, which would seem to indicate that EU regulators consider informed consent by the end-user to be rather quite important.

The word “appropriate” is used in the other two clauses that are genuinely AI-specific. What counts as appropriate is going to be implementation-specific, based largely on researcher and industry feedback, which is likely to make those clauses pretty close to toothless in practice. The rest is vague enough to boil down to “please follow existing regulations and laws, even though you think AI should be exempt because it’s so cool”.

You also have a list of prohibited practices that are set out in Article 5. Those boil down to:

Subliminal manipulation or intentionally distorting human behaviour in a material way that’s likely to cause harm.
Phrenology-style applications along the lines of “does a person with this skull shape do crime?”, which have been popular with law enforcement. Say goodbye to “this AI detects homosexuality” or “this AI detects sociopathy” kinds of pseudoscientific nonsense systems.
Specifically targeting vulnerable sections of the population.
Social credit scoring.
Real-time biometric identification of people in public spaces.
Untargeted scraping of facial images for the purposes of expanding facial recognition databases.

None of those seem to apply to foundation or generative models, but you can already tell why the tech industry hates this proposed act. This is like a list of all their favourite things. Banning phrenology is to AI industry investors about as evil as shoving kittens into a meat grinder.

To sleazy VC types it’s like setting a law that bans puppy dogs and rainbows.

The foundation model requirements

At last, we’re getting to the proper stuff. Foundation models, what’s it all about?

The requirements specific to the providers of foundation models are outlined in Article 28b on pages 39–41 of the document linked to by the European Parliament news item.

The first requirement is just basic risk assessment and mitigation. “Demonstrate through appropriate design, testing and analysis that the identification, the reduction and mitigation of reasonably foreseeable risks to health, safety, fundamental rights, the environment and democracy and the rule of law prior and throughout development with appropriate methods such as with the involvement of independent experts, as well as the documentation of remaining non-mitigable risks after development.”
“Process and incorporate only datasets that are subject to appropriate data governance measures for foundation models, in particular measures to examine the suitability of the data sources and possible biases and appropriate mitigation;”
“Design and develop the foundation model in order to achieve throughout its lifecycle appropriate levels of performance, predictability, interpretability, corrigibility, safety and cybersecurity.”
“Design and develop the foundation model, making use of applicable standards to reduce energy use, resource use and waste, as well as to increase energy efficiency, and the overall efficiency of the system.”
“Draw up extensive technical documentation and intelligible instructions for use in order to enable the downstream providers to comply with their obligations.”
“Establish a quality management system to ensure and document compliance with this Article, with the possibility to experiment in fulfilling this requirement”
“Register that foundation model in the EU database referred to in Article 60.”
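If you wanted to track these seven obligations inside a project, a plain checklist would go a long way. The following is a minimal Python sketch under loud assumptions: the short item labels, the `ComplianceChecklist` class, and the evidence file paths are all hypothetical names of mine, not anything defined by the Act or by an existing tool.

```python
from dataclasses import dataclass, field

# Hypothetical short labels for the seven Article 28b obligations quoted above.
ARTICLE_28B_ITEMS = (
    "risk_assessment_and_mitigation",
    "data_governance",
    "performance_and_safety_levels",
    "energy_and_resource_efficiency",
    "technical_documentation",
    "quality_management_system",
    "eu_database_registration",
)


@dataclass
class ComplianceChecklist:
    """Maps each obligation label to a pointer at the supporting evidence."""
    model_name: str
    evidence: dict[str, str] = field(default_factory=dict)

    def record(self, item: str, reference: str) -> None:
        """Attach an evidence reference (e.g. a document path) to one item."""
        if item not in ARTICLE_28B_ITEMS:
            raise KeyError(f"unknown checklist item: {item}")
        self.evidence[item] = reference

    def missing(self) -> list[str]:
        """Return the obligations that have no evidence recorded yet."""
        return [item for item in ARTICLE_28B_ITEMS if item not in self.evidence]


# Example usage with made-up evidence locations.
checklist = ComplianceChecklist("example-model")
checklist.record("data_governance", "docs/datasheet.md")
checklist.record("technical_documentation", "docs/model_card.md")
print(checklist.missing())
```

The point of the sketch is only that the list is short and enumerable, which is part of why the documentation work described later in this essay can be partially automated.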

If I’m to be brutally honest, except for registration, the requirements above are basically what you were supposed to do when selling a large-scale machine learning system to an enterprise or institution a short while ago, before bubble-mania kicked in. It’s the sort of stuff you should be doing anyway.

The only difference here is that OpenAI, Microsoft, and Google now all think it’s a strategic advantage to keep all of it secret, even though that secrecy directly threatens the viability of AI research and cripples the ability of their customers to assess and plan around the limitations of their products.

Forcing AI vendors to publish this information is an obvious benefit to all of us, even them, because their AI systems are the product of AI research and the secrecy they are currently employing risks turning the entire field into a dead end and triggering another “AI winter”.

More importantly, forcing them to gather this data and documentation and make it available to others is only going to be a benefit for the AI industry in general, in the long term.

Just look at how quickly open source developers managed to replicate the approaches and strategies from Facebook’s LLaMA model. This would be like that, just on steroids.

This is exciting, not scary.

The act also has a requirement where the EU AI office, in collaboration with international partners, has to “develop cost-effective guidance and capabilities to measure and benchmark aspects of AI systems and AI components, and notably of foundation models relevant to the compliance and enforcement of this Regulation based on the generally acknowledged state of the art” (p. 92).

This puts the onus on the EU to provide straightforward benchmarks—most likely based on existing benchmarks in AI research (see “generally acknowledged state of the art”) or organisations like Hugging Face—that providers can use when developing foundation models.

Given that most providers of existing foundation models are already using benchmarks to guide their development work, and that they’ll almost certainly have a say in the development of these benchmarks, this doesn’t seem that problematic.

In fact, you could argue that it’s much too lax considering the misbehaviour of the tech industry over the past decade.

AI ethicists to the rescue

The AI industry loves to hate AI ethicists.

The doomers have a purpose: they are—in effect—constantly talking up the capabilities of these systems and the “geniuses” who make them.

But, ethicists? Focusing on existing, not hypothetical, harms? Insistent on talking about models in terms of their genuine, not imagined, capabilities? Transparency? Consent?

Ugh.

Clearly, these are some very bad people who just hate technology.

What the AI industry and hangers-on are missing is that AI ethicists are possibly the most constructive force in the field of AI research today.

Anybody who is trying to stop you from getting behind the wheel of a car when you’re drunk is your friend, not your enemy.

The industry today is vastly over-promising on the capabilities of their AI systems. They are shipping them without any meaningful safeguards or acknowledgement of how they’re harming our digital commons, creative industries, minorities, or how they are the perfect tool for misinformation at scale.

The risks are enormous and directly threaten the AI vendors themselves. Universal misinformation and a collapsed digital commons are an existential threat to a search engine. The creative industries are some of the biggest software customers around, and replacing million dollar customers with twenty dollar customers is just bad business. Language and diffusion model abuses harm tech companies just as much in the long term as they do the rest of us.

The people warning you to not make these mistakes are your allies. They’re fighting in your corner, but you keep punching them in the back.

What’s more, they’re likely to save you—or at least open source models—from the EU AI Act by making compliance as good as automatic.

The documentation requirements might seem onerous, but documentation and transparency are also a hard requirement for the advancement of AI research in general, so it shouldn’t come as a surprise that a lot of work has already been done.

AI researchers, ethicists, and Hugging Face in particular have accomplished a lot, with more no doubt on the way.

Their work includes:

Margaret Mitchell with Model Cards for Model Reporting implemented as Model Cards at Hugging Face.
Emily M. Bender and Batya Friedman with Data Statements for Natural Language Processing.
Timnit Gebru et al. with Datasheets for Datasets.
Julia Stoyanovich and Bill Howe with Nutritional Labels for Data and Models
Kasia S. Chmielinski, Sarah Newman, et al. with The Dataset Nutrition Label (2nd Gen): Leveraging Context to Mitigate Harms in Artificial Intelligence
Hugging Face have shipped a tool with the goal of at least partially automating EU AI Act compliance checks: Model Card Regulatory Check or RegCheck AI.
Hugging Face, in particular, seems to have broadly good intentions: What does ethical AI look like?

AI researchers have been preparing for this for years because, as I said, documentation and transparency are essential for the field to progress.

It’s OpenAI, Microsoft, and Google who are holding the industry back with their secrecy and risk-taking.

It seems more likely than not that major open source language models will be broadly compliant with the EU AI Act well before the act takes effect. Researchers are setting standards and processes for gathering and presenting the documentation and the ethics team at Hugging Face seems to be putting it into practice.

Seriously, in terms of reducing the cost of regulatory compliance for the industry, and in terms of broadly increasing quality through assessment of bias and functionality, the value that AI ethicists are creating for the industry is enormous. Open source models likely won’t be viable for serious use without them.

All the industry accomplishes by demonising this group is to increase its future liabilities and reduce the long-term value of its products.

But, I saved the best for last

There is one additional set of requirements for generative foundation models (p. 41). They need to:

Comply with transparency requirements, making sure that generative output is correctly labelled, similar to what Adobe is already planning to do with their generative image output.
Prevent content that’s illegal in the EU, such as child abuse imagery.
Make available a “detailed summary of the use of training data protected under copyright law.”

This is where most existing proprietary foundation models would end up getting banned in the EU.

The labelling requirement is fairly straightforward and is a requirement that’s likely to be echoed in many other jurisdictions, anyway.

The illegal content requirements aren’t that much of an issue as most existing providers try to prevent that kind of output, anyway.

Open source models will be just fine on the “detailed summary” front as their training data set isn’t a secret.

But GPT-4 and PaLM? Yeah, the third requirement is where they both get stomped. Not because they can’t. It’s highly likely that Google and Microsoft have more than enough documentation to provide a detailed summary of the copyright-protected training data they used. If they don’t, then they are incompetent and should get stomped hard, then investigated and fined.

They probably have that documentation somewhere. They just really don’t want to publish it because it’s almost certainly all copyright-protected material. Whatever public domain or freely licensed data they’ve used is only going to be a small part of the big models.

The data belongs to others, many of whom also happen to be directly threatened by OpenAI and Google introducing generative AI, or at the very least will want their cut of the AI bubble pie.

That’s a recipe for an avalanche of major lawsuits from big copyright-holding corporations.

That’s why you’re going to hear a lot of scaremongering about the EU AI Act from all over the tech industry.
