UK: Guidance on generative AI - the legal basis for scraping data

Generative artificial intelligence (AI) models, that is to say, AI models capable of generating text, images, code, audio, video, and other content as part of their output in response to inputs or prompts, such as OpenAI's ChatGPT and Dall-E, Meta's Llama, and Google's Imagen (accessed via Gemini), require significant volumes of high-quality data in order to train the model and enable it to assimilate the information and refine its output, through an iterative process. Generative AI models do not 'memorize' or recount their training data, per se, but instead learn to predict the appropriate output based on probabilities having regard to patterns in training data.
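The idea of predicting output from probabilities, rather than recounting training data, can be illustrated with a deliberately tiny sketch. The bigram word counter below is an assumption chosen purely for illustration; it is not how ChatGPT, Llama, or any other production model works, and the function names and sample corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probabilities(counts, word):
    """Convert the follow-on counts for `word` into probabilities."""
    followers = counts.get(word, Counter())
    total = sum(followers.values())
    if total == 0:
        return {}
    return {w: n / total for w, n in followers.items()}


# Hypothetical training text, for illustration only.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
probs = next_word_probabilities(model, "the")
```

The trained object holds only counts of patterns, from which a likely next word is estimated; real generative models learn far richer statistical representations, but the principle of pattern-based prediction is similar.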

According to OpenAI, ChatGPT was developed using 'three primary sources of information:' publicly available information on the internet, information licensed from third parties, and information provided by users or human trainers. Meta's Llama 2 was similarly 'pretrained on publicly available online data sources' and trained on '2 trillion tokens,' tokens being the units into which training data is split, such that each word, punctuation mark, or pixel, for example, would constitute a separate token. Both developers state that they either did not intentionally target, or sought to remove from their training data, sources containing high volumes of personal data. Web scraping, that is, the gathering or extraction of data from websites through the use of an automated tool or bot, has legal implications for website operators, developers of AI models, their deployers, and data subjects where the publicly available data scraped includes personal data. Nicola Cain, of Handley Gill Limited, discusses these legal implications for each of the parties affected by web scraping.
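As a rough illustration of what splitting text into tokens means, the sketch below treats each word and each punctuation mark as a separate token, mirroring the simplified description above. This is an assumption made for clarity: production models generally use subword tokenizers, and the `tokenize` function is hypothetical rather than the tokenizer of any named model.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation-mark tokens."""
    # \w+ matches a run of letters, digits, or underscores (a "word");
    # [^\w\s] matches any single character that is neither a word
    # character nor whitespace, i.e. a punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text)


tokens = tokenize("Hello, world!")
token_count = len(tokens)
```

On this simplified scheme, a training set of '2 trillion tokens' corresponds to a somewhat smaller number of words, since punctuation marks are counted separately.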


Website operators

Website operators are likely to be considered data controllers in respect of personal data hosted on their websites. They are therefore responsible not only for ensuring that they have a lawful basis for their processing, pursuant to Article 6 of the UK General Data Protection Regulation (GDPR) and, where the processing involves special categories of personal data or criminal conviction and offense data, Articles 9 and 10 of the UK GDPR respectively, and for providing appropriate transparency information to affected data subjects (Article 12 of the UK GDPR), but also for ensuring Data Protection by Design and Default (Article 25 of the UK GDPR) and implementing appropriate technical and organizational measures to ensure a level of security appropriate to the risk (Article 32(1) of the UK GDPR).

Images on websites and website content may be protected as copyright works, comprising artistic or literary works respectively, and their copying or adaptation is therefore a restricted act which, if unauthorized, would infringe copyright contrary to Sections 17 and 21 of the Copyright, Designs and Patents Act 1988 respectively.

Many website operators will include provisions in their website terms and conditions that seek to prohibit web scraping of data, which may be enforceable as a binding contract between the website operator and the user.

While these legal restrictions may, where effective, be relevant measures for achieving security and Data Protection by Design and Default, they may not be sufficient on their own to meet those obligations.

In August 2023, the Information Commissioner's Office (ICO) signed a 'Joint statement on data scraping and the protection of privacy' with data protection regulators from around the globe. The statement warned website operators, and social media companies in particular, that 'mass data scraping of personal information can constitute a reportable data breach in many jurisdictions,' and they were therefore obliged to 'implement measures to protect against unlawful data scraping' in the form of 'multi-layered technical and procedural controls.' Suggested measures include imposing rate limits, identifying and blocking bots, taking legal action in respect of unlawful scraping, and alerting users to the measures undertaken.
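Two of the suggested measures, rate limits and bot identification, can be sketched in code. The example below is a minimal illustration under stated assumptions, not a control recommended by the ICO or the joint statement: the class name, the thresholds, and the list of user-agent markers are all hypothetical, and user-agent strings are easily spoofed, so a real deployment would layer further controls on top.

```python
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `max_requests` per client within `window_seconds`."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # client id -> request timestamps

    def allow(self, client_id, now=None):
        """Return True if the request is within the limit, else False."""
        now = time.monotonic() if now is None else now
        hits = self._hits[client_id]
        # Discard timestamps that have fallen outside the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical markers; real bot detection relies on many more signals.
BOT_MARKERS = ("bot", "crawler", "spider", "scrapy", "python-requests")


def looks_like_bot(user_agent):
    """Flag empty user agents and those containing a known marker."""
    ua = (user_agent or "").lower()
    return not ua or any(marker in ua for marker in BOT_MARKERS)


limiter = SlidingWindowRateLimiter(max_requests=2, window_seconds=10.0)
decisions = [limiter.allow("198.51.100.7", now=t) for t in (0.0, 1.0, 2.0)]
```

In this sketch the third request inside the ten-second window is refused, while requests from a different client identifier are counted separately. Keying the limiter on an IP address is itself an assumption; an operator might instead key on an account or session identifier.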

AI developers

Where generative AI model developers are established outside of the UK but undertake processing activities in relation to the personal data of individuals in the UK, Article 3(2) of the UK GDPR establishes that they will be subject to the UK GDPR where the processing relates to the offering of goods or services to individuals in the UK or the monitoring of the behavior of such individuals.

AI developers are likely to be data controllers with respect to personal data obtained through web scraping for the purpose of training their generative AI models. As such, their processing of personal data must be fair, lawful, and transparent to comply with Article 5(1)(a) of the UK GDPR.

This obligation does not relate merely to compliance with the requirements of the UK GDPR or wider data protection legislation, in the sense of having a valid lawful basis for processing, but requires that the use of the data does not breach any other laws.

If and to the extent that web scraping of personal data infringes copyright laws or is in breach of a contract, then it will fail to satisfy the requirements of Article 5(1)(a) of the UK GDPR.

The UK House of Lords' Communications and Digital Committee's report 'Large Language Models and Generative AI,' which was published in February 2024, cited evidence submitted by OpenAI that it 'respect[ed] the rights of content creators and owners' but that it was 'impossible to train today's leading AI models without using copyrighted materials.' Other AI developers, including Meta, Stability AI, and Microsoft, also raised concerns about restricting access to training data, and all argued that their practices in web scraping publicly accessible data and using it for training generative AI models did not - or should not - infringe national copyright laws. This contradicted evidence given by rightsholders. The Committee called upon the Government to set out 'its view on whether copyright law provides sufficient protections to rightsholders, given recent advances in LLMs' and, if this revealed uncertainty, to 'set out options for updating legislation to ensure copyright principles remain future proof and technologically neutral.' In its response to the report, the Government stated that 'this is a complex and challenging area, and the interpretation of copyright law and its application to AI models is disputed; both in the UK and internationally,' and that it did not wish to interfere in ongoing litigation.

Section 170 of the Data Protection Act 2018 creates a criminal offense of knowingly or recklessly obtaining personal data without the consent of the relevant data controller, but it is a defense if the obtaining was justified as being in the public interest.

The mere fact that personal data is publicly accessible does not exclude it from protections under the UK GDPR. This is the same position as under the EU GDPR.

Nonetheless, the ICO's emerging thinking, as set out in its draft guidance on generative AI and data protection, 'Chapter 1: the lawful basis for web scraping to train generative AI models,' suggests that it considers that AI developers could satisfy the lawful basis under Article 6(1)(f) of the UK GDPR that processing 'is necessary for the purposes of the legitimate interests pursued by the controller or by a third party,' having regard to their business interests in developing and commercializing generative AI models, as well as the wider public interest in specific deployments of such technological developments. As to the requirement of necessity, the ICO accepts the position expressed by AI developers that 'most generative AI training is only possible using the volume of data obtained through large-scale scraping.'

By contrast, guidance recently issued by the Dutch supervisory authority (available only in Dutch) suggests that the acquisition of personal data through web scraping, including for the purposes of training generative AI models, will usually violate the GDPR.

Generative AI developers will need to conduct a Data Protection Impact Assessment (DPIA) in relation to the processing of personal data in the context of the development, training, and use of their models, and a legitimate interests assessment (LIA), regardless of whether the model is intended to be made available for public use or deployed internally. The DPIA will need to address challenges, including those regarding the transparency of processing personal data and implications for the exercise of data subject rights, the expectations of data subjects, and the transfer of personal data to other jurisdictions.

There is a risk that AI developers engaged in web scraping will scrape special categories of personal data (i.e., personal data revealing racial or ethnic origin, political opinions, religious or philosophical beliefs, or trade union membership, genetic data, biometric data processed for the purpose of uniquely identifying a natural person, and data concerning health or a natural person's sex life or sexual orientation) or personal data pertaining to children or other vulnerable individuals, which attract additional protections. While Article 9(2)(e) of the UK GDPR permits the processing of special categories of personal data where the data has been manifestly made public by the data subject, that condition will not be met where the publicly accessible nature of the data results from an unauthorized third-party publication or the failure of the website operator to deploy sufficient protections.

AI deployers

Deployers of generative AI models are likely to be data controllers, and potentially joint controllers with generative AI developers, of generative AI outputs.

Where generative AI models are trained on web-scraped personal data, to the extent that the scraping was unlawful, this can undermine the deployer's lawful basis for processing. That may be because of the implications for the fairness of processing if the personal data is considered to have been obtained without consent, or because deployers could be engaged in secondary copyright infringement in relation to any infringing copy produced in the generative AI output, contrary to Sections 22-23 of the Copyright, Designs and Patents Act 1988.

Data subjects

Unless a data subject's personal data is included in the output of a generative AI model and the individual becomes aware of that, data subjects are unlikely to be made aware of the potential processing of their personal data in the context of the training of such models.

Data subjects are entitled to submit speculative data subject access requests in accordance with Article 15 of the UK GDPR to generative AI developers subject to the UK GDPR, and may be able to enforce their rights, including the right to object to processing under Article 21 of the UK GDPR and/or the right to erasure of personal data in accordance with Article 17 of the UK GDPR. In relation to website operators from which personal data is scraped, data subjects could bring complaints and claims regarding any failure to deploy appropriate protections.

Nicola Cain, Founder
Handley Gill Limited, London