EU: Data scraping – navigating the challenges old and new
In this Insight article, Colin Lambertus and Neil Williamson, from EM Law, delve into the complexities and legal implications of data scraping, a practice gaining renewed attention in the age of artificial intelligence (AI) and widespread web-based information. They explore the interplay between data protection under the General Data Protection Regulation (GDPR), intellectual property rights, and recent legislative and case law developments.
Data scraping has re-entered the public consciousness along with the flood of AI models recently made available. Training AI requires substantial data volumes, and one of the many methods to obtain that data is to web scrape. Web scraping involves the use of automated processes to access and extract available internet data, regardless of its presentation format. In essence, web scraping tools serve as hyper-fast versions of another popular tool among lawyers - the copy-and-paste function.
The current use of web scraping tools in the UK and the EU is primarily influenced by legislation, including the GDPR, case law, and regulatory developments around data protection. In addition, there are the respective rights of parties to safeguard or capitalize on intellectual property to consider.
The GDPR does not explicitly address the legality of web scraping.
Instead, its impact on web scraping is akin to the processing of any personal data collected by a data controller or processor through any other means.
The familiar GDPR rules therefore apply: the organization must have a lawful basis to process the personal data collected through web scraping, and must establish the required technical and organizational safeguards for securing and managing that personal data, all while adhering to the data protection principles outlined in Article 5 of the GDPR.
However, does this imply that web scrapers can feel at ease? Not entirely. There are specific data protection challenges and, indeed, heightened risks that come into focus when an organization engages in web scraping.
Most will be familiar with Article 13 of the GDPR, which mandates the controller to inform data subjects of specific aspects of the processing of their personal data. This is the standard suite of information contained in most privacy policies linked to in website footers.
However, a crucial point, particularly relevant to web scraping, is that Article 13 of the GDPR applies when personal data has been collected directly from the data subject.
Where personal data is collected indirectly, as is the case with web scraping, Article 14 of the GDPR comes into play. Article 14 stipulates that the information outlined in Article 13 of the GDPR must be provided to the data subject at a later point. There are three scenarios:
- within a reasonable period after obtaining the personal data, but no later than one month;
- if the personal data is intended for communication with the data subject, at the latest during the first communication with that data subject; or
- if disclosure to another recipient is planned, at the latest when the personal data is first disclosed.
Hence, the absolute deadline is within one month.
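The three scenarios above can be read as "whichever limit falls first." A minimal Python sketch of that reading follows. The function names are hypothetical, and treating "one month" as one calendar month (clamped to the end of shorter months) is an assumption for illustration, not a statement of how any regulator computes the period.

```python
import calendar
from datetime import date
from typing import Optional


def add_one_month(d: date) -> date:
    """Return the same day in the next calendar month, clamped to month end.

    Assumption: "one month" is read as one calendar month.
    """
    year = d.year + (1 if d.month == 12 else 0)
    month = 1 if d.month == 12 else d.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def article_14_deadline(
    obtained: date,
    first_communication: Optional[date] = None,
    first_disclosure: Optional[date] = None,
) -> date:
    """Latest date for informing the data subject: the earliest applicable limit."""
    limits = [add_one_month(obtained)]  # outer limit in every scenario
    if first_communication is not None:
        limits.append(first_communication)  # data used to contact the data subject
    if first_disclosure is not None:
        limits.append(first_disclosure)  # data disclosed to another recipient
    return min(limits)
```

For example, data scraped on 1 March and disclosed to a client on 10 March would need to be notified by 10 March under this reading, not by 1 April.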
However, there are exemptions, as detailed in Article 14(5)(b) of the GDPR: the necessary information does not need to be provided if doing so "proves impossible or would involve a disproportionate effort."
Organizations engaged in web scraping are therefore expected to comply with Article 14 of the GDPR. This may be achievable, but if web scrapers are processing the personal data of thousands of data subjects (or more), it becomes difficult.
For web scrapers collecting personal data, the most immediate task is to ensure that the necessary information about the organization's data processing activities is published on its website or in another publicly accessible format.
Subsequently, the organization must decide whether to contact the data subjects or rely on an exemption.
Notably, in 2019, the Polish data protection authority (PDPA) imposed a significant fine on a Swedish web scraper that had been extracting personal data from official sources, involving approximately 7.5 million data subjects. Among these, around 600,000 data subjects had available email addresses, while 200,000 mobile numbers were being processed, and only a postal address was available for the rest. The Swedish web scraper opted to contact only those data subjects with email addresses, citing the high cost (millions of euros) associated with contacting the others. Therefore, it invoked the 'disproportionate effort' exemption.
The PDPA disagreed, asserting that merely placing the necessary information on a company website is not enough to meet the Article 14 GDPR notification obligation. Additionally, it found that reaching out via telephone or postal mail to the remaining data subjects did not constitute a disproportionate effort, even though it would have cost millions of euros to do so. Consequently, a fine of €220,000 was issued. This decision was appealed, and while the fine was recalculated, the PDPA's decision-making process was upheld.
This decision may not represent the stance of other regulators in the UK/EU, but it provides a useful indicator of the potential regulatory response if a web scraper were brought before a regulatory authority.
If a web scraper wishes to rely on the impossibility or disproportionate effort exemption, the UK Information Commissioner's Office (ICO) has made it clear that the scraper will need to make a documented assessment of its reasoning.
The ICO has provided helpful guidance in this respect:
- Impossibility: this exemption is not easily invoked. It will likely only apply if the organization lacks any contact details of the data subject and has "no reasonable means to obtain them." This point is highly important. An organization cannot rely on the impossibility exemption if it simply takes no action to ascertain whether a data subject's details could be obtained.
- Disproportionate effort: this exemption involves a balancing exercise. It requires weighing the effort required to contact individuals against the potential 'effect' that the processing will have on them. Therefore, web scraping activities that are non-intrusive and of a light-touch nature might make it easier for organizations to justify their use of this exemption.
Invisible processing and DPIAs
The practice of web scraping without notifying data subjects is a form of 'invisible processing.'
In the UK, following Article 35(4) of the GDPR, the ICO is required to publish a list of processing methods that will require a Data Protection Impact Assessment (DPIA). A DPIA is a rigorous analysis carried out by the controller to assess the potential harm to data subjects, and the ways in which an organization will mitigate that harm to an appropriate level.
Where an organization is relying on the impossibility or disproportionate effort exemption for its web scraping activities and these activities also involve one of the 'high-risk' indicators outlined by the Article 29 Working Party (e.g., large-scale processing, monitoring, automated decision-making, etc.), the organization must carry out a DPIA. Even if invisible processing is not combined with a high-risk activity, the ICO's guidance is that a DPIA should still be performed.
It is important to consider jurisdictional requirements in this context. The Irish data protection regulator's list under Article 35(4) of the GDPR also refers to invisible processing, but it does not require invisible processing to be combined with a high-risk factor before the organization is obliged to conduct a DPIA.
Therefore, for a web scraper, it is typically considered a good practice to carry out a DPIA. A DPIA will assist an organization in demonstrating that the processing is fair, and the analysis of the 'impossibility' or 'disproportionate effort' exemption can be seamlessly incorporated into any DPIA.
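The difference between the UK and Irish positions described above can be laid out as a small decision function. This is only a sketch of the rules as summarized in this article: the names and regulator codes are hypothetical, and it deliberately ignores every other DPIA trigger that may apply.

```python
from enum import Enum


class DpiaOutcome(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    NOT_INDICATED = "not indicated by this rule"


def dpia_outcome(regulator: str, invisible_processing: bool,
                 high_risk_indicator: bool) -> DpiaOutcome:
    """Apply the invisible-processing rule as summarized in the article.

    regulator: "UK" (ICO list) or "IE" (Irish regulator's list); hypothetical codes.
    """
    if not invisible_processing:
        # This rule is silent; other DPIA triggers may still apply.
        return DpiaOutcome.NOT_INDICATED
    if regulator == "IE":
        # Irish list: invisible processing alone is enough.
        return DpiaOutcome.REQUIRED
    if regulator == "UK":
        # ICO list: mandatory only when combined with a high-risk indicator,
        # but a DPIA is still advised otherwise.
        if high_risk_indicator:
            return DpiaOutcome.REQUIRED
        return DpiaOutcome.RECOMMENDED
    raise ValueError(f"No rule sketched for regulator {regulator!r}")
```

In every branch where invisible processing is present the result is "required" or "recommended," which matches the practical conclusion that a web scraper should normally carry out a DPIA.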
Article 5 of the GDPR sets out the key data protection principles that organizations must adhere to in all their processing activities. One of the most obvious requirements for web scrapers is the data minimization principle in Article 5(1)(c) of the GDPR, which states that personal data should be 'adequate, relevant and limited to what is necessary in relation to the purposes for which they are processed.'
Web scraping inherently involves large-scale data collection. As a result, personal data may be collected incidentally. Therefore, organizations must consistently ensure that any personal data collected through web scraping is directly applicable to the purpose for which it is collected. General assumptions that all potential personal data related to the purpose must be scraped will not align with this principle. Likewise, the unintentional collection of personal data would not be compliant if it serves no purpose and is simply gathered because the tool harvests everything on a webpage.
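One practical way to avoid harvesting everything on a page is to keep only the fields that have been tied to a documented purpose and discard the rest before storage. The following is a minimal sketch; the field names and the allow-list are hypothetical and would in practice come from the organization's own purpose assessment.

```python
# Hypothetical allow-list: fields the organization has linked to its stated purpose.
ALLOWED_FIELDS = {"company_name", "registered_address"}


def minimize(record: dict, allowed: set) -> dict:
    """Keep only allow-listed fields; everything else is dropped before storage."""
    return {key: value for key, value in record.items() if key in allowed}


scraped = {
    "company_name": "Example Ltd",
    "registered_address": "1 Example Street",
    "director_email": "director@example.com",  # incidental personal data
}
stored = minimize(scraped, ALLOWED_FIELDS)
```

Here `stored` contains only the company name and registered address; the incidentally collected email address never reaches the data store.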
Intellectual property rights
Aside from the important data protection considerations mentioned earlier, web scrapers must also assess whether their activities infringe another person's or organization's intellectual property rights.
At the fundamental level, the standard rules of copyright protection are applicable. For instance, compositions of words or individual images are works that may independently hold copyright.
Going beyond this, there is copyright protection extended to tabular arrangements, compilations, and databases. A significant amount of data on the internet is presented in this format. In essence, the structure of the database is protected when considerable intellectual effort has been invested in the development of the database’s structure.
Sui generis database right
While copyright is relevant, it does not safeguard the entirety of a database's contents (except for content that in itself is afforded copyright protection). The Database Directive (96/9/EC) established legal protection for databases. The Directive was implemented in the UK as The Copyright and Rights in Databases Regulations 1997 (the Regulations).
This sui generis (unique) right, known as the 'database right' under the Regulations, can come into play when copyright falls short. This right can exist if a database is systematically arranged, and there has been a 'substantial investment in obtaining, verifying, or presenting the contents of the database.' The key point here is that the 'investment' (in terms of time, money, or materials) must be in the formation of the database, not in the creation of the data itself.
The EU has recently implemented additional legislation that could prove beneficial to web scrapers: the Digital Copyright Directive (Directive (EU) 2019/790). This directive permits 'extractions' from databases for 'text and data mining' purposes, including commercial use, unless the rights holder reserves its rights in the database to prevent such data mining (via its terms and conditions or website metadata). Under the directive, text and data mining refers to automated techniques used to generate information such as patterns, trends, and correlations. Accordingly, the activities of a web scraper might not automatically fall within this exception. Notably, the UK has recently declined to implement similar legislation.
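A scraper can at least check for machine-readable signals before extracting anything. The sketch below uses Python's standard `urllib.robotparser` module to test a URL against the text of a site's robots.txt file. Treating robots.txt as the relevant 'website metadata' is an assumption made for illustration: whether a given signal amounts to a valid reservation of rights is a legal question, and terms and conditions would need separate review. The function name, user agent, and URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text does not disallow this URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules from text already retrieved
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content for illustration.
robots = "User-agent: *\nDisallow: /private/\n"

blocked = may_fetch(robots, "example-bot", "https://example.com/private/data")
allowed = may_fetch(robots, "example-bot", "https://example.com/public")
```

With this sample file, `blocked` is `False` and `allowed` is `True`. A permitted fetch under robots.txt says nothing about data protection obligations or contractual terms, which still have to be assessed separately.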
The reality is that web scrapers accessing databases, whether public or private, without a valid license in place will almost certainly be operating in the dark as to the origin of the information. Without this knowledge, scraping will always carry a certain level of risk.
Potentially, a website operator can incorporate provisions in its user agreement to prohibit web scraping by users. However, according to the European Court of Justice, the enforceability of such provisions in cases where the web scraper has not explicitly or implicitly agreed to them is a matter for domestic law. In the UK, the courts have not clearly ruled on this issue (although enforcement is possible, depending on how the website is set up). The stance within the EU is similarly dependent on specific factual circumstances.