France: CNIL's AI how-to sheets - an overview

In this Insight article, Daniela Schott and Kristin Bauer, from KINAST, explore the intricacies of data protection in artificial intelligence (AI) system development, shedding light on the critical considerations, legal foundations, and guidelines provided by the French Data Protection Authority (CNIL).

On October 16, 2023, the CNIL published guidelines on data protection in the context of AI. The guidelines are published as seven how-to sheets and provide practical guidance for the development of AI systems and the creation of datasets containing personal data used to train these systems. The guidelines state that they apply only to the development phase of AI systems, not to the deployment phase, and they explicitly exclude data processing that falls under the Police Directive, State Security, and National Defense. The focus is on data processing within the framework of the EU General Data Protection Regulation (GDPR), and the guidelines are aimed at professionals with a legal and technical background, such as data protection officers (DPOs) and AI professionals.

They distinguish between three scenarios:

  • where no personal data is present in a dataset (in which case the guidelines may nevertheless provide best practice recommendations);
  • where personal data is definitely present; and
  • where personal data is present but not explicitly intended for collection. If personal data exists incidentally, the guidelines are applicable, although measures may be taken to delete such data through manual or automatic verification, thereby removing their personal nature.
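
The automatic verification mentioned in the third scenario could, for text data, look something like the following minimal Python sketch. The two regular expressions, the placeholder tokens, and the function names are hypothetical and chosen for illustration; they are not taken from the CNIL sheets. Pattern matching of this kind will miss many forms of personal data (names, addresses, identifiers), so it would normally complement manual review rather than replace it.

```python
import re

# Hypothetical patterns for illustration only; real screening needs far
# broader coverage and human review of the flagged records.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d .-]{7,}\d")


def redact_incidental_personal_data(text: str) -> tuple[str, int]:
    """Replace pattern matches with placeholders; return (clean_text, hits)."""
    text, n_email = EMAIL_RE.subn("[EMAIL]", text)
    text, n_phone = PHONE_RE.subn("[PHONE]", text)
    return text, n_email + n_phone


def screen_dataset(records: list[str]) -> tuple[list[str], list[int]]:
    """Redact every record and list the indexes that contained a match."""
    cleaned, flagged = [], []
    for i, record in enumerate(records):
        clean, hits = redact_incidental_personal_data(record)
        cleaned.append(clean)
        if hits:
            flagged.append(i)  # candidates for additional manual verification
    return cleaned, flagged
```

Calling screen_dataset on a list of strings returns the redacted strings together with the positions of the records that were changed, which gives a reviewer a starting point for manual checks.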

The CNIL's guidelines encompass various AI systems, including machine learning systems (supervised, unsupervised, and reinforcement learning) and deterministic systems based on logic and knowledge.

The guidelines extend to both specific-purpose and general-purpose AI systems, whether involving continuous learning or single-time learning.

The guidelines distinguish two main phases in the life of an AI system: the development phase (consisting of system design, dataset creation, and learning/training) and the deployment phase. Continuous learning systems collect data for iterative improvements throughout their usage.

Further, the guidelines do not address the shutdown or deletion phase of an AI system, although they emphasize the importance of adhering to data retention limits.

Sheet 1: Determining the applicable legal regime

The CNIL emphasizes compliance with relevant regulations when handling personal data in AI system development and dataset creation. It outlines the need to determine the applicable legal regime, clarifying that the GDPR generally governs personal data processing during the development phase, while different legal regimes may apply to the development and deployment phases of an AI system. The CNIL highlights two cases:

Case 1: When the purpose of using the AI system is clear from the development phase and aligns with the deployment phase, the same legal regime applies. For instance, if the development aims at achieving specific operational goals identifiable during deployment, they generally fall under the same legal framework.

Case 2: Some AI systems, termed general purpose, lack specific operational uses during development but are later applied in various contexts. Here, the legal regime for development might differ from that in the deployment phase. Typically, processing during development is considered under the GDPR, subject to case-by-case analysis. For instance, developing a voice recognition model for potential commercial uses falls within GDPR scope during dataset creation, yet deployment may involve law enforcement directives depending on its operational use.

In essence, the CNIL advises determining the legal framework during AI system development based on the system's intended operational use, emphasizing compliance with GDPR unless specific legal requirements or national security directives apply during deployment.

Sheet 2: Defining a purpose

The CNIL underscores the importance of purpose definition when handling personal data in creating datasets for AI systems, in line with the GDPR. This entails establishing clear objectives at the project's inception, documenting them in the record of processing activities, and ensuring that they are comprehensible and aligned with the organization's tasks.

The CNIL distinguishes two cases based on the operational use identification of AI systems:

Case 1: When the AI system's deployment phase purpose is precisely determined during the development phase, both phases pursue a singular objective. For instance, creating a dataset of images of trains to develop an algorithm that measures metro attendance aligns with the purposes of both phases.

Case 2: General-purpose AI systems lack clear operational uses during development but are employed later in various contexts. Creating datasets for such systems without predetermined operational goals does not meet GDPR criteria unless the purpose is detailed and explicit.

For compliance, the defined purpose must be detailed enough, referring to the system type (e.g., language models, computer vision systems) and foreseeably feasible functionalities at the development stage. Generic purposes such as 'development of a generative AI model' lack the necessary precision.

The CNIL advises controllers to identify potential risks, mention design-excluded functionalities, and specify usage conditions for AI system purposes. Additionally, while certain derogations may be advantageous for data processing in scientific research, it is still essential to define the research purpose, allowing flexibility in specificity as research progresses.

In summary, the CNIL's guidelines emphasize a meticulous definition of purposes in AI dataset creation, aligning with GDPR principles and ensuring transparency, specificity, and compatibility with the organization's tasks while allowing flexibility for scientific research.

Sheet 3: Determining the legal qualification of AI system providers

AI system providers must determine their legal qualification under the GDPR when creating datasets with personal data for AI system training. They may be classified as controllers, joint controllers, or processors based on their role in determining the purposes and means of data processing.

Providers initiating AI system development and creating training datasets from self-selected data may be labeled as controllers. For instance, a video-on-demand platform reusing customer data to train an AI recommendation system is a controller for this new processing.

When an AI system provider reuses data collected by another entity, a distinction is made between the data diffuser (entity uploading data online) and the reuser (provider processing data for their own use). Both are responsible for separate processing.

Academic hospitals sharing separate medical imaging data for a common AI system's training may be joint controllers. Similarly, a consortium experimenting with smart cameras involving a municipality and two companies to analyze traffic behavior could be joint controllers for the AI system's training dataset.

AI system providers might act as processors when developing systems for clients. However, if the provider determines the purpose and means of AI systems for marketing, they could qualify as controllers. Subcontractors assisting in data collection or processing based on documented instructions are considered processors.

AI system providers must classify themselves accurately as per their role in determining data processing. They need to comply with the GDPR, including ensuring processor compliance and limiting data processing to specified instructions. Moreover, using the same dataset across different services generally indicates the provider's role as a controller in separate processing.

Sheet 4: Ensuring the lawfulness of data processing

Sheet 4 examines the legal bases for processing personal data in AI system training datasets under the GDPR. The GDPR's legal bases (Article 6, together with the Article 9 conditions for special categories of data) serve as potential avenues, each requiring strict adherence to specific criteria and conditions.

To maintain lawfulness, controllers must conduct compatibility assessments when collecting or reusing data, especially when sourcing it from public sources. When reusing publicly accessible datasets, organizations must verify legality, absence of sensitive data, and GDPR compliance. Similarly, when obtaining data from third parties, ensuring legality and GDPR compliance through formal agreements is advised.

The sheet underscores the necessity for comprehensive assessments at each processing stage, prioritizing legality, consent, compatibility, and alignment with the GDPR. This approach ensures the ethical and lawful handling of personal data in AI system training datasets.

Sheet 5: Carrying out a Data Protection Impact Assessment when necessary

Conducting a Data Protection Impact Assessment (DPIA) is pivotal for recognizing and evaluating potential risks tied to handling personal data, allowing the formulation of strategies to alleviate them. Leveraging tools provided by the CNIL aids in preemptively managing risks associated with data processing, ensuring continuous supervision and control.

The DPIA involves key steps, including identifying and assessing risks to individuals whose data may be collected, and evaluating the probability and severity of these risks. Additionally, it involves analyzing measures that empower individuals to exercise their data rights, ensuring transparency and control. Assessment of data processing transparency, encompassing elements like consent and information provision, is also integral.
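
The evaluation of probability and severity can be pictured as a simple risk matrix. The Python sketch below is illustrative only: the four-level scale, the multiplication rule, the threshold, and all names are assumptions made for this example and are not prescribed by the sheets as summarized here.

```python
from dataclasses import dataclass

# Assumed four-level ordinal scale; labels and scoring rule are hypothetical.
LEVELS = {"negligible": 1, "limited": 2, "significant": 3, "maximum": 4}


@dataclass
class Risk:
    description: str
    likelihood: str  # one of the keys in LEVELS
    severity: str    # one of the keys in LEVELS

    def score(self) -> int:
        """Combine likelihood and severity into a single ordinal score."""
        return LEVELS[self.likelihood] * LEVELS[self.severity]


def prioritize(risks: list[Risk], threshold: int = 6) -> list[Risk]:
    """Return the risks at or above the assumed threshold, highest first."""
    flagged = [r for r in risks if r.score() >= threshold]
    return sorted(flagged, key=lambda r: r.score(), reverse=True)
```

A ranking of this kind only orders the risks for discussion; the assessment itself, including the choice of mitigating measures, remains a matter of documented judgment.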

In AI system development, a DPIA is mandatory in specific cases where processing is likely to result in a high risk to individuals' rights and freedoms. The CNIL has mandated DPIAs for certain personal data processing operations, particularly those involving AI systems, such as profiling or automated decision-making.

Determining whether an AI system's use is innovative or constitutes large-scale processing involves evaluating technological novelty, risk awareness, and data volume. For instance, established AI techniques might not be considered innovative, whereas newer methods like deep learning could fall under this category.

Following DPIA completion, strategies are devised to mitigate identified risks. These can include AI-specific technical solutions (e.g., homomorphic encryption, synthetic data), data protection methods (e.g., differential privacy, federated learning), and governance and organizational measures to safeguard data throughout AI development and deployment phases.
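
To give a sense of what one of these measures involves, the following Python sketch applies the Laplace mechanism, a standard building block of differential privacy, to a counting query. It is a minimal illustration under stated assumptions: the function names and the choice of a counting query are hypothetical, and a production system would rely on a vetted library rather than hand-written noise generation.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Release a count with epsilon-differential privacy.

    Adding or removing one person changes a count by at most 1
    (sensitivity 1), so the Laplace scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon adds more noise and gives stronger protection at the cost of accuracy, so the value chosen is itself a design decision that would need to be justified and documented.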

Ultimately, the DPIA stands as a comprehensive risk management tool in AI system development, addressing potential risks specific to personal data processing and establishing effective measures to mitigate them.

Sheet 6: Taking data protection into account in the system design choices

Developing privacy-friendly AI systems requires meticulous attention to design, as mandated by Article 25 of the GDPR. Sheet 6 outlines crucial steps:

  1. Adherence to data protection principles: Ensure compliance with principles like minimization when designing AI systems, considering data sources, selection, and validation of choices.
  2. Specification of deployment objectives: Define clear objectives for system deployment, aligning them with essential information for effective use.
  3. Definition of technical architecture: Choose AI model architectures that respect individuals' rights by minimizing data while meeting performance goals.
  4. Model training considerations: Account for uncertainties in architecture performance during training, utilizing scientific knowledge and state-of-the-art techniques.
  5. Privacy-centric design choices: Utilize protocols like federated learning and cryptographic resources to control data access, while also limiting data collection.
  6. Identification of essential data: Select necessary personal data based on relevance, adequacy, and minimization principles, considering volume, categories, and sensitivity.
  7. Validation of design decisions: Validate choices through pilot studies and ethical committee involvement to ensure technical relevance and ethical considerations.
  8. Incorporating ethical committees: Engage diverse, independent ethical committees to evaluate societal consequences, prevent misuse, and ensure ethical AI development.

Prioritizing data protection in AI system design involves setting clear objectives, choosing appropriate technical architectures, making informed data selections, and incorporating ethical oversight, ensuring responsible and ethical development.

Sheet 7: Taking data protection into account in data collection and management

The sheet explores the criticality of data protection in data collection and management, stressing the necessity to adhere to standards throughout the process. It underscores implementing privacy by design principles from the outset of AI system development, focusing on privacy considerations during dataset collection. For instance, it advises restricting web scraping to freely accessible data, setting precise collection criteria, and promptly removing irrelevant data.

Data cleaning ensures quality by rectifying errors, eliminating duplicates, and removing unnecessary fields. Selecting relevant data for AI learning requires identifying essential characteristics while ensuring balanced representation among interest classes.
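
A minimal Python sketch of these cleaning steps is shown below. The field names, the normalization rule (trimming whitespace and lowercasing strings), and the helper names are assumptions made for illustration; which corrections are appropriate depends on the dataset.

```python
from collections import Counter


def clean_records(records, keep_fields, label_field):
    """Drop unneeded fields, normalize strings, and remove exact duplicates.

    label_field must be one of keep_fields; values are assumed hashable.
    """
    seen, cleaned = set(), []
    for record in records:
        row = {}
        for field in keep_fields:
            value = record.get(field)
            # Assumed normalization: trim whitespace and lowercase strings.
            row[field] = value.strip().lower() if isinstance(value, str) else value
        key = tuple(row[f] for f in keep_fields)
        if row[label_field] is None or key in seen:
            continue  # skip unlabeled rows and duplicates
        seen.add(key)
        cleaned.append(row)
    return cleaned


def class_shares(records, label_field):
    """Share of each class, as a quick check for balanced representation."""
    counts = Counter(r[label_field] for r in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}
```

Fields that are not listed in keep_fields, such as a user identifier that the model does not need, are dropped, which is one simple way of putting minimization into practice at the cleaning stage.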

Various approaches for selecting relevant data are detailed, including feature selection techniques, interactive data annotation like active learning, and dataset pruning. Data retention periods are stipulated under the GDPR, requiring data controllers to justify retention periods, considering the purpose of data collection. The sheet suggests setting retention periods during the development phase and subsequently for maintenance or improvement purposes, ensuring adherence to the principle of data minimization.
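
The idea of purpose-specific retention periods can be sketched in Python as follows. The two purposes mirror the development and maintenance stages mentioned above, but the durations, field names, and function names are hypothetical; the actual periods must be determined and justified by the controller.

```python
from datetime import date, timedelta

# Hypothetical retention periods per purpose; real values must be justified
# by the controller and are not given in the source.
RETENTION = {
    "development": timedelta(days=365),
    "maintenance": timedelta(days=3 * 365),
}


def is_expired(collected_on: date, purpose: str, today: date) -> bool:
    """True once the retention period for the given purpose has passed."""
    return today > collected_on + RETENTION[purpose]


def split_by_retention(records, today: date):
    """Return (to_keep, to_delete_or_anonymize) lists of records."""
    keep, expired = [], []
    for record in records:
        if is_expired(record["collected_on"], record["purpose"], today):
            expired.append(record)
        else:
            keep.append(record)
    return keep, expired
```

Running a check of this kind on a schedule gives the controller a list of records due for deletion or anonymization, along with a record of when the check was performed.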

Emphasis is placed on implementing security measures like encryption, access restrictions, and documentation to safeguard data during collection, storage, and usage. Detailed documentation aids in demonstrating lawful collection, monitoring until deletion or anonymization, and mitigating unintended data use.

Ongoing monitoring is crucial to detect data drift, requiring periodic analysis to ensure data relevance and sufficiency for processing purposes. The sheet underscores the meticulous and layered process of collecting, managing, and safeguarding data in AI system development, emphasizing the pivotal role of privacy and security adherence throughout these procedures.
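
One way to carry out such a periodic analysis for a single numeric feature is to compare a reference sample with recent data. The Python sketch below uses the two-sample Kolmogorov-Smirnov statistic for this purpose; the choice of statistic, the threshold value, and the function names are assumptions for illustration and are not recommendations drawn from the sheet.

```python
from bisect import bisect_right


def ks_statistic(reference, current) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between the
    empirical cumulative distribution functions of the two samples."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every value seen in either sample.
    for x in set(ref) | set(cur):
        f_ref = bisect_right(ref, x) / len(ref)
        f_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(f_ref - f_cur))
    return gap


def has_drifted(reference, current, threshold: float = 0.2) -> bool:
    # The threshold is an assumed, illustrative value.
    return ks_statistic(reference, current) > threshold
```

A flagged feature is a prompt to investigate whether the dataset is still relevant and sufficient for its purpose, not proof of a problem in itself.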

Conclusion

The sheets published by the CNIL provide useful guidance not only for France-based AI developers and DPOs but also for anyone who intends to set up a data protection-compliant AI system. They therefore offer comprehensive information for professionals involved in AI development, helping them navigate the complexities of handling personal data within the GDPR framework.

Daniela Schott, LL.M. Attorney at Law (Germany)
Kristin Bauer, CIPP/E Attorney at Law (Germany)
KINAST Rechtsanwaltsgesellschaft mbH
Attorneys at Law (Germany)