By Emily Litka
On December 18th, in response to a request from the Irish Supervisory Authority (“SA”), the European Data Protection Board (the “EDPB”) published an opinion (the “Opinion”) on the application of the GDPR to certain aspects of AI model development and deployment. Specifically, the Opinion addressed: (a) when and how a model can be considered anonymous; (b) if and how controllers can rely on legitimate interests for model development and deployment; and (c) what the consequences of unlawful processing in the development phase should be for the deployment and further operation of a model.
Key Highlights
Overall, the risks surrounding AI, in particular generative AI, that are discussed in this Opinion are not novel. They have been discussed heavily in the press and in early concerns raised by regulators. Where the Opinion sheds particularly helpful light is on the types of mitigations that SAs will expect to see and will likely consider reasonable.
The Opinion is dense with qualified language. Despite the stated goal of helping ensure consistent application of the GDPR throughout the EEA, the Opinion still leaves significant room for interpretation to the SAs and may result in varied applications.
The Opinion concludes that models trained on personal data can be anonymous. However, the Opinion encourages SAs to, “by default,” conduct a thorough evaluation of claims of anonymity, and notes that the SA should check that it has received sufficient evidence showing that personal data used to train the model cannot be extracted (e.g., through membership inference attacks) and that the model will not produce such data in its outputs. These are still active areas of research, and sufficiently documenting claims of anonymity may be exceedingly difficult and/or cost prohibitive.
The Opinion leaves the door open for legitimate interests to be a valid legal basis for model development and deployment. Controllers should consider the risks and mitigations that the Opinion raised, document their legitimate interest assessment, and be prepared to provide it upon request. Depending on the type of model and use case, this could be a significant burden to meet.
For models developed unlawfully, the Opinion affirms that SAs maintain broad corrective powers, including imposing fines, ordering the deletion of training datasets, and even ordering the deletion of the model itself. However, unlawful processing of personal data in the development phase will not automatically render deployment unlawful; the impact should be assessed on a case-by-case basis.
Controllers deploying models developed by third parties should conduct due diligence assessments prior to deployment to assess whether a model was lawfully developed. Third-party deployers should also consider contractual protections to address instances where a model they license from a developer is later found to have been unlawfully developed, which may impact operations.
Can AI models be considered anonymous?
At the outset, the Opinion addressed the hotly debated topic of whether AI models trained on personal data can be considered anonymous. It notes that, although a model’s parameters and encodings may not appear identifiable to the human eye, they may be organized in a manner that permits identification. The EDPB concludes that AI models trained on personal data cannot, in all cases, be considered anonymous; anonymity is a question that should be addressed on a case-by-case basis.
For an AI model to be considered anonymous, the Opinion states that there must be an “insignificant” likelihood that (1) personal data can be extracted from the model; and (2) a model’s output will contain personal data that was contained in the model’s training data.
The Opinion directs SAs to evaluate the controller’s documentation to determine whether a model is anonymous. In such documentation, the controller should demonstrate that it assessed the model based on “all of the means reasonably likely to be used to identify individuals” in or using the model by both the controller and unintended third parties that could access or use the model. The Opinion provides a non-exhaustive list of factors for SAs to evaluate, including:
whether any privacy preserving techniques were used to eliminate personal data in training data;
whether output measures are in place that lower the likelihood that personal data from training data can be obtained from querying the model;
the scope, frequency, quantity, and quality of tests that the controller has conducted on the model;
whether the model has been tested against state-of-the-art attacks;
whether the model has been technically designed to account for anonymization and, if so, whether it has been developed according to that plan and subject to engineering governance; and
whether the assessments are documented, including whether the controller conducted a DPIA (or has documentation in place for why a DPIA isn’t required).
In summary, controllers that want to claim a model is anonymous should conduct thorough re-identification risk assessments that test against state-of-the-art attacks and extraction methods; a minimal illustration of one such test follows below. Controllers should document their assessments and be prepared to demonstrate them to SAs upon request. Given the rapid pace at which this technology is evolving, developers should consider conducting assessments at appropriate intervals to ensure that they are assessing against up-to-date extraction techniques.
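By way of illustration only, the sketch below shows a simple loss-based membership inference test of the kind such an assessment might include. It is a minimal sketch under assumed conditions: a scikit-learn-style classifier, synthetic data standing in for training (“member”) and held-out (“non-member”) records, and an attack AUC as the reported metric. None of these choices are prescribed by the Opinion.

```python
# Minimal, illustrative loss-threshold membership inference check.
# The dataset, model, and metric are placeholder assumptions, not the
# EDPB's required methodology.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data: "members" are used for training, "non-members" are held out.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_mem, X_non, y_mem, y_non = train_test_split(X, y, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_mem, y_mem)

def per_example_loss(clf, X, y):
    # Negative log-likelihood of the true label for each example.
    probs = clf.predict_proba(X)
    return -np.log(probs[np.arange(len(y)), y] + 1e-12)

loss_mem = per_example_loss(model, X_mem, y_mem)
loss_non = per_example_loss(model, X_non, y_non)

# Lower loss suggests an example was seen in training. An attack AUC near 0.5
# indicates little membership signal; values well above 0.5 indicate a
# re-identification risk that should be documented and mitigated.
scores = np.concatenate([-loss_mem, -loss_non])
labels = np.concatenate([np.ones(len(loss_mem)), np.zeros(len(loss_non))])
print("membership-inference AUC:", roc_auc_score(labels, scores))
```

A real assessment would repeat this kind of test across model versions and against more capable, state-of-the-art attacks, and would retain the results as part of the documentation SAs may request.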
Can legitimate interests be a valid legal basis for developing and deploying AI models?
Much of the Opinion discussed this question. It confirmed that legitimate interests can be a valid legal basis where controllers satisfy the three-part test. The Opinion provided a non-exhaustive list of factors that controllers should assess if they seek to rely on legitimate interests for AI model development and deployment.
Step 1: Identify a legitimate interest pursued by the controller or a third party
The first step requires a controller to identify a legitimate interest pursued by them or by third parties. An interest is the benefit that a controller or third party may have in engaging in a specific processing activity. For an interest to be legitimate, it must be lawful, clearly and precisely articulated, and real and present (not merely speculative). The Opinion provides that the following may be legitimate: developing a conversational agent to assist users; developing an AI system to detect fraudulent content or behavior; and improving threat detection. Notably, the examples provided relate to specific applications of AI models but do not address the potential interests of underlying foundation models, which are often necessary to enable specific application layers.
Step 2: Analyze the necessity of the processing for the purpose(s) identified
The second step requires a controller to determine whether the processing of personal data is necessary for the interests identified in Step 1. Controllers must assess whether the processing activity will allow the interest to be pursued and whether there is a less intrusive way to pursue the interest.
Notably, the Opinion recommends that SAs pay particular attention to the volume of personal data processed for model development to ensure that it is necessary for the interest identified. It also recommends that SAs evaluate whether a controller has a direct relationship with the data subject to determine whether the processing can be completed with less intrusive means. As a practical matter, developers who collect data by scraping should be mindful to demonstrate in their analysis the mitigations they’ve implemented to address these concerns, though questions remain about the level of evidence needed to show that a particular training task could not have been done with less data.
Step 3: Determine whether the legitimate interests are overridden by the interests or rights and freedoms of the data subjects
The third step requires a controller to balance the interests, rights, and freedoms of the data subject with the interest of itself and/or a third party.
Data Subject Rights and Freedoms: Data subject interests are those that may be affected by the processing.
For model development, the Opinion states that data subjects’ rights and freedoms include:
the right to self-determination; and
the ability to retain control over their personal data collected for model development.
For model deployment, data subjects’ rights and interests include:
retaining control over their personal data processed by the deployed model;
financial interests (e.g., where a model is used by the data subject to generate revenue or in their professional capacity);
personal benefits (e.g., where a model is used to improve accessibility); and
socioeconomic interests (e.g., where a model enables access to healthcare or education).
Risks to Data Subjects Rights and Freedoms: The Opinion states that the development and deployment of AI models can “raise serious risks.”
For model development, the Opinion states that risks can surface when:
data is scraped against data subjects’ wishes or without their knowledge; and
large-scale and indiscriminate scraping occurs, which can create a sense of surveillance and lead to self-censoring.
For model deployment, risks can surface when:
a model processes data in a manner that contravenes data subjects’ rights;
it is possible to infer, accidentally or by attack, what personal data the model was trained on, which can lead to reputational risk, identity theft or fraud, and security risk;
a model is used to block content publication from data subjects, which can lead to risks to freedom of expression;
a model recommends inappropriate content, which can lead to mental health risks;
a model’s recommendation leads to adverse consequences on the data subjects’ right to engage in work (e.g., when job applications are pre-selected using a model);
a model discriminates against individuals based on certain personal characteristics; and
a model creates security and safety risks to the data subject (e.g., where a model is used with malicious intent).
The Opinion further stressed that the reasonable expectations of data subjects play “a key role in the balancing test” because of the complexity of the technology used for AI model development and deployment, which may be difficult for data subjects to understand. The Opinion suggested that meeting the transparency requirements under the law “is not sufficient in itself” and that explaining the processing in a privacy policy does not necessarily mean the data subject reasonably expects the processing. This sentiment was also echoed in the recent fine imposed on OpenAI by the Italian SA. As part of its sanction, the Italian SA is requiring OpenAI to carry out a six-month campaign on radio, television, newspapers, and the internet with the intended goal to “promote public understanding and awareness of the functioning of ChatGPT, in particular on the collection of user and non-user data for the training of generative artificial intelligence and the rights exercised by data subjects, including the rights to object, rectify and delete their data.”
The idea that "mere[ly] fulfill[ing] the transparency requirements” under the GDPR is not itself sufficient to conclude individuals should reasonably expect that their data may be used for model training could introduce new transparency challenges, especially for controllers collecting data about data subjects with whom they do not have a direct relationship.
Risk Mitigation Measures: Where the balancing test indicates that the data subjects’ rights and freedoms override the legitimate interest(s) pursued by the controller or a third party, the controller can implement mitigating measures to limit the impact of the processing on data subjects. Mitigating measures are safeguards tailored to the circumstances of the case and will depend on different factors, including the intended use of a model.
For model development, they include:
pseudonymizing personal data prior to training (see the illustrative sketch after this list);
minimizing the personal data used for training to what is necessary (e.g., removing, redacting, or replacing identifiers before training);
observing a period of time between data collection and use for training, during which data subjects can object to the processing;
releasing information about the collection criteria and the datasets used for development; and
leveraging alternative forms of notice to inform data subjects that their data may be processed for development (e.g., media campaigns, transparency labels, model cards, annual transparency reports).
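As a purely illustrative example of the first two mitigations, the sketch below replaces a direct identifier with a salted, deterministic token before a record enters a training corpus. The record format, salt handling, and hashing scheme are assumptions for the example; real pipelines would require proper key management and documented re-identification risk analysis.

```python
# Minimal sketch: pseudonymize a direct identifier before training use.
# Field names and the hard-coded salt are illustrative placeholders only.
import hashlib

SALT = b"example-salt"  # in practice, store and rotate secrets securely

def pseudonymize(value: str) -> str:
    # Deterministic, salted hash so the same person maps to the same token
    # without the original identifier appearing in the training data.
    return hashlib.sha256(SALT + value.encode("utf-8")).hexdigest()[:16]

record = {"email": "jane.doe@example.com", "text": "Support ticket body ..."}
training_record = {
    "user_token": pseudonymize(record["email"]),  # identifier replaced
    "text": record["text"],
}
print(training_record)
```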
The Opinion provides mitigations to consider for web-scraping specifically, including:
not collecting personal data from publications that present heightened risks;
not collecting personal data from websites or sections of websites that have objected to processing (e.g., via a robots.txt file; see the sketch after this list); and
creating an opt-out list for data subjects to inform the controller that they don’t want their data collected from certain websites.
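On the robots.txt point, a minimal sketch of honoring a site’s exclusion rules before collection might look like the following, using Python’s standard urllib.robotparser. The URL and user-agent string are hypothetical, and a real pipeline would also need to respect site terms and any other machine-readable opt-out signals.

```python
# Minimal sketch: check robots.txt before fetching a page for data collection.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_to_fetch(url: str, user_agent: str = "example-training-crawler") -> bool:
    parsed = urlparse(url)
    robots_url = f"{parsed.scheme}://{parsed.netloc}/robots.txt"
    parser = RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetch and parse the site's robots.txt
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    # Hypothetical URL used only to demonstrate the check.
    print(allowed_to_fetch("https://example.com/articles/some-page"))
```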
For model deployment, they include:
using filters to prevent the storage, regurgitation, or generation of personal data in a model’s output (see the illustrative sketch after this list);
using watermarks to minimize the risk of unlawful reuse of a model’s output;
enabling data subjects to have their data deleted from the model’s output; and
using post-training techniques to remove or suppress personal data.
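As an illustration of the output-filter mitigation, the sketch below redacts common direct identifiers (email addresses and phone-number-like strings) from generated text before it is returned. The patterns are assumptions for the example and would not catch all personal data; production filters typically combine pattern matching, named-entity recognition, and policy review.

```python
# Minimal sketch: redact obvious direct identifiers from model output.
# The regular expressions are illustrative and intentionally simple.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact_identifiers(text: str) -> str:
    text = EMAIL_RE.sub("[REDACTED EMAIL]", text)
    text = PHONE_RE.sub("[REDACTED PHONE]", text)
    return text

print(redact_identifiers("Contact Jane at jane.doe@example.com or +1 (555) 123-4567."))
```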
If a model is developed using unlawfully collected personal data, can the model be used in deployment?
The last section of the Opinion discussed what the impact of unlawful training should be on the further use of the model in deployment. It concluded that SAs maintain broad discretionary powers to address infringements that occur in development, including by imposing fines, ordering the deletion of the whole or part of a dataset that was processed unlawfully, and/or ordering the deletion of a model itself. However, unlawful processing in development will not always result in a model being unlawful to use in deployment. The Opinion considers three scenarios:
Scenario 1: A controller developed a model with personal data that was unlawfully processed, the personal data is retained in the model, and the model is deployed by the same controller
In this scenario, the unlawfulness of the processing in the development phase may impact the lawfulness of the processing in the deployment phase. The Opinion directs SAs to look to whether the development and deployment phases constitute separate processing activities to determine whether the unlawful processing in development impacts the lawfulness of the subsequent processing in deployment.
Scenario 2: A controller developed a model with personal data that was unlawfully processed, the personal data is retained in the model, and the model is deployed by another controller
Like scenario 1, the unlawfulness of the processing in the development phase may impact the lawfulness of the processing in the deployment phase. The Opinion does not take a firm position on the impact that the unlawful processing in development should have on the subsequent deployment of the model by another controller and, instead, encourages SAs to assess the circumstances on a case-by-case basis. It notes that SAs should consider the sufficiency of the assessment performed by the deploying controller.
It also notes that the assessment must be “appropriate”: as part of the assessment, the deploying controller should determine the source of the data and whether any finding of infringement was known before deployment (e.g., whether the training data included data from a breach).
Scenario 3: The controller who developed the model processed personal data unlawfully, the model is anonymized, and the model is deployed by the same or another controller
In this scenario, the Opinion concludes that, if the developer or deployer can demonstrate that the model is anonymous, the GDPR would not apply to it. In other words, unlawful processing in development will not impact operating the model in the deployment phase. However, the Opinion states that SAs can still impose corrective measures for the initial unlawful processing that occurred in development.
To read more, please see the Opinion and the related press release.
Emily Litka is a Senior Associate at Hintze Law PLLC, focusing her practice on global privacy and emerging AI laws and regulations. She regularly counsels on risk during product development, the development and operationalization of privacy programs, the preparation of data protection impact assessments, and the development of internal privacy policies and processes.
Hintze Law PLLC is a Chambers-ranked and Legal 500-recognized, boutique law firm that provides counseling exclusively on privacy, data security, and AI law. Its attorneys and data consultants support technology, ecommerce, advertising, media, retail, healthcare, and mobile companies, organizations, and industry associations in all aspects of privacy, data security, and AI law.