Spain: Is hashing a good pseudonymisation technique?
The Spanish data protection agency, jointly with the European Data Protection Supervisor, issued a paper titled Introduction to the Hash Function as a Personal Data Pseudonymisation Technique ('the Paper'), setting out the main issues regarding hashing techniques as a mechanism for the pseudonymisation of personal data. The Paper is aimed at data controllers who rely on hashing techniques, such as those in the blockchain sector, and analyses the nature and effectiveness of pseudonymisation as a technical measure in light of the risk of re-identifying the message that generated the hash. Dmitry Alekseev, Senior Associate at ECIJA LEGAL SLU, discusses the pros and cons of hashing and what the Paper recommends for those who may use hashing as a pseudonymisation technique.
What are the nature and purposes of hashing?
The purpose of a hash function is to obtain a fixed-length character series from any input message, regardless of its type (for example text, sound, or image) or size, through the use of a mathematical algorithm. The original message is divided into blocks, and the alphanumeric code resulting from the application of a hash function is called the digest, image, hash value, or simply hash; the latter term is used for both the output and the process of applying the algorithm.
The degree of sophistication of the mathematical algorithm applied to the original message or dataset, for example a Secure Hash Algorithm, defines the complexity and, potentially, the level of security of the hashing technique and, thus, the possibility of retrieving the original message. Regarding the protection of personal data, the more elaborate the hashing technique, the smaller the chance of recovering the original dataset and, hence, of re-identifying an individual. According to the Paper, the ideal or desirable properties of a hash function are:
- applicability to any type or category of original message, and regardless of its size;
- deterministic results, that is, the same input always produces the same output, and all outputs have a fixed length;
- the recovery of the original message must be very difficult, if not impossible;
- alteration of a single bit of the original message must yield a completely different digest, a property called diffusion;
- it must be extremely difficult to find a message with the same hash value as a given message (weak collision), or to find any two messages with the same digest (strong collision); and
- every possible output of the hash function has the same probability of occurrence as any other.
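Several of these properties can be observed directly. The following is a minimal sketch, assuming SHA-256 from Python's standard hashlib module as the hash function (the Paper does not prescribe this particular algorithm here); the helper names and sample messages are illustrative only.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return the SHA-256 digest of `message` as a hexadecimal string."""
    return hashlib.sha256(message).hexdigest()


def differing_bits(a: bytes, b: bytes) -> int:
    """Count how many of the 256 digest bits differ between two messages."""
    x = int.from_bytes(hashlib.sha256(a).digest(), "big")
    y = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(x ^ y).count("1")


# Fixed length: a 1-byte and a 1,000,000-byte input both give 64 hex characters.
short_digest = digest(b"a")
long_digest = digest(b"a" * 1_000_000)

# Determinism: hashing the same input twice gives the same digest.
same_twice = digest(b"alice@example.com") == digest(b"alice@example.com")

# Diffusion: flip the lowest bit of the first byte and compare the digests.
original = b"hello world"
flipped = bytes([original[0] ^ 0x01]) + original[1:]
changed_bits = differing_bits(original, flipped)

print(len(short_digest), len(long_digest), same_twice, changed_bits)
```

For a well-behaved 256-bit hash, roughly half of the output bits are expected to change when a single input bit is flipped.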
One of the most common uses of hashing techniques within the privacy and data protection ecosystem is to add a level of security by converting original messages, composed of, for instance, the name, email address, and phone number of a natural person, into a value that does not directly identify an individual. This is very useful for protecting a user’s information, such as a password. Moreover, with strong hashing algorithms, tracing a digest back to the data subject behind the original message requires considerable effort and technological means. However, in a world driven by Big Data and machine learning, even though pseudonymisation techniques such as hashing can help ensure data confidentiality, they are not considered a method of de-identifying or anonymising data, given that there is still a chance of recovering the original information.
Another benefit of using hashed information is the ability to verify rapidly whether a certain piece of information has been changed, thus serving the data integrity principle. This functionality, as well as the ability to keep track of any alterations, is especially useful in the blockchain field.
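The integrity check described above can be sketched as follows. This is a minimal illustration, assuming SHA-256 from Python's hashlib; the record contents and the function names `fingerprint` and `is_unchanged` are made up for the example.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, expected_hex: str) -> bool:
    """Check whether `data` still matches a previously stored digest."""
    # hmac.compare_digest compares the two strings in constant time.
    return hmac.compare_digest(fingerprint(data), expected_hex)


# Illustrative record; the digest is stored when the record is first written.
record = b"name=Ana;city=Madrid"
stored_digest = fingerprint(record)

print(is_unchanged(record, stored_digest))         # True
print(is_unchanged(record + b" ", stored_digest))  # False
```

Only the digest needs to be stored or published to detect later modifications; the record itself does not have to be disclosed to whoever performs the check.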
Consideration of the hash as a unique identifier
From a theoretical point of view, the outcome of applying a hash function to an original message has a finite number of possible values, even though there are infinitely many possible source messages. Consequently, distinct messages can yield the same digest, so a digest is not strictly unique; the set of messages that map to a given digest is its pre-image set.
As of today, it is difficult in practice to find two original messages yielding the same hash value, despite the finite nature of the mechanism, especially when taking into account the concept of 'order', that is, a delimitation of the possible number of original messages, and therefore of hash values, obtained by restricting some of their characteristics.
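The tension between a finite digest space and an unlimited message space can be illustrated with a deliberately weakened hash. The sketch below is an assumption-laden toy, not something taken from the Paper: it keeps only the first 16 bits of SHA-256 (the hypothetical `tiny_hash`), so that only 65,536 digests exist and a collision can be found in moments. With the full 256-bit output, the same search is not feasible in practice.

```python
import hashlib
from itertools import count


def tiny_hash(message: bytes, bits: int = 16) -> int:
    """Keep only the first `bits` bits of SHA-256 (deliberately weak)."""
    full = int.from_bytes(hashlib.sha256(message).digest(), "big")
    return full >> (256 - bits)


def find_collision(bits: int = 16):
    """Hash "0", "1", "2", ... until two messages share a truncated digest."""
    seen = {}
    for i in count():
        message = str(i).encode("ascii")
        value = tiny_hash(message, bits)
        if value in seen:
            return seen[value], message, value
        seen[value] = message


# Both messages belong to the pre-image set of `shared_digest`.
first, second, shared_digest = find_collision()
print(first, second, shared_digest)
```

The search is guaranteed to end after at most 65,537 messages, because there are more messages than possible 16-bit digests.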
Potential re-identification of the original message
Although this is undesirable, it may be possible to trace a digest back to the original message.
The order, which supports the uniqueness and effectiveness of the hash value, also acts as a backdoor to recovery of the original message. The order inherent in the hashed data can assist in re-identification of the source input by revealing the logic behind a certain piece of information. For instance, taking an example from the Paper, a Spanish mobile phone number is composed of 12 characters: a plus sign, a two-digit country code, and the nine digits of the actual telephone number. Knowing this order drastically reduces the theoretical number of possible original messages:
- the first three characters are always fixed, namely '+34';
- the fourth character is either 6 or 7;
- the total number of subscribers is limited; and
- knowing the operator of the data subject to whom the number belongs could narrow down the outcome even further.
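The constraints listed above reduce the search to an enumeration. The following is a minimal sketch of such a dictionary attack, assuming the number was hashed with plain, unsalted SHA-256 as an ASCII string. The phone number is made up, and `known_prefix` is a hypothetical stand-in for whatever the attacker already knows (here, four leading digits, as might follow from knowing the operator's number range).

```python
import hashlib
from typing import Iterator, Optional


def sha256_hex(text: str) -> str:
    """Hash an ASCII string with unsalted SHA-256."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def candidates(known_prefix: str) -> Iterator[str]:
    """Yield every '+34' mobile number whose 9 digits start with `known_prefix`."""
    valid = (1 <= len(known_prefix) <= 8
             and known_prefix.isdigit()
             and known_prefix[0] in "67")
    if not valid:
        raise ValueError("prefix must be 1 to 8 digits starting with 6 or 7")
    missing = 9 - len(known_prefix)
    for n in range(10 ** missing):
        yield f"+34{known_prefix}{n:0{missing}d}"


def recover(target_digest: str, known_prefix: str) -> Optional[str]:
    """Return the number whose digest matches, or None if none does."""
    for number in candidates(known_prefix):
        if sha256_hex(number) == target_digest:
            return number
    return None


secret_number = "+34612345678"              # made-up number
leaked_digest = sha256_hex(secret_number)   # all the attacker sees

# With 4 digits known, only 10**5 candidates remain.
recovered = recover(leaked_digest, known_prefix="6123")
print(recovered)  # +34612345678
```

Without any prefix knowledge beyond the leading 6 or 7, the space is 2 x 10^8 numbers, which is still small enough to enumerate on ordinary hardware.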
In other words, the more extensive the order implicit in a message, as in the case of a telephone number or a postal code, the more constrained and, therefore, identifiable the original message is. This undermines the ideal properties of the hash and, ultimately, its appropriateness as a pseudonymisation technique and its effectiveness as a security measure.
That said, original messages with a strict implied order have a low degree of entropy and may be re-identified more easily, owing to the lower risk of collision and the small amount of information contained in the data sets. Greater entropy adds information to the data sets, raising the chances of potential collision and hindering re-identification.
On the other hand, combining the hash with unhashed identifiers, such as passport numbers, with pseudo-identifiers, such as workplace or hobbies, and with other pieces of information drastically increases the possibility of re-identification, something that must be taken into account when evaluating the effectiveness of hashing techniques.
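To put rough numbers on this, entropy can be estimated as the base-2 logarithm of the number of possible messages, assuming each one is equally likely. The sketch below applies that simplifying assumption to the Spanish mobile number example and compares it with an unconstrained 12-character string drawn from the 95 printable ASCII characters (a comparison chosen here for illustration, not taken from the Paper).

```python
import math

# Spanish mobile number: '+34' is fixed, the next digit is 6 or 7,
# and the remaining 8 digits are free.
phone_space = 2 * 10 ** 8
phone_entropy_bits = math.log2(phone_space)        # about 27.6 bits

# Unconstrained 12-character string over 95 printable ASCII characters.
free_space = 95 ** 12
free_entropy_bits = math.log2(free_space)          # about 78.8 bits

print(f"phone number:  {phone_entropy_bits:.1f} bits")
print(f"free 12 chars: {free_entropy_bits:.1f} bits")
```

Each bit of entropy removed halves the number of candidates an attacker has to try, so the implied order in the phone number makes exhaustive search dramatically cheaper than the 12-character length alone would suggest.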
Strategies to prevent re-identification
A number of techniques can hinder re-identification of the original message and raise the effectiveness of hashing:
- perform encryption either before the hash function is applied, that is, on the original message, or after it, that is, on the digest. An encrypted message may be accessed through the use of a key. However, this method presents some drawbacks tied to, among others, the confidentiality of the key, the robustness of the key generation, and the unicity distance principle (the fact that the key is determined by, and implicit in, the encrypted text);
- add a random value ('salt') to the original message before hashing. This method also raises concerns, as the value of the salt is also implicit in the generated hash, and the appropriateness of this mechanism depends, inter alia, on the length, uniqueness, and randomness of the salt; and
- use a single-use salt technique, where the hash function is applied to a set composed of a random salt, a one-use salt, and the original message. The major benefits of this mechanism are that, in order to re-identify the original message, it is necessary to have access to both the one-use salt and the random salt associated with said message, and the possibility of collision is guaranteed.
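The two salt-based strategies can be sketched as follows. This is one possible reading of the description above, not the Paper's own construction: the salts are simply prepended to the message before applying SHA-256, the salt length of 16 bytes is an arbitrary choice, and the function names are hypothetical.

```python
import hashlib
import secrets
from typing import Tuple


def salted_digest(message: bytes, salt: bytes) -> str:
    """Hash the message together with a secret random salt."""
    return hashlib.sha256(salt + message).hexdigest()


def single_use_digest(message: bytes, random_salt: bytes) -> Tuple[str, bytes]:
    """Hash random salt + fresh one-use salt + message; return both results."""
    one_use_salt = secrets.token_bytes(16)  # new value on every call
    value = hashlib.sha256(random_salt + one_use_salt + message).hexdigest()
    return value, one_use_salt


phone = b"+34612345678"                  # made-up number
random_salt = secrets.token_bytes(16)    # must be kept confidential

# Same salt: the same message always gives the same digest, so records
# about the same person remain linkable to each other.
a = salted_digest(phone, random_salt)
b = salted_digest(phone, random_salt)

# Single-use salt: each hashing of the same message gives a new digest.
first, first_salt = single_use_digest(phone, random_salt)
second, second_salt = single_use_digest(phone, random_salt)

print(a == b, first == second)
```

Whether linkability across records is wanted is a design decision: a shared salt preserves it, while a single-use salt removes it unless the one-use salts are stored and protected alongside the data.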
Other strategies for preventing re-identification, such as differential privacy models, add noise to the original message in the form of random, unrelated information (including graphics and sound) that has no correlation with the source message.
Regardless of the chosen technique, it is crucial to analyse the appropriateness of hashing as a measure for the protection of personal data holistically and in light of the actual, real-life processing operation. The Paper sets out, for illustrative purposes, some variables that determine the degree of risk of re-identification of the original message and, thus, the suitability of the hashing technique:
- the type of algorithm used and the environment of processing, such as local or cloud;
- the entropy of the message and message space;
- the potential linkage to other bits of information;
- the keys, passwords, and the policies governing the same;
- the time during which the original message must stay robust; and
- continuous audits of the above.
These attributes are also applicable to encryption techniques, where the appropriateness is dictated by the useful life of the original message, that is, the period during which it has value or relevance.
Can data be anonymised through hashing?
Anonymised data no longer has a link to a specific individual and cannot be used to identify a particular natural person, which automatically places any processing activity involving anonymised or de-identified information outside the scope of the General Data Protection Regulation (Regulation (EU) 2016/679). This is a big advantage for some industries, or even for certain stages of a processing operation, where the data is being manipulated for analytical or statistical purposes and there is no need to identify an individual.
Given that an anonymised piece of information must not be re-identifiable through reasonable means by the data controller or a third party, the feasibility of anonymising personal data through hashing depends on whether the controller can still re-identify the data subjects in the supposedly anonymised file. Where a hash value can no longer be re-identified, it would be deemed anonymised.
A hashing technique capable of effectively pseudonymising a message, so that a data subject can no longer be identified without the use of additional information, is a valuable technical mechanism. It adds an extra layer of security to a processing activity, especially bearing in mind its implementation costs, given the number of alternatives available and its relative simplicity of use. Nevertheless, the use of a hashing technique and the evaluation of its suitability must in all cases be accompanied by a thorough analysis of the risks in order to avoid potential re-identification of the original data.
Dmitry Alekseev Senior Associate
ECIJA LEGAL SLU, Madrid