Thursday 24 August 2017 1:05 am

Re-identification of personal data: The common crime you’ve never heard of

There was a sense of inevitability to the Data Protection Bill announced by the UK government this month.

What especially caught my eye was the new criminal offence “of intentionally or recklessly re-identifying individuals from anonymised or pseudonymised data”, punishable by an unlimited fine.

It’s a welcome move that shines a spotlight on a relatively unknown but common crime with far-reaching consequences. It also asserts in the strongest of terms that hashing personally identifiable information will no longer be considered good enough.

Here’s why. Anonymised data is simply data about people where the individuals are no longer identifiable, usually because the results have been aggregated. At a very basic level, pseudonymisation maintains the data in a non-aggregated form but removes or hashes identifying features, such as a name. Both have the advantage of making the data “non-personal” and taking it outside the scope of data protection laws, making it easier to store, move, share and analyse.
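
To illustrate why hashing a name on its own offers little protection, here is a minimal Python sketch. It assumes a naive scheme in which names are replaced by unsalted SHA-256 digests; the records, the names and the helper functions `pseudonymise` and `reidentify` are all hypothetical and do not describe any real dataset or product. Anyone holding a list of likely names can hash each one and match the results against the release.

```python
import hashlib


def pseudonymise(name: str) -> str:
    """Replace a name with its unsalted SHA-256 digest (a naive scheme)."""
    return hashlib.sha256(name.strip().lower().encode("utf-8")).hexdigest()


def reidentify(hashed_ids, candidate_names):
    """Dictionary attack: hash every candidate name and look for matches."""
    lookup = {pseudonymise(name): name for name in candidate_names}
    return {h: lookup[h] for h in hashed_ids if h in lookup}


# Hypothetical "pseudonymised" release: names hashed, other fields untouched.
released = [
    {"id": pseudonymise("Alice Example"), "postcode": "AB1", "diagnosis": "asthma"},
    {"id": pseudonymise("Bob Sample"), "postcode": "CD2", "diagnosis": "diabetes"},
]

# Hypothetical list of likely names, e.g. taken from a public directory.
candidates = ["Alice Example", "Carol Placeholder", "Bob Sample"]

recovered = reidentify([record["id"] for record in released], candidates)
for hashed_id, name in recovered.items():
    print(hashed_id[:12], "->", name)
```

Because the same input always produces the same digest, the hash acts as a stable label rather than a secret. That is the weakness this sketch is meant to show.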

Poorly pseudonymised data is vulnerable to re-identification, sometimes even by people who do so unwittingly, simply because they have some special knowledge of the individuals involved and can recognise them from the details supplied.

One of the most notorious examples involved Netflix, which 10 years ago released “anonymised” ratings data as part of a competition to improve its recommendation algorithm. Researchers were quickly able to match ratings given on Netflix with other sources such as IMDb, and identify individuals. This led to a US class action lawsuit, which was joined by an “in-the-closet lesbian mom”, who claimed she risked being outed by Netflix.
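
The general idea behind this kind of matching, often called a linkage attack, can be sketched in a few lines of Python. This is a deliberately simplified illustration that counts exact overlaps between ratings; it is not the method the researchers used, and the user IDs, profile names, film titles and the `best_match` helper are all made up for the example.

```python
from typing import Dict, Optional, Set, Tuple

Rating = Tuple[str, int]  # (film title, star rating)


def best_match(
    target: Set[Rating],
    public: Dict[str, Set[Rating]],
    min_overlap: int = 3,
) -> Optional[str]:
    """Return the public profile sharing the most identical ratings with
    the target, or None if the best overlap is below min_overlap."""
    best_name, best_score = None, 0
    for name, ratings in public.items():
        score = len(target & ratings)  # ratings present in both sets
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= min_overlap else None


# Hypothetical "anonymised" release keyed by opaque user IDs.
anonymised = {
    "user_0417": {("Film A", 5), ("Film B", 2), ("Film C", 4), ("Film D", 1)},
    "user_0932": {("Film A", 3), ("Film E", 5)},
}

# Hypothetical public reviews posted under recognisable profile names.
public_profiles = {
    "jane_reviews": {("Film A", 5), ("Film B", 2), ("Film C", 4)},
    "film_fan_77": {("Film E", 4), ("Film F", 2)},
}

matches = {uid: best_match(r, public_profiles) for uid, r in anonymised.items()}
print(matches)
```

Here the first opaque ID is linked to a named profile because three of its ratings coincide exactly, while the second stays unmatched. A handful of overlapping data points can be enough to single someone out, even though no name appears in the released data.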

Earlier this month, it was announced that German researchers had been able to purchase “anonymous” browsing histories of 3m citizens and identify most of them, as many included a social media handle, which could then be linked to a real person.

In both cases, the re-identification risk was highlighted by the data equivalent of an ethical hacker and was undoubtedly a good deed, as it called out the irresponsible actions of others. Even though the government says it will protect whistleblowers, the threat of criminal sanctions may put off many similar ethical projects and give a free pass to poorly anonymised data.

The protection of whistleblowers should go further and the government should implement a notification system to the ICO so that researchers can easily receive advance clearance for re-identification projects and be shielded from the risk of prosecution.

More also needs to be done to define what the government means by “anonymised” and “pseudonymised”. For example, would it be a criminal offence to identify the individual behind a Twitter handle, or the person who leaves an abusive comment on a blog post?

Equally, there needs to be a minimum standard of pseudonymisation that will be required to earn the protection of this new law. The current guidelines are incredibly vague and loose. The onus here is on the ICO to lay out clear tests that anonymisation must pass before it can be legally considered non-personal data.

If data is the fuel of the economy, we need to view it as critical infrastructure and do more to secure it. Just as new technology can help attackers re-identify data, new technology is available that prevents re-identification or, as we are developing, ensures that the data can be used for analysis without the need to release it in the first place and risk re-identification.

The combination of a better understanding of re-identification techniques and the technology now available to prevent them means that poorly anonymised data should become a thing of the past, and its release without the appropriate technical protection should be a criminal act.