Anonymising big data - the struggle to make complex encryption simpler

As we wait for the holy grail of homomorphic encryption, Computing finds out what researchers are doing to fill the gap

In the days when data mostly sat in proprietary formats in proprietary silos it was generally more trouble than it was worth to cross-match datasets to profile an individual in any depth.

But the number of databases containing some part of our lives grows by the day, and with big data technologies able to bulldoze straight through distinctions of format or intended purpose, piecing together a detailed, nuanced personal profile from disparate datasets is (relatively speaking) simplicity itself. A lot of data doesn't even touch the sides these days: it's streamed in, analysed on the fly, then discarded or archived.

This is all fantastic for marketers keen to build a 360-degree real-time picture of customers, and for the intelligence agencies who piggyback on the technologies. It is less good for individuals worried about their privacy. Technology, it seems, is diametrically opposed to this fading ideal.

It is also a double-edged sword for researchers in medicine and other fields, for whom the current analytical techniques provide valuable insight into epidemics, rare diseases, demographic shifts, disaster response and other areas. The trouble is, the most valuable data is frequently the most sensitive - the hospital visit, the financial record, the inherited trait - information which, by law, must be protected. Protection is getting harder and harder.

So how can the utility of personal data to scientists, planners, health services and businesses be maintained while at the same time preserving the individual's right of privacy and control over their data? This is a particularly tricky nut to crack, which doesn't mean that people aren't trying. Computing spoke to three of them.

"In the past, people used to think about individual datasets in isolation," said Frank Wang, PhD student at MIT and co-creator of Sieve, a new platform that selectively exposes user data to web services according to privacy settings.

"However, the problem now is that there are so many data collected on each individual for the purposes of ad targeting and recommendations, it makes it very difficult to anonymise any dataset. In fact, because of this, there is a major trade-off in accuracy versus privacy when it comes to anonymisation."

Jessica Santos, global compliance and quality director at Kantar Health, believes that anonymisation may soon reach its practical limits.

"At the moment, most datasets are noisy and biased and require advanced data analytical skills and intuition to merge, so anonymised data is still being used," she said, adding that as the number of sources increases "strict academic anonymity" may not be possible.

What's in a name?
In the NHS's controversial care.data scheme - in many ways a perfect example of the tensions between anonymity and research utility - medical records are pseudonymised, that is, personally identifiable information (PII) such as house number and surname is replaced by a hash or token.
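
As a rough illustration - the field names and secret key here are hypothetical, not taken from care.data - tokenisation of this kind can be as simple as replacing the identifying fields with a keyed hash:

```python
import hmac
import hashlib

# Hypothetical secret key; a real scheme would store and rotate this securely.
PSEUDONYMISATION_KEY = b"replace-with-a-securely-stored-secret"

# Fields treated as personally identifiable in this toy record layout.
PII_FIELDS = {"surname", "house_number", "postcode"}

def pseudonymise(record: dict) -> dict:
    """Replace PII fields with a keyed hash (token); leave everything else intact."""
    out = {}
    for field, value in record.items():
        if field in PII_FIELDS:
            token = hmac.new(PSEUDONYMISATION_KEY, str(value).encode(), hashlib.sha256)
            out[field] = token.hexdigest()[:16]  # shortened token for readability
        else:
            out[field] = value
    return out

patient = {"surname": "Smith", "house_number": "42", "postcode": "SW1A 1AA",
           "diagnosis": "fractured wrist", "visit_date": "2016-03-01"}
print(pseudonymise(patient))
```

Note that the same surname always maps to the same token, so records remain linkable across datasets - exactly the property that, as discussed below, can bring pseudonymised data back within the scope of data protection law.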

However, since cross-matching data by linking the non-redacted fields grows ever easier, and because it is impossible to know in advance what datasets might be released in the future that allow such re-identification, there have been moves to classify such pseudonymised datasets as PII, bringing them within the scope of data protection law.

"Where pseudonyms are linkable they are considered personal data," explained Jason du Preez of privacy company Privitar. "If you can prove that there is no linkability to a dataset then you have grounds to remove your case from the scope of the GDPR."

However, he said, the GDPR (the new EU General Data Protection Regulation) gives researchers working in areas deemed to be in the common good a freer rein.

"Over time many aspects of GDPR interpretation will become contextual, strongly influenced by the perceived benefit to society," he predicted.

For researchers in the health sector, Santos recommended following the US HIPAA guidelines.

"HIPAA provides a list of 18 identifiers, which is very handy for researchers - just check the list and remove the identifiers."

However, she continued, the rarity of some medical conditions makes them personal identifiers in themselves, meaning they should be treated as PII. David Davis MP has said it would be easy to identify him from his medical record as he has broken his nose five times.
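
The "check the list and remove" step Santos describes amounts to little more than filtering out the listed fields, although, as the David Davis example shows, suppressing listed identifiers does not remove the risk posed by rare attributes. The field names and the deliberately partial identifier list below are illustrative only, not a substitute for the HIPAA Safe Harbor guidance itself:

```python
# A few of HIPAA's 18 Safe Harbor identifiers, mapped to illustrative field names;
# the full list also covers dates, device identifiers, biometrics and more.
HIPAA_IDENTIFIER_FIELDS = {
    "name", "address", "phone_number", "email",
    "social_security_number", "medical_record_number", "ip_address",
}

def strip_identifiers(record: dict) -> dict:
    """Drop any field whose name appears on the identifier list."""
    return {k: v for k, v in record.items() if k not in HIPAA_IDENTIFIER_FIELDS}

record = {"name": "J. Bloggs", "email": "j.bloggs@example.com",
          "condition": "recurrent nasal fracture", "year_of_visit": 2016}
print(strip_identifiers(record))  # only 'condition' and 'year_of_visit' survive
```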

Legal processes lag way behind technology, so shouldn't technologists take it upon themselves to lead the way in ensuring that data can be of value to businesses and researchers without encroaching dangerously on privacy, as suggested recently by Raffael Strassnig, a data scientist at Barclays? Yes, Wang said, but it's something of a minefield.

"It's tricky because it is very difficult to fully anonymise data and full anonymisation will sometimes make data not as useful. It is important to enforce anonymisation of very sensitive data like medical records and financial information, but I think it's very important for companies to think about the privacy implications of how they are using their users' data."

Santos believes that, in the face of such complexity, transparency rather than technology is the best way forward.

"Regulatory bodies are not offering clear guidelines on enforcements for lack of anonymisation or any cases as such," she said. "Personal data can be processed and used. We shouldn't - or couldn't - stay away from it. I am a strong believer that transparency will be the new privacy solution."

Homomorphic encryption - the holy grail

Homomorphic encryption, which allows computations to be performed on data without having to decrypt it first, is the answer to many a researcher's prayer, but like practical nuclear fusion, it always seems to be a few years off.
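
The idea is easiest to see with a partially homomorphic scheme such as Paillier, which supports addition on ciphertexts. The sketch below assumes the open-source python-paillier (phe) package and, being a toy, sidesteps the performance problems described next:

```python
# Additive homomorphic encryption with Paillier, via the python-paillier (phe) package.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# A researcher receives only the encrypted readings...
readings = [12.5, 7.3, 9.9]
encrypted = [public_key.encrypt(x) for x in readings]

# ...and can still compute their sum without ever seeing the plaintexts.
encrypted_total = encrypted[0] + encrypted[1] + encrypted[2]

# Only the data owner, holding the private key, can decrypt the result.
print(private_key.decrypt(encrypted_total))  # 29.7, give or take floating-point error
```

Paillier only supports addition and multiplication by plaintext constants; the fully homomorphic schemes that allow arbitrary computation on encrypted data are the general-purpose solutions still thought to be years away.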

"There is a long road ahead for homomorphic encryption," Wang said, sweetening the pill a little by adding: "but the research is really ramping up".

"The principal hurdle to broad adoption at large scale is computational cost," added du Preez. "It is extremely computationally expensive. So while we are seeing some constrained application in areas where this is not an issue, a practical general-purpose solution is still very distant."

This is exacerbated by an absence of optimised hardware and software support, which makes encrypting files very slow, explained Wang, adding:

"People don't quite understand the power of homomorphic encryption, and as a result they are hesitant to use it. On top of that, there are not many tools available for companies to easily do homomorphic encryption."

Wang estimated that it will take at least 10 years for such a general-purpose homomorphic solution to become available.

Making complex encryption simpler

Since the homomorphic holy grail remains frustratingly out of reach, what other approaches are there?

Santos again prescribed the sunshine of transparency.

"If anonymising is not a guarantee, treat the dataset as personal data," she urged. "Using practices like gaining consent, applying transparency or notification, considering potential harm, and limiting third-party transfer are all well documented and widely used in the healthcare research field."

Wang described a semi-manual approach.

"A common technique is to monitor results on dataset queries and only release them if they don't reveal too much information. However, the interpretation of 'too much information' is left up to the owner of the dataset."

Other methods include differential privacy techniques, an area where Privitar is doing much of its work.

"Methods for providing query responses with differential privacy guarantees are well advanced in the research community but not yet widely used commercially," du Preez said.

These techniques involve adding a small amount of noise to the dataset, which is negligible for large-scale querying; however, "at small sample sizes the noise overwhelms the signal, making it impossible to violate individual privacy".

He continued: "Differential privacy provides provable privacy guarantees even in the face of arbitrary attacker background knowledge, in contrast to the controlled release of individual records through publishing methods."

Summing up, Wang said: "There are many promising approaches at the research stage. We are only touching the tip of the iceberg."

He continued: "Not all data is treated equally. In other words, not all data suffer from the same risk. We are learning how to apply and create new encryption methods that apply to a variety of data, that protect different aspects of the data like anonymity, privacy and confidentiality. The most promising thing is we are starting to figure out how to make more complex encryption methods simpler."
