Data aggregation: how it can break privacy

What do you consider to be personal identifiers?

Most of us think first of our name, but entering it into a search engine shows that even uncommon names are shared quite widely.

More specific identifiers are usually those issued to us by the state: in Europe, insurance, health service or identity card (passport) numbers; in the US, the social security number. Does that mean, though, that once such specific identifiers have been removed from personal data, the remaining data need no longer be safeguarded as carefully (or at all)?

Aggregation of data

Back in the days when current European data protection legislation was being drafted, personal identifiers meant quite a small set of data which were very specific to the individual. This was because it was very difficult to combine other information with a record and so render the data more specific to an individual.

On its own, a date of birth does not provide much of a clue as to who you are: with around 2,000 births in the UK every day, it only narrows the field to a couple of thousand people. Add the county or country of your birth, and the numbers narrow further. Add your current town of residence, and it should not be hard to pinpoint an individual, given that additional information. Back then, though, collating it was very difficult, even for government computer systems.

These days we all have access to much more information which can readily identify an individual. The UK Information Commissioner’s Office illustrates this neatly in its code of practice for anonymisation, using postcodes. Taking an example postcode of, say, SO2 1AB:

  • SO would be specific to around 194,000 households
  • SO2 would narrow to around 8,600 households
  • SO2 1 would narrow to around 2,600 households
  • SO2 1A would narrow to around 120-200 households
  • SO2 1AB covers between one and 15 households.

Equipped with a full postcode and relatively little other non-personal information, such as age group and gender, it becomes remarkably easy to turn so-called anonymised data into explicitly personal data.
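The narrowing effect can be sketched in a few lines of Python. This is only an illustration: the records below are invented toy data, and the function name unique_count is a hypothetical helper rather than anything taken from the ICO code of practice. It counts how many records become unique once the postcode is kept at greater precision alongside gender and age group.

```python
from collections import Counter

# Hypothetical toy records: (postcode, gender, age_group). Not real data.
records = [
    ("SO2 1AB", "male", "50-59"),
    ("SO2 1AB", "female", "30-39"),
    ("SO2 1AC", "male", "50-59"),
    ("SO2 1AC", "male", "50-59"),
    ("SO2 2XY", "female", "30-39"),
    ("SO3 4CD", "male", "50-59"),
]


def unique_count(records, prefix_len):
    """Count records whose (truncated postcode, gender, age group) is unique."""
    # Keep only the first prefix_len characters of each postcode.
    keys = [(pc[:prefix_len], gender, age) for pc, gender, age in records]
    sizes = Counter(keys)
    # A record is "unique" if nobody else shares its combination.
    return sum(1 for key in keys if sizes[key] == 1)


for prefix_len in (2, 3, 5, 7):
    print(prefix_len, unique_count(records, prefix_len))
```

With only the two-letter area retained, no record in this toy set stands alone; with the full postcode, most of them do. Real datasets are far larger, but the same pattern applies.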

It is true that, under UK law at least, personal data includes not only data from which a living individual can be identified directly, but also data which could lead to identification when combined with other information already in the possession of the data controller, or likely to come into their possession.

Given the number of different organisations and services which obtain data from us, and the established sharing of, and trade in, such data, it is relatively easy for someone to aggregate data which is essentially uncontrolled – as it does not itself contain personal identifiers – to enable identification of the great majority of individuals.
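As a rough sketch of how such aggregation works, the Python below joins an 'anonymised' dataset to a second dataset that carries names, using only postcode, gender and age group as the matching key. Everything here is assumed for illustration: both datasets are invented, and the names link and QUASI are hypothetical.

```python
from collections import defaultdict

# Hypothetical 'anonymised' records: no names or state-issued identifiers.
anonymised = [
    {"postcode": "SO2 1AB", "gender": "male", "age_group": "50-59",
     "habit": "buys toothpaste once a month"},
    {"postcode": "SO2 1AC", "gender": "female", "age_group": "30-39",
     "habit": "buys toothpaste twice a month"},
]

# Hypothetical second dataset, obtained elsewhere, which does carry names.
named = [
    {"name": "Person A", "postcode": "SO2 1AB", "gender": "male", "age_group": "50-59"},
    {"name": "Person B", "postcode": "SO2 1AC", "gender": "female", "age_group": "30-39"},
    {"name": "Person C", "postcode": "SO2 1AC", "gender": "female", "age_group": "30-39"},
]

# Fields that are not identifiers on their own but can be matched on.
QUASI = ("postcode", "gender", "age_group")


def link(anonymised, named):
    """Return (name, habit) pairs where exactly one named person matches."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in QUASI)].append(person["name"])
    linked = []
    for record in anonymised:
        candidates = index.get(tuple(record[k] for k in QUASI), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            linked.append((candidates[0], record["habit"]))
    return linked


print(link(anonymised, named))
```

In this toy case the first record is tied to a single named person, while the second stays ambiguous only because two people happen to share the same combination.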

Safety in numbers

There is one very effective way of ensuring anonymity within data, and that is to pool it to an extent sufficient to make it impossible to separate out individuals. So instead of a company sharing individual anonymised records, such as
SO2 1A / male / 50-59 yrs / buys toothpaste once a month
those records must be pooled and summarised:
SO / males / 50-59 years / 51% buy toothpaste once a month.

Provided that the pools used are large enough to prevent individual data cells from getting too small, it becomes extremely difficult to aggregate sufficient data to identify individuals. This is why in most official statistics you now see small numbers given as ‘less than 10’ or similar: if the actual figures were given, they could assist re-identification.
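A minimal sketch of pooling with small-cell suppression might look like the following Python. The threshold of 10, the function name pool and the sample data are all assumptions made for illustration, and taking the first two characters of the postcode as the area is a simplification, since some UK postcode areas have only one letter.

```python
from collections import defaultdict

MIN_CELL = 10  # assumed suppression threshold, echoing 'less than 10'


def pool(records, min_cell=MIN_CELL):
    """Summarise records by (postcode area, gender, age group).

    Each record is (postcode, gender, age_group, buys_monthly), where
    buys_monthly is a bool. Pools smaller than min_cell are suppressed.
    """
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for postcode, gender, age_group, buys_monthly in records:
        key = (postcode[:2], gender, age_group)  # assumes a two-letter area
        totals[key] += 1
        buyers[key] += int(buys_monthly)
    summary = {}
    for key, n in totals.items():
        if n < min_cell:
            summary[key] = f"less than {min_cell}"  # hide the small cell
        else:
            share = round(100 * buyers[key] / n)
            summary[key] = f"{share}% buy toothpaste once a month"
    return summary


# Hypothetical sample: 12 men aged 50-59 (6 of them monthly buyers), 3 women aged 30-39.
sample = (
    [("SO2 1AB", "male", "50-59", i < 6) for i in range(12)]
    + [("SO3 4CD", "female", "30-39", True)] * 3
)
summary = pool(sample)
print(summary)
```

This sketch suppresses only on the size of the pool. Fuller disclosure control would probably also need to check the counts within each pool, since a very small number of buyers in a large pool could still be revealing.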

As with its other publications, the UK Information Commissioner’s Office guidance on this is excellent and very up to date. But it is only guidance, and it only applies within the UK. And such guidance only works when it is followed.

Ensuring privacy

Sometimes data aggregation leads to inadvertent loss of anonymity. Organisations which are respecting the law and following the guidance will then know how to deal with the situation, and no problem should arise.

But what happens when you have consented to the release of ‘anonymous’ data by an international organisation, which results in those data becoming fairly readily available to companies operating in states which do not have such good protection of privacy? There is then nothing to stop an overseas company aggregating your data to enrich its specificity, and selling it on to someone else, outside the protection of EU laws.

The fundamental problem is that under current antiquated law, data which has had personal identifiers removed from it ceases to have appropriate legal protection, even though it can be (ab)used to enable personal identification. The only data which is now (relatively) safe to deal with in the absence of legal protection is that which has been correctly pooled, to avoid the appearance of low cell numbers.

Therefore the definition of personal data needs to include any and all data which, by data aggregation, could lead to identification: in practice, almost all data which are unpooled. All data for individuals or small groups, whether or not they contain personal identifiers, need to be handled with the same care as personal data.