Who was that masked man anyway?

Whenever you visit the doctor or hospital, a lot of personal data is likely to be recorded, alongside details of your condition, the treatment you received, the drugs prescribed, and so on. At some point, you probably signed a consent form, one clause of which allowed your data to be analyzed and shared, for research purposes for example, but only in a way that couldn’t identify you personally. But just how “anonymous” is this data?

In the USA, about two thirds of states share or sell “de-identified” medical data, stripping out directly identifying personal data and masking or suppressing other fields. And yet a re-identification study showed that, in a slightly staggering 43% of cases, it was still possible to match publicly available Washington State medical data to a named individual. How? It turns out to be far less tricky than you might imagine.

In the Washington State case, it was done simply by correlating the medical data with newspaper stories that reported details of hospital admissions (after accidents, muggings, and so on). It’s a type of inference attack known as a linkage, or combination, attack. One HMO case, referenced in Phil Factor’s recent article, became famous largely because a similar tactic allowed an attacker to identify the medical records of the Governor of Massachusetts.
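To make the idea concrete, here is a minimal sketch of a linkage attack in Python. The data, names, and column choices are entirely invented for illustration; the point is simply that “de-identified” records and a public source are joined on the quasi-identifiers they share.

```python
# A minimal, hypothetical sketch of a linkage (combination) attack.
# The "de-identified" medical rows keep no names, but the quasi-identifiers
# they retain (admission date, ZIP code, age) also appear in news reports.

deidentified_visits = [
    {"admit_date": "2014-06-03", "zip": "98101", "age": 34, "diagnosis": "fractured femur"},
    {"admit_date": "2014-06-03", "zip": "98115", "age": 61, "diagnosis": "concussion"},
]

news_reports = [
    {"name": "J. Smith", "admit_date": "2014-06-03", "zip": "98101", "age": 34,
     "story": "cyclist injured in collision"},
]

# Join the two sources on the shared quasi-identifiers.
for visit in deidentified_visits:
    for report in news_reports:
        if (visit["admit_date"], visit["zip"], visit["age"]) == \
           (report["admit_date"], report["zip"], report["age"]):
            print(f"{report['name']} -> {visit['diagnosis']}")
```

No single field here identifies anyone; it is the combination, matched against an outside source, that does the damage.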

Sometimes, re-identifying supposedly anonymized data doesn’t even require auxiliary or background data. One published paper uses publicly available healthcare (HCUP) data, from which identifying fields had been removed and in which all aggregated values based on fewer than ten contributing rows were suppressed. It demonstrated how, by executing various combinations of queries against the available data and applying some simple algebra, one could place rather narrow “bounds” on the possible values of the suppressed data. In one case, it concluded that exactly one Native American woman, diagnosed with ovarian cancer, was discharged by New Jersey hospitals in 2009, along with a lot of other detail about treatment costs, location, and so on.
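The algebra involved really is simple. Here is a hedged sketch, with invented numbers rather than figures from the paper: a non-suppressed total, combined with the non-suppressed cells of a breakdown, forces the value of the suppressed cell.

```python
# A hedged sketch, with invented numbers, of how simple algebra can defeat
# cell suppression. Assume the rule described above: any aggregate based on
# fewer than ten contributing rows is withheld, so a hidden count is in [1, 9].

# Query 1: statewide total of matching discharges (large, so not suppressed).
total = 1_200

# Query 2: the same total broken down by race; one small cell is suppressed.
by_race = {"White": 900, "Black": 250, "Asian": 49, "Native American": None}

known = sum(v for v in by_race.values() if v is not None)
hidden = total - known  # the breakdown must sum to the total

assert 1 <= hidden <= 9  # consistent with the suppression rule
print(f"The suppressed cell must be exactly {hidden}")  # -> exactly 1
```

With more suppressed cells, the same approach yields bounds rather than exact values, but issuing further overlapping queries keeps narrowing them.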

Healthcare is just one example. By becoming increasingly willing, or obliged, to make data public, or just freely accessible within our organizations, are we unwittingly breaching security? The EU, as the GDPR requirements show, is obviously alarmed at how easily privacy experts can unpick ‘masked’ data. We also have the nagging problem of allowing widespread ‘self-service’ access to data within the organization, for ad-hoc reporting. Hopefully, we tie down direct access to base tables in the database, but a BI expert would find an inference attack an easy technique to learn. If your payroll system allows aggregate reporting, they may already know individual salaries and other personal information.
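For instance, here is a hedged sketch of the classic “differencing” attack against aggregate-only reporting, using an invented payroll table and figures: two innocent-looking aggregate queries, differenced, yield one person’s salary.

```python
# A hypothetical differencing attack on aggregate-only payroll reporting.
# Neither query returns an individual row, yet together they reveal a salary.

payroll = [
    {"name": "Alice", "dept": "Sales", "salary": 52_000},
    {"name": "Bob",   "dept": "Sales", "salary": 48_000},
    {"name": "Cara",  "dept": "Sales", "salary": 75_000},
]

# Query 1: total salary for the Sales department.
sum_all = sum(p["salary"] for p in payroll if p["dept"] == "Sales")

# Query 2: the same total, minus one person, excluded via some
# harmless-looking predicate (start date, job title, office...) that the
# attacker knows singles her out.
sum_without = sum(p["salary"] for p in payroll
                  if p["dept"] == "Sales" and p["name"] != "Cara")

print(f"Cara's salary: {sum_all - sum_without}")  # -> 75000
```

This is why serious aggregate-reporting systems restrict small groups, or add noise, rather than relying on the absence of row-level access.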

The task of keeping data secure and protected within an organization is a lot more challenging than it might first appear. Data masking is the right way to ‘anonymize’ your data, but if you mask only some of it, you still need proper data security and access control. Otherwise, the data left exposed, like chinks in a mask, can reveal tell-tale signs of its owner’s identity.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.