Just in time for Halloween: Masks, Ghosts, and open data!
At the recent FSRDC (Federal Statistical Research Data Center) conference in the US, a presentation was given on how secure data can be released while preserving confidentiality. While the US doesn’t have any solutions yet, it’s worth bringing this discussion to Canada: now that we have the ability to preserve and analyze administrative data, we need to decide whether or not administrative data should be available. The fact that it hasn’t been available in the past for infrastructure reasons is not a good reason for non-access to be the default now. (It’s worth noting that data on individuals in Scandinavian countries are often extremely rich, and that a lot of research gets done on these populations simply because they’re the only ones with available data.)
I want to open with some ethical frameworks that apply to data availability and comment on them. Open data is gaining momentum. It’s the commitment by governments to make data open by default (unless there’s a good reason not to) and to release it in a timely manner for analysis. The goal of this initiative is to detect and dismantle corruption (a goal often applied to the third world, but also applicable here) and to make policy analysis and policy-relevant research easier (with the associated public benefit).
People are sometimes concerned about privacy, and this concern is concentrated amongst older adults (someday soon I’ll write about the massive differences in trust levels for the government that exist between generations). The ethical concern that often applies to these decisions is that “no one’s data should be usable against them.” Take the example of someone walking into a car lot. While you may not have bad credit, if the dealer has access to the EI rolls and can find you on them, they may decline a car loan for reasons that only exist because your data were made available. This is also a major concern for court data. Even de-identified data (data with all your personal information, e.g. name, SIN, address, etc., removed) can be re-identified if they are sufficiently rich. For example, knowing my postal code and the dates of a few days that I took off from work for medical appointments would likely permit my employer to search for and find my medical records, potentially identifying a disease I’d rather they not know about, and they wouldn’t need to search by name, address, or SIN to discover it. So what are the solutions?
The easy one is to keep data private. Nobody has access to the data and thus nobody can be hurt by it. This is what is meant when people claim that “the only way to guarantee privacy is to prohibit data access”. But this neglects the potential benefits of having data accessed. The economist in all of us knows that the marginal cost is measured in privacy lost, and the marginal benefit in research and productivity improvements. At this juncture it seems that even releasing some unconditional averages or medians (which should entail no cost in privacy) would have at least some benefit in terms of keeping governments accountable.
At the other end of the spectrum, all data would be available and open. At this end the privacy costs are immense, and the benefit of having every bit of data available is likely to be minimal beyond a certain level of openness.
Masked reporting: Suppose we put together a medical database and we want to put data into the public sphere. We have a series of binary variables (yes/no) that identify conditions. Can we let people calculate the incidence of disease without knowing which people have these diseases? The idea is simple: we introduce noise into the data so that there is a substantial chance that the information for any given individual is false. Consider the following. For each possible question, flip a coin. If it turns up tails, tell the truth. If it turns up heads, flip another coin and say yes if it comes up heads and no if it comes up tails. This gives us the following reported matrix:
        Truth (50%)    Coin flip (50%)
YES     δ              50%
NO      (1 - δ)        50%
With 50% probability the report is truthful: δ percent (the incidence of the disease) will reveal a true yes, and (1 - δ) percent will reveal a true no. With 50% probability the report is a coin flip: a “false yes” or a “false no” (these may happen to be true or false, but we don’t care at this point).
From this I can back out the true incidence (δ) by subtracting the fake yeses from the value we see in the data and rescaling for the half of the answers that are truthful (in typical fashion I've given "yes" a value of one in the data and "no" a value of zero).
E(H) = (0.5)(0.5)(1) + (0.5)(δ)(1)
E(H) = 0.25 + 0.5δ
2(E(H) - 0.25) = δ
(this is just twice the gap between the mean you calculate from the fake data and 0.25)
I can do a similar trick with the standard errors of my mean by accounting for the fake variance I added in.
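To make the arithmetic concrete, here is a minimal simulation sketch of the two-coin scheme in Python. Everything in it is an assumption for illustration: the function names, the sample size, and the 10% true incidence are made up, and the standard error is obtained by rescaling the standard error of the masked mean, which is one simple way to account for the added noise. Because half of the answers are truthful and a quarter are random yeses, the masked mean is 0.25 + 0.5δ, so the sketch recovers δ as twice (masked mean - 0.25).

```python
import math
import random


def mask_answer(truth: bool, rng: random.Random) -> bool:
    """Return one respondent's reported answer under the two-coin scheme."""
    if rng.random() < 0.5:
        # First coin is tails: report the truth.
        return truth
    # First coin is heads: a second coin decides yes (heads) or no (tails).
    return rng.random() < 0.5


def estimate_incidence(reports):
    """Back out the incidence and a rough standard error from masked reports."""
    n = len(reports)
    masked_mean = sum(reports) / n           # "yes" counts as 1, "no" as 0
    delta_hat = 2 * (masked_mean - 0.25)     # undo the noise: E(H) = 0.25 + 0.5*delta
    # Rescaled standard error of the masked mean (assumption: simple random sample).
    se = 2 * math.sqrt(masked_mean * (1 - masked_mean) / n)
    return delta_hat, se


# Hypothetical population: 200,000 people, 10% of whom truly have the condition.
rng = random.Random(0)
true_delta = 0.10
truths = [rng.random() < true_delta for _ in range(200_000)]
reports = [mask_answer(t, rng) for t in truths]

delta_hat, se = estimate_incidence(reports)
print(f"estimated incidence: {delta_hat:.3f} (s.e. {se:.4f})")
```

Note that the standard error here is roughly three times what it would be with unmasked answers at this incidence, which is the precision cost of the deniability.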
The interesting part of this is that even if I can track down an individual in the data there’s a huge element of deniability for said individual. There’s a 50% chance that the answer I see is just random noise.
Now, I can’t calculate a regression coefficient from this masked data since the correlation between x’s and y’s is part arbitrary and part random (with no way to reliably disentangle the two for a good statistical analysis). So this is not an ideal solution for everyone.
Ghosts: Alternatively, with de-identified data, suppose I strip half the records out of the data I provide. Now if someone goes looking for an individual and thinks they have identified them, they need to be certain that NOBODY ELSE, including the people who were removed, has those characteristics. This requires a lot of knowledge from the would-be re-identifier and is much less likely to result in a data breach. All it costs me is some sample size, and if we’re talking about administrative data, this cost will be trivial for most analyses. I can theoretically combine this with masked reporting to further increase security (although with the caveat that I lose precision and the ability to infer from correlations). This is all leading to how we can get interested researchers access to data without ever over-compromising on privacy issues.
My ghosts aren’t like your ghosts: The most interesting part of database privacy is that it’s really hard to keep outliers private. When we’re dealing with maximums, it becomes almost impossible to keep people’s identity private. For example, up until recently, Liliane Bettencourt was the world’s richest woman. Conditioning only on “female”, it would be fairly easy to find her in the data and use her data against her. In fact, CEOs of public companies in general would be pretty easy to find given earnings, since this information is available publicly (with the opportunity to exploit this data for personal enrichment) and often sits in a very sparse area of the earnings distribution. Indeed, the biggest risk of privacy breaches comes from the ability of would-be re-identifiers to combine multiple sources of data. If we are unwilling to let anyone be identified then it’s clear we need to cut out the top few percentiles of earners (and any other individuals who exist in a sparse part of a distribution). Naturally this makes research involving outliers tough, but there’s a lot of information and interesting policy questions in the data of the 5th to 95th percentiles of continuous variables.
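As a rough sketch of what cutting out the sparse tails could look like, the Python snippet below keeps only records between the 5th and 95th percentiles of one continuous variable. The function name, the record layout, and the toy earnings data are all hypothetical, and a real release would have to consider sparse regions in several variables at once, not just one.

```python
import statistics


def trim_to_percentiles(records, key, n_cuts=20):
    """Keep records whose `key` value lies between the first and last cut point.

    With n_cuts=20 the cut points are the 5th, 10th, ..., 95th percentiles,
    so this keeps roughly the 5th-to-95th percentile range.
    """
    values = [r[key] for r in records]
    cuts = statistics.quantiles(values, n=n_cuts)
    low, high = cuts[0], cuts[-1]
    return [r for r in records if low <= r[key] <= high]


# Toy data (made up): 1,000 ordinary earners plus one extreme outlier.
records = [{"earnings": e} for e in range(1, 1001)] + [{"earnings": 10**9}]
trimmed = trim_to_percentiles(records, "earnings")
print(len(records), "records before trimming,", len(trimmed), "after")
```

The outlier disappears from the released data, at the cost of also dropping the ordinary records in both tails.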
So what’s my proposed solution? It looks kind of like Canada’s data environment now, but with some carry-over to the US system:
Canadian researchers would have access to top-coded administrative data, on their own machines, linked in whatever way they wanted. (This linking would be done privately, the same way it is now, provided their research grants paid for the linkages.) If we had a prosecutorial arrangement with other countries that guaranteed that we could punish those who abuse their access, this could be extended to international researchers too. Each researcher would get their own “version” of the data with a TBD fraction of ghosts that are unique to them. Applications for this access would be subject to background checks similar to what happens with the RDC network now, and researchers would face strict penalties for breach of the public trust should they re-identify the data or share their data with another researcher. Requests for data would have to posit a desired level for ghosts, with anything that seems extreme heavily justified and open for review (as someone who has studied rare conditions, I know what it’s like to have admin data on Canada’s largest province and a sample size of 195 individuals).
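One way the researcher-specific ghosts could be produced is sketched below in Python. This is only an illustration under my own assumptions: the record IDs, researcher IDs, and the agency-held secret are hypothetical, and hashing is just one convenient way to make each researcher's set of ghosts reproducible and different from everyone else's.

```python
import hashlib


def is_ghost(record_id, researcher_id, ghost_fraction,
             secret="hypothetical-agency-secret"):
    """Decide, reproducibly, whether a record is withheld from a researcher."""
    token = f"{secret}|{researcher_id}|{record_id}".encode()
    digest = hashlib.sha256(token).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    return u < ghost_fraction


def researcher_extract(records, researcher_id, ghost_fraction=0.5):
    """Return the researcher's version of the data with their ghosts removed."""
    return [r for r in records
            if not is_ghost(r["id"], researcher_id, ghost_fraction)]


# Toy data (made up): 10,000 de-identified records, two researchers.
records = [{"id": f"person-{i}"} for i in range(10_000)]
extract_a = researcher_extract(records, "researcher-A", ghost_fraction=0.5)
extract_b = researcher_extract(records, "researcher-B", ghost_fraction=0.5)
print(len(extract_a), len(extract_b))
```

Because the ghosts depend on the researcher ID, two researchers who compare notes cannot assume that a person missing from one extract is missing from the other, and the agency can regenerate any extract without storing it.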
The beauty of this proposal is that it solves (or at least removes a big hurdle from the solution to) a few problems trivially:
1. Collective data-mining: I will talk about this in a future post, but it goes back to an old concept: collectively, an entire community of academics working on a dataset will have essentially done the same thing as one individual running endless regressions looking for p<=0.05, sometimes called p-hacking (like a much less poetic version of 1000 monkeys with 1000 typewriters, only involving stressed assistant professors looking over their shoulders at tenure clocks). The statistical theorists and machine learning folks have been advocating for splitting up data for years.
2. Replicability: If everyone can get access to the same databank (albeit with different large samples), there’s no reason why I can’t take my version of the data used in a paper, run the same analysis, and get some idea of whether the result is due to noise (or even something more sinister, which is frankly unlikely). This is a serious problem. An attempt to replicate economic studies in the US showed that fewer than half could be recreated, which is in line with other social science disciplines. This replication crisis, which started in psychology, is pandemic within the social and natural sciences for various reasons. If everyone has fairly ready access to admin data it becomes easier to do replication exercises (although the institutional barriers to research replication still exist re: funding, publishing).
As for open data, the public would have access to averages and conditional averages using masking techniques with sufficiently low precision (think accuracy to the nearest hundredth for a percentage) to avoid disclosing the entire data through some brute force reconstruction techniques.
So what do you think: would you make your data available under these circumstances? Why or why not?
If you’re interested in this topic, there’s a free book here from one of the pioneers in the field of “differential privacy”. Not only is she more interesting than me, she's much better versed in the topic :)