8.1. What is randomized response?
If someone, or software on their behalf, adds some randomness to their data, the aggregate data from many users might still be quite useful, even if individual responses are not accurate. This can be used, for example, to give survey respondents plausible deniability. More on randomized response here.
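As a rough illustration of how noisy individual answers can still give a useful aggregate, here is a minimal sketch of one classic coin-flip scheme. The function names and the 50/50 coin probabilities are assumptions chosen for the example, not a prescribed protocol.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """First coin: heads, answer truthfully; tails, report a second coin flip."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(responses) -> float:
    """Recover the population rate p from the noisy answers.

    P(reported yes) = 0.5 * p + 0.25, so p = 2 * observed - 0.5.
    """
    observed = sum(responses) / len(responses)
    return 2 * observed - 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    truths = [i < 3000 for i in range(10000)]  # 30% of respondents are "yes"
    responses = [randomized_response(t, rng) for t in truths]
    print(estimate_true_rate(responses))  # close to 0.3, but not exact
```

Any single reported answer could have come from the second coin, which is the source of the plausible deniability, yet the estimate over many respondents lands near the true rate.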
8.2. What is synthetic data?
A synthetic data set is a set of fictional data, generated to have the same characteristics as a set of real data on which it is based.
Creating a synthetic data set may provide excellent privacy protection: no individual’s data is being revealed. But if the process of creating the synthetic data set is flawed, data on some users may slip through the process relatively unchanged.
There’s a subtle problem of deciding what features of the original data set to retain. A model is fit to the original data, then new data is generated using that model. Did the model retain the features important for your purposes? It’s hard to say. The point of creating a synthetic data set is that the recipient’s use of the data is unknown. If the use were known, someone could run that analysis rather than creating a synthetic data set.
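To make the "fit a model, then generate from it" step concrete, here is a deliberately simple sketch that assumes the model is a single normal distribution. Real synthetic data generators use far richer models; the function names here are hypothetical.

```python
import random
import statistics


def fit_gaussian(values):
    """Fit the (assumed) model: just a mean and a standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(values, n, rng):
    """Draw n fictional values from the model fit to the real values."""
    mu, sigma = fit_gaussian(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    real = [20.0] * 50 + [60.0] * 50  # two distinct clusters
    synthetic = generate_synthetic(real, 5000, rng)
    print(statistics.mean(real), statistics.mean(synthetic))
```

The synthetic values keep roughly the same mean and spread as the real ones, but the two-cluster shape is gone because the model never captured it. Whether that matters depends on an analysis the data creator cannot foresee.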
8.3. What is k-anonymity?
The idea of k-anonymity is that every database record appears in the database at least k times. That sounds simple, but which records should you focus on? In a database table with hundreds of columns, every single record may be unique. In practice, people mean that each record appears at least k times when you focus on some set of potentially identifying attributes. Which attributes should those be? What value of k is advisable in a particular context?
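Once a set of potentially identifying attributes has been chosen, measuring k is mechanical. The sketch below assumes records are dictionaries and uses made-up column names; choosing the attributes is the hard part and is not something code can settle.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records sharing the same
    values on the chosen attributes (the largest k the data satisfies).
    Assumes records is non-empty."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())


if __name__ == "__main__":
    records = [
        {"zip": "77001", "age": 30, "dx": "flu"},
        {"zip": "77001", "age": 30, "dx": "cold"},
        {"zip": "77002", "age": 41, "dx": "flu"},
        {"zip": "77002", "age": 45, "dx": "asthma"},
    ]
    print(k_anonymity(records, ["zip"]))         # groups by zip only
    print(k_anonymity(records, ["zip", "age"]))  # adding age splits a group
```

Adding an attribute to the list can only split groups, so k can stay the same or shrink but never grow.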
8.4. What is ℓ-diversity?
ℓ-diversity is a variation on k-anonymity intended to address some of the weaknesses of the latter.
Suppose you apply k-anonymity on quasi-identifiers to a data set and a set of k people who share the same quasi-identifiers also share the same value of an additional attribute. The data set does not leak identity, but it does leak information. ℓ-diversity addresses this issue.
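The leak described above can be detected by counting how many distinct sensitive values appear within each group. This is a minimal sketch of the simplest ("distinct") form of the idea, with hypothetical column names; stricter variants exist and are not shown.

```python
from collections import defaultdict


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Return the smallest number of distinct sensitive values found in any
    group of records that share the same quasi-identifier values.
    Assumes records is non-empty."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())


if __name__ == "__main__":
    records = [
        {"zip": "77001", "dx": "flu"},
        {"zip": "77001", "dx": "flu"},     # same group, same diagnosis
        {"zip": "77002", "dx": "cold"},
        {"zip": "77002", "dx": "asthma"},
    ]
    print(distinct_l_diversity(records, ["zip"], "dx"))
```

In this example every zip code appears twice, so the table is 2-anonymous, yet anyone known to live in 77001 is revealed to have the flu. A result of 1 flags exactly that situation.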
8.5. What is t-closeness?
See this article.
8.6. What is earth-mover distance?
See this article.
8.7. What is differential privacy?
Differential privacy is a relatively new approach to data privacy, one that seeks to quantify and minimize the effect that any one person has on inferences drawn from a particular data set.
See an introduction to differential privacy here.
8.8. What are the advantages of differential privacy?
One of the strengths of differential privacy is that it treats all data as equally sensitive. This is also one of its weaknesses. It is a conservative approach to data privacy, but in some cases it can be too conservative.
Differential privacy adds randomness to query results, calibrating the amount of randomness to correspond to the sensitivity of the query. This is done automatically. No one has to determine a priori how sensitive all possible queries are; the software implementing differential privacy determines this on the fly using the actual data. No one has to anticipate what might be identifiable and possibly guess incorrectly.
Here is an example of differential privacy in action, making it possible to answer questions about dates that would not be allowed under HIPAA Safe Harbor.
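Separately, to illustrate what "calibrating randomness to the sensitivity of the query" can look like, here is a minimal sketch of the Laplace mechanism applied to a counting query. It assumes a fixed sensitivity of 1 (one person changes a count by at most 1) and a privacy parameter epsilon. The function names are hypothetical, and production systems handle many details omitted here.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(records, predicate, epsilon, rng):
    """Return a noisy count of records matching predicate.

    Assumption: adding or removing one person changes the true count by at
    most 1, so the sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(42)
    ages = [23, 35, 41, 52, 67, 70, 29, 88]
    print(private_count(ages, lambda a: a >= 65, epsilon=1.0, rng=rng))
```

A smaller epsilon means a larger noise scale and stronger privacy, while a query with higher sensitivity would need proportionally more noise for the same epsilon.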
8.9. How can competitors share data without giving up their data?
See this article.
8.10. Does differential privacy scale up?
Absolutely. The US Census Bureau applied differential privacy to the 2020 US census. More on that here.
8.11. Does differential privacy scale down?
This is more of a challenge. Differential privacy does not work well with very small databases: the amount of randomness needed to secure queries may be too high.
How small is too small? That depends on the nature of the data. We could help you answer that question.
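One rough way to see the effect of database size: if a count is protected with Laplace noise of scale 1/epsilon (an assumption for this sketch; other mechanisms behave differently), the noise has the same absolute size no matter how large the count is, so it looms larger relative to small counts.

```python
import math


def relative_noise(true_count, epsilon, sensitivity=1.0):
    """Standard deviation of Laplace noise as a fraction of the true count."""
    scale = sensitivity / epsilon
    noise_std = math.sqrt(2) * scale  # std. dev. of Laplace(0, scale)
    return noise_std / true_count


if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        print(n, relative_noise(n, epsilon=0.1))
```

With epsilon = 0.1 the typical noise is about 14, which swamps a count of 10 but is negligible against a count of a million.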
8.12. What is a privacy budget?
A privacy budget is the way differential privacy keeps track of the cumulative risk to privacy from running multiple queries. More on privacy budgets here.
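As a sketch of the bookkeeping involved, the class below assumes basic sequential composition, under which the epsilon values of successive queries simply add up. The class and method names are hypothetical, and real systems may use tighter accounting.

```python
class PrivacyBudget:
    """Track cumulative epsilon spent against a fixed total."""

    def __init__(self, total_epsilon):
        self.total = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Charge a query's epsilon, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # tolerate float rounding
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend(0.4)
    budget.spend(0.4)
    print(budget.remaining)  # a third query at 0.4 would be refused
```

Once the budget is exhausted, no further queries are answered, which is what keeps the cumulative risk bounded.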