The purpose of this document is to provide a high level understanding of the trade-offs that are required when designing and deploying differentially private systems (privacy versus utility versus the number and kind of parties trusted, etc.). This document intentionally does not discuss the mathematical mechanisms used in differential privacy. Instead, the goal is to enable a reader to be critical of differentially private systems, and to suggest issues and dimensions to consider when reviewing a system.
Differential Privacy (DP) is a framework for algorithmic privacy. When collecting data from a large number of people, a differentially private system enables the collection and analysis of aggregate data while protecting individuals' privacy in a mathematically precise manner. The intuition behind DP is that if the output of a system is not significantly affected by any individual's contribution to the input database, then an attacker cannot be confident whether any particular individual's data is in the database, regardless of what other information the attacker may already have. DP systems introduce this uncertainty (or deniability) by adding randomness to the collected data, though they differ greatly in how that randomness is added. Some examples include modifying or removing specificity from contributed values, adding or removing submissions from the aggregate dataset, or slightly modifying submitted values.
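While this document intentionally avoids the underlying mathematics, a small illustrative sketch may help make the idea of deniability-through-randomness concrete. The Python snippet below shows the classic "randomized response" technique; it is a toy example, not the mechanism of any particular deployed system, and the coin probabilities are illustrative assumptions.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report the true answer only half the time; otherwise report a coin flip.

    Because any single report may be random noise, no individual answer can
    be pinned down, but the aggregate rate can still be estimated.
    """
    if random.random() < 0.5:
        return true_answer          # first coin: tell the truth
    return random.random() < 0.5    # second coin: answer at random

def estimate_true_rate(reports: list[bool]) -> float:
    """Invert the randomization: E[reported rate] = 0.25 + 0.5 * true rate."""
    reported_rate = sum(reports) / len(reports)
    return (reported_rate - 0.25) / 0.5

# Toy example: 10,000 participants, 30% of whom would truthfully answer "yes".
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(f"Estimated rate of 'yes' answers: {estimate_true_rate(reports):.3f}")
```

Because any one report may be a coin flip rather than the truth, no single answer can be held against its contributor, yet the overall rate can still be recovered with reasonable accuracy.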
One common, core feature of differential privacy systems is "epsilon" (ε), a parameter that can be increased or decreased to control how much privacy a system provides. This subsection tries to give an intuition for the role of epsilon in these systems.
However, because epsilon is a single number, and (at a high level) "lower epsilon means better privacy," epsilon sometimes becomes the one thing reviewers focus on when evaluating a differential privacy system. In fact though, epsilon is just one factor among many that determine whether a system effectively protects its participants. A differential privacy system can have a small epsilon value, while still providing little to no protection for its participants.
Epsilon can be thought of as a dial that controls the balance between privacy and usefulness:
The epsilon parameter limits how confidently an attacker can know whether a person's information is included in the collected dataset. Epsilon limits the attacker's confidence even in worst-case scenarios, such as "if the attacker knows the full purchasing history of every single person on earth, how confident can the attacker be that Jane Doe's purchasing history is included in the aggregated dataset?" These worst-case scenarios are often not realistic, but limiting them is what makes differential privacy systems so protective (if done correctly).
However, because realistic attacker scenarios are (usually) very different from the worst-case possibilities, deployed differential privacy systems sometimes use relatively large epsilon values, even into the double digits.
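For readers who do want to see the formal statement behind this intuition, the standard textbook definition of ε-differential privacy is sketched below, along with the "odds" interpretation that explains why double-digit epsilons weaken the worst-case guarantee so dramatically. This is the generic definition, not a description of any particular system discussed in this document.

```latex
% Standard \varepsilon-differential privacy (textbook form).
% A randomized mechanism M is \varepsilon-differentially private if, for all
% pairs of datasets D and D' differing in one person's data, and all output sets S:
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
\]
% Equivalently, observing the output can multiply an attacker's prior odds
% that any one person's data is present by at most e^{\varepsilon}:
\[
  \frac{\text{posterior odds}}{\text{prior odds}} \;\le\; e^{\varepsilon},
  \qquad e^{0.5} \approx 1.6, \quad e^{10} \approx 22{,}026
\]
```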
The following are suggested attributes to consider when evaluating a system that relies on differential privacy techniques to provide privacy guarantees.
Differential privacy can be deployed in three main settings: the central model, the local model, and the distributed model. Each of these reflects (and requires) a very different level of trust from the participants (the people whom the system designer wants to learn about).
This post by Damien Desfontaines provides further explanation of these three broad approaches.
These systems differ greatly in how much privacy protection they provide, and against which attackers; in differential privacy terms, this level of protection is quantified as epsilon. In the central model, participants must trust the initial data collector fully, since they share their complete, unmodified information with the central party, who is then trusted to modify the data on the participants' behalf before sharing it with anyone else.
In general, the central model of differential privacy provides more accurate results (or "utility"), while the local model does not require participants to trust a third party with access to their unprotected data, albeit at the cost of less accurate results.
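To make this trade-off concrete, the sketch below estimates how many of 10,000 simulated participants have some sensitive property, once in the central model (a trusted collector adds a single draw of Laplace noise to the exact count) and once in the local model (every participant adds their own noise before reporting). The parameter values, noise scales, and variable names are illustrative assumptions, not a description of any real deployment.

```python
import numpy as np

EPSILON = 1.0      # illustrative privacy parameter
N = 10_000         # number of simulated participants
TRUE_RATE = 0.3    # fraction with the sensitive property (assumption)

rng = np.random.default_rng(0)
true_values = rng.random(N) < TRUE_RATE   # each participant's 0/1 value
true_count = int(true_values.sum())

# Central model: participants send raw values; the trusted collector computes
# the exact count and adds one draw of Laplace noise (a count has sensitivity 1).
central_estimate = true_count + rng.laplace(0.0, 1.0 / EPSILON)

# Local model: each participant adds their own Laplace noise before reporting,
# so the collector never sees any raw value; the collector sums noisy reports.
local_reports = true_values + rng.laplace(0.0, 1.0 / EPSILON, size=N)
local_estimate = local_reports.sum()

print(f"true count:       {true_count}")
print(f"central estimate: {central_estimate:.1f}  (error independent of N)")
print(f"local estimate:   {local_estimate:.1f}  (error grows with sqrt(N))")
```

The central estimate's error does not depend on the number of participants, while the local estimate's error grows with the square root of the number of participants; that accuracy gap is the price of not having to trust the collector with raw data.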
Differential privacy systems modify a participant's "true" data so that the participant can plausibly deny that their data is part of the final aggregated dataset. The most common way differential privacy systems provide this deniability is by modifying (sometimes referred to as "perturbing") the underlying data: adding or subtracting a small amount from the true value. However, other techniques have also been used, such as making the submitted values less granular (e.g., "thresholding"), having participants abstain from contributing any data with a certain probability, or adding "dummy" submissions or values to the aggregate.
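As a toy illustration of two of these alternative techniques (random abstention and dummy submissions), the sketch below uses made-up probabilities and record counts; calibrating those values so that the result actually satisfies a chosen epsilon is the hard part, and is not shown here.

```python
import random

ABSTAIN_PROB = 0.2           # illustrative: chance a participant sends nothing
NUM_DUMMIES = 500            # illustrative: dummy records injected by the collector
VALUE_RANGE = range(18, 90)  # e.g., ages

def maybe_report(true_value: int) -> int | None:
    """Randomly abstain, so absence from the dataset is also deniable."""
    if random.random() < ABSTAIN_PROB:
        return None
    return true_value

def collect(reports: list[int | None]) -> list[int]:
    """Drop abstentions and mix in dummy values, so no individual record in the
    final dataset is known to be real. (Choosing ABSTAIN_PROB and NUM_DUMMIES to
    meet a target epsilon is mechanism-specific and omitted here.)"""
    real = [r for r in reports if r is not None]
    dummies = [random.choice(VALUE_RANGE) for _ in range(NUM_DUMMIES)]
    mixed = real + dummies
    random.shuffle(mixed)
    return mixed

# Example usage with simulated participants.
participants = [random.randint(18, 89) for _ in range(10_000)]
dataset = collect([maybe_report(age) for age in participants])
```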
Not all systems that provide differential privacy provide equal protection to the users contributing their data. Just as you can encrypt some data under a 2-bit key or a 2,000-bit key, you can design DP systems to provide weaker or stronger protections. Simply saying a system provides DP guarantees doesn't, by itself, tell you how well participants' privacy will be maintained.
You can quantify the privacy protection a DP system provides as the "greatest possible information gain by the attacker"; in DP systems, this amount of information gain is described by epsilon.
Differential privacy systems are generally designed to collect data for a constrained amount of time. This is broadly because, as more and more data is collected, outliers will require more and more perturbation to not be identifiable in the final aggregate dataset. In differential privacy terminology, this is referred to as "exhausting the privacy budget".
As a result, differential privacy systems generally need to stop adding contributed data to the final aggregate dataset at some point, after which new contributions are added to the next, fully distinct dataset.
When evaluating a differential privacy system, it's important to make sure the system accounts for exhausted privacy budgets, and that data collected in one collection period is kept distinct from other collection periods.
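One common (though not universal) way systems account for this is "sequential composition": the epsilons spent on releases from the same collection period add up, and once the total reaches the budget, no further releases are allowed. Below is a minimal sketch of such an accountant, with illustrative numbers; real systems may use more sophisticated accounting.

```python
class PrivacyBudget:
    """Minimal sequential-composition accountant: epsilons spent against the
    same collection period add up, and releases stop once the budget is
    exhausted. This is an illustrative sketch, not a production accountant."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> bool:
        """Try to charge a release of `epsilon`; refuse if it would exceed the budget."""
        if self.spent + epsilon > self.total_epsilon:
            return False
        self.spent += epsilon
        return True

# Example: a collection period with a total budget of epsilon = 1.0.
budget = PrivacyBudget(total_epsilon=1.0)
for query in range(5):
    if budget.spend(0.3):
        print(f"query {query}: released with epsilon = 0.3")
    else:
        print(f"query {query}: refused -- budget exhausted; start a new collection period")
```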
Most differential privacy systems are designed to collect single-dimensional data (e.g., the age of the participant). Correctly handling multi-dimensional data requires careful consideration, consideration that is not covered by the vast majority of differential privacy systems and research papers. Most differential privacy systems have difficulty with multi-dimensional data regardless of whether the additional dimensions are collected explicitly (e.g., participants contribute JSON messages that record both their age and their height), or implicitly (e.g., users contribute their age, but do so via HTTP, and so reveal their browser version and vendor in the HTTP message's "User-Agent" header).
If a differential privacy system is collecting multi-dimensional data, it's important to check whether such concerns have been thoroughly addressed.
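One simple, worst-case way a system can account for explicitly collected extra dimensions is to split the overall budget across them, since queries about the same people compose; the sketch below assumes that naive split, with made-up dimension names. Implicitly collected dimensions, such as the User-Agent header, fall outside this accounting entirely and must be stripped or generalized separately.

```python
TOTAL_EPSILON = 1.0

# Dimensions the system explicitly collects (illustrative names).
dimensions = ["age", "height"]

# Naive worst-case accounting: releases about the same people compose, so each
# dimension only gets a fraction of the total budget -- and therefore noisier
# results than a single-dimension collection would produce.
per_dimension_epsilon = TOTAL_EPSILON / len(dimensions)
print({dim: per_dimension_epsilon for dim in dimensions})

# Implicit dimensions (e.g., the HTTP "User-Agent" header) are not covered by
# this accounting at all; the collection pipeline must strip or generalize them
# separately, or they leak outside the differential privacy guarantee.
```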
Differential privacy systems are popular at the moment. This is partially because differential privacy promises mathematically rigorous protections of privacy, partially because differential privacy systems solve some of the problems that plagued previously popular algorithmic privacy techniques (e.g., k-anonymity), and partially because differential privacy has been promoted (sometimes honestly, sometimes cynically) as a "magic wand" that allows companies to collect sensitive data without needing to get people's knowledge and consent.
Differential privacy systems can be exciting and clever, but have their own complications and limitations. Differential privacy systems can allow people to collaborate to collect aggregate data while maintaining privacy, but they require careful consideration, implementation, and knowledge of their tradeoffs.
Just as important, differential privacy's mathematical privacy promises should not be seen as a replacement for consent, personal agency, and human-centered computing.
In short, when considering and reviewing differential privacy systems, it's just as important to consider whether participants know about and have consented to the data collection, and whether the collection is in their interest, as it is to evaluate the technical privacy guarantees themselves.
Put differently, reviewers should not mistake "differential privacy protections" for "user consent and user interest."