The purpose of this document is to provide a high level understanding of the trade-offs that are required when designing and deploying differentially private systems (privacy versus utility versus the number and kind of parties trusted, etc.). This document intentionally does not discuss the mathematical mechanisms used in differential privacy. Instead, the goal is to enable a reader to be critical of differentially private systems, and to suggest issues and dimensions to consider when reviewing a system.
Differential Privacy (DP) is a framework for algorithmic privacy. When collecting data from a large number of people, a differentially private system enables the collection and analysis of aggregate data while protecting individuals' privacy in a mathematically precise manner. The intuition behind DP is that if the output of a system is not significantly affected by any individual's contribution to the input database, then an attacker cannot be confident whether any particular individual's data is in the database, regardless of what other information the attacker may already have. DP systems introduce this uncertainty (or deniability) by adding randomness to the collected data, though they differ greatly in how that randomness is added. Some examples include modifying or removing specificity from contributed values, adding or removing submissions from the aggregate dataset, or slightly modifying submitted values.
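While this document intentionally avoids the underlying mathematics, a small illustrative sketch may help make the idea of deniability-through-randomness concrete. The Python snippet below shows the classic "randomized response" technique; it is a toy example, not the mechanism of any particular deployed system, and the coin probabilities are illustrative assumptions.

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Report the true answer only half the time; otherwise report a coin flip.

    Because any single report may be random noise, no individual answer can
    be pinned down, but the aggregate rate can still be estimated.
    """
    if random.random() < 0.5:
        return true_answer          # first coin: tell the truth
    return random.random() < 0.5    # second coin: answer at random

def estimate_true_rate(reports: list[bool]) -> float:
    """Invert the randomization: E[reported rate] = 0.25 + 0.5 * true rate."""
    reported_rate = sum(reports) / len(reports)
    return (reported_rate - 0.25) / 0.5

# Toy example: 10,000 participants, 30% of whom would truthfully answer "yes".
truths = [random.random() < 0.3 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]
print(f"Estimated rate of 'yes' answers: {estimate_true_rate(reports):.3f}")
```

Because any one report may be a coin flip rather than the truth, no single answer can be held against its contributor, yet the overall rate can still be recovered with reasonable accuracy.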
One common, core feature of differential privacy systems is "epsilon" (ε), a parameter that can be increased or decreased to control how much privacy a system provides. This subsection tries to give an intuition for the role of epsilon in these systems.
However, because epsilon is a single number, and (at a high level) "lower epsilon means better privacy," epsilon sometimes becomes the one thing reviewers focus on when evaluating a differential privacy system. In fact though, epsilon is just one factor among many that determine whether a system effectively protects its participants. A differential privacy system can have a small epsilon value, while still providing little to no protection for its participants.
Epsilon can be thought of as a dial that controls the balance between privacy and usefulness:
The epsilon parameter limits how confidently an attacker can know whether a person's information is included in the collected dataset. Epsilon limits the attacker's confidence even in worst-case scenarios, such as "if the attacker knows the full purchasing history of every single person on earth, how confident can the attacker be that Jane Doe's purchasing history is included in the aggregated dataset?" These worst-case scenarios are often not realistic, but limiting them is what makes differential privacy systems so protective (if done correctly).
However, because realistic attacker scenarios are (usually) very different from the worst-case possibilities, deployed differential privacy systems sometimes use relatively large epsilon values, even into the double digits.
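For readers who do want to see the formal statement behind this intuition, the standard textbook definition of ε-differential privacy is sketched below, along with the "odds" interpretation that explains why double-digit epsilons weaken the worst-case guarantee so dramatically. This is the generic definition, not a description of any particular system discussed in this document.

```latex
% Standard \varepsilon-differential privacy (textbook form).
% A randomized mechanism M is \varepsilon-differentially private if, for all
% pairs of datasets D and D' differing in one person's data, and all output sets S:
\[
  \Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
\]
% Equivalently, observing the output can multiply an attacker's prior odds
% that any one person's data is present by at most e^{\varepsilon}:
\[
  \frac{\text{posterior odds}}{\text{prior odds}} \;\le\; e^{\varepsilon},
  \qquad e^{0.5} \approx 1.6, \quad e^{10} \approx 22{,}026
\]
```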
The following are suggested attributes to consider when evaluating a system that relies on differential privacy techniques to provide privacy guarantees.
Differential privacy can be deployed in three main settings: the central model, the local model, and the distributed model. Each of these reflects (and requires) a very different level of trust from the participants (the people whom the system designer wants to learn about).
This post by Damien Desfontaines provides further explanation of these three broad approaches.
These systems differ greatly in how much privacy protection they provide, and against which attackers; in differential privacy terms, this level of protection is quantified as epsilon. In the central model, participants must trust the initial data collector fully, since they share their complete, unmodified information with the central party, who is then trusted to modify the data on the participants' behalf before sharing it with anyone else.
In general, the central model of differential privacy provides more accurate results (or "utility"), while the local model does not require participants to trust a third party with access to their unprotected data, albeit at the cost of less accurate results.
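To make this trade-off concrete, the sketch below estimates how many of 10,000 simulated participants have some sensitive property, once in the central model (a trusted collector adds a single draw of Laplace noise to the exact count) and once in the local model (every participant adds their own noise before reporting). The parameter values, noise scales, and variable names are illustrative assumptions, not a description of any real deployment.

```python
import numpy as np

EPSILON = 1.0      # illustrative privacy parameter
N = 10_000         # number of simulated participants
TRUE_RATE = 0.3    # fraction with the sensitive property (assumption)

rng = np.random.default_rng(0)
true_values = rng.random(N) < TRUE_RATE   # each participant's 0/1 value
true_count = int(true_values.sum())

# Central model: participants send raw values; the trusted collector computes
# the exact count and adds one draw of Laplace noise (a count has sensitivity 1).
central_estimate = true_count + rng.laplace(0.0, 1.0 / EPSILON)

# Local model: each participant adds their own Laplace noise before reporting,
# so the collector never sees any raw value; the collector sums noisy reports.
local_reports = true_values + rng.laplace(0.0, 1.0 / EPSILON, size=N)
local_estimate = local_reports.sum()

print(f"true count:       {true_count}")
print(f"central estimate: {central_estimate:.1f}  (error independent of N)")
print(f"local estimate:   {local_estimate:.1f}  (error grows with sqrt(N))")
```

The central estimate's error does not depend on the number of participants, while the local estimate's error grows with the square root of the number of participants; that accuracy gap is the price of not having to trust the collector with raw data.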
Differential privacy systems modify a participant's "true" data so that the participant can plausibly deny that their data is part of the final aggregated dataset. The most common way differential privacy systems provide this deniability is by modifying (sometimes referred to as "perturbing") the underlying data: adding or subtracting a small amount from the true value. However, other techniques have also been used, such as making the submitted values less granular (e.g., "thresholding"), having participants abstain from contributing any data with a certain probability, or adding "dummy" submissions or values to the aggregate.
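As a toy illustration of two of these alternative techniques (random abstention and dummy submissions), the sketch below uses made-up probabilities and record counts; calibrating those values so that the result actually satisfies a chosen epsilon is the hard part, and is not shown here.

```python
import random

ABSTAIN_PROB = 0.2           # illustrative: chance a participant sends nothing
NUM_DUMMIES = 500            # illustrative: dummy records injected by the collector
VALUE_RANGE = range(18, 90)  # e.g., ages

def maybe_report(true_value: int) -> int | None:
    """Randomly abstain, so absence from the dataset is also deniable."""
    if random.random() < ABSTAIN_PROB:
        return None
    return true_value

def collect(reports: list[int | None]) -> list[int]:
    """Drop abstentions and mix in dummy values, so no individual record in the
    final dataset is known to be real. (Choosing ABSTAIN_PROB and NUM_DUMMIES to
    meet a target epsilon is mechanism-specific and omitted here.)"""
    real = [r for r in reports if r is not None]
    dummies = [random.choice(VALUE_RANGE) for _ in range(NUM_DUMMIES)]
    mixed = real + dummies
    random.shuffle(mixed)
    return mixed

# Example usage with simulated participants.
participants = [random.randint(18, 89) for _ in range(10_000)]
dataset = collect([maybe_report(age) for age in participants])
```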
Not all systems that provide differential privacy provide equal protection to the users contributing their data. Just as you can encrypt some data under a 2-bit key or a 2,000-bit key, you can design DP systems to provide weaker or stronger protections. Simply saying a system provides DP guarantees doesn't, by itself, tell you how well participants' privacy will be maintained.
You can quantify the privacy protection a DP system provides as the "greatest possible information gain by the attacker"; in DP systems, this amount of information gain is described by epsilon.
Differential privacy systems are generally designed to collect data for a constrained amount of time. This is broadly because, as more and more data is collected, outliers will require more and more perturbation to not be identifiable in the final aggregate dataset. In differential privacy terminology, this is referred to as "exhausting the privacy budget".
As a result, differential privacy systems generally need to stop adding contributed data to the final aggregate dataset at some point, after which new contributions are added to the next, fully distinct dataset.
When evaluating a differential privacy system, it's important to make sure the system accounts for exhausted privacy budgets, and that data collected in one collection period is kept distinct from other collection periods.
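One common (though not universal) way systems account for this is "sequential composition": the epsilons spent on releases from the same collection period add up, and once the total reaches the budget, no further releases are allowed. Below is a minimal sketch of such an accountant, with illustrative numbers; real systems may use more sophisticated accounting.

```python
class PrivacyBudget:
    """Minimal sequential-composition accountant: epsilons spent against the
    same collection period add up, and releases stop once the budget is
    exhausted. This is an illustrative sketch, not a production accountant."""

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> bool:
        """Try to charge a release of `epsilon`; refuse if it would exceed the budget."""
        if self.spent + epsilon > self.total_epsilon:
            return False
        self.spent += epsilon
        return True

# Example: a collection period with a total budget of epsilon = 1.0.
budget = PrivacyBudget(total_epsilon=1.0)
for query in range(5):
    if budget.spend(0.3):
        print(f"query {query}: released with epsilon = 0.3")
    else:
        print(f"query {query}: refused -- budget exhausted; start a new collection period")
```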
Most differential privacy systems are designed to collect single-dimensional data (e.g., the age of the participant). Correctly handling multi-dimensional data requires careful consideration, consideration that is not covered by the vast majority of differential privacy systems and research papers. Most differential privacy systems have difficulty with multi-dimensional data regardless of whether the additional dimensions are collected explicitly (e.g., participants contribute JSON messages that record both their age and their height), or implicitly (e.g., users contribute their age, but do so via HTTP, and so reveal their browser version and vendor in the HTTP message's "User-Agent" header).
If a differential privacy system is collecting multi-dimensional data, it's important to check whether such concerns have been thoroughly addressed.
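One simple, worst-case way a system can account for explicitly collected extra dimensions is to split the overall budget across them, since queries about the same people compose; the sketch below assumes that naive split, with made-up dimension names. Implicitly collected dimensions, such as the User-Agent header, fall outside this accounting entirely and must be stripped or generalized separately.

```python
TOTAL_EPSILON = 1.0

# Dimensions the system explicitly collects (illustrative names).
dimensions = ["age", "height"]

# Naive worst-case accounting: releases about the same people compose, so each
# dimension only gets a fraction of the total budget -- and therefore noisier
# results than a single-dimension collection would produce.
per_dimension_epsilon = TOTAL_EPSILON / len(dimensions)
print({dim: per_dimension_epsilon for dim in dimensions})

# Implicit dimensions (e.g., the HTTP "User-Agent" header) are not covered by
# this accounting at all; the collection pipeline must strip or generalize them
# separately, or they leak outside the differential privacy guarantee.
```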
Differential privacy systems are popular at the moment. This is partially because differential privacy promises mathematically rigorous protections of privacy, partially because differential privacy systems solve some of the problems that plagued previously popular algorithmic privacy techniques (e.g., k-anonymity), and partially because differential privacy has been promoted (sometimes honestly, sometimes cynically) as a "magic wand" that allows companies to collect sensitive data without needing to get people's knowledge and consent.
Differential privacy systems can be exciting and clever, but have their own complications and limitations. Differential privacy systems can allow people to collaborate to collect aggregate data while maintaining privacy, but they require careful consideration, implementation, and knowledge of their tradeoffs.
Just as important, differential privacy's mathematical privacy promises should not be seen as a replacement for consent, personal agency, and human-centered computing.
In short, when considering and reviewing differential privacy systems, it's just as important to consider whether participants know about and have consented to the data collection, and whether the collection is in their interest, as it is to evaluate the technical privacy guarantees themselves.
Put differently, reviewers should not mistake "differential privacy protections" for "user consent and user interest."