Differential Privacy in R using diffpriv

Originally published on December 22, 2020.

A while ago, right here on Data Hackers, I posted about the consequences of relying solely on anonymization and how Differential Privacy (DP) can help mitigate adversarial attacks such as Membership Inference [AN1], among other types of attacks and failures.

In this post, however, I will cover some practical aspects and, of course, some code to make the concepts a bit more concrete.

For those who are just arriving at this post and don’t know what Differential Privacy is, I recommend reading the article “[What is Differential Privacy and how is it related to security in Machine Learning?](/differential-privacy-and-machine-learning-en)”, which has a very simple explanation of the topic. Today’s post will be about potential uses, implementation, and the tradeoff between rigor and uncertainty when using DP.

Uses of Differential Privacy

Differential Privacy can be applied in several scenarios, such as:

  • Publication of official statistics by government agencies, as the US Census Bureau did for the 2020 Census;

  • Collection of usage telemetry from devices and applications, as Apple and Google do;

  • Sharing of reports, data sets, and analyses with third parties without exposing individual records.

These are just a few scenarios that illustrate the potential use of Differential Privacy in daily life.

Now, let’s move on to the code, using a sensitive data analysis scenario.

Using diffpriv to Apply Differential Privacy

diffpriv is an R library created in 2017 by Benjamin Rubinstein and Francesco Aldà for data analyses that require the application of differential privacy, such as making reports, data sets, or analyses available to third parties.

diffpriv implements generic mechanisms for privatizing the outputs of computations on sensitive data. The available mechanisms are: Laplace, Gaussian, Exponential, and Bernstein. The package documentation is very complete and the explanations are quite didactic.

In our case, we will analyze data on the salaries and ages of the people in a given company. The idea is to see a differential privacy mechanism in practice and its effect on the final result.

First, let’s install the gsl and diffpriv libraries [AN2]:
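
A minimal sketch of this step:

```r
# gsl (GNU Scientific Library bindings) may also require a
# system-level installation; see [AN2]
install.packages("gsl")
install.packages("diffpriv")

library(diffpriv)
```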

With the libraries loaded, let’s create a data set with the employees of this company:
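
Something along these lines, with illustrative values chosen so that the averages match the ones discussed below (45 years and 9,830 reais):

```r
# Hypothetical employee data: the values are illustrative, chosen so
# that mean(age_list) = 45 and mean(salary_list) = 9830
employees <- data.frame(
  age_list    = c(28, 33, 37, 41, 44, 47, 51, 54, 56, 59),
  salary_list = c(4500, 6200, 7800, 8900, 9500,
                  10300, 11200, 12100, 13400, 14400)
)
```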

Now that we have our data set in R, let’s create a function to calculate the average; this function will be used to calculate the average age and salaries of this company:
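
A minimal version of this function and its use on our data:

```r
# Non-private target function: the simple arithmetic mean
returns_average <- function(X) mean(X)

returns_average(employees$age_list)     # 45
returns_average(employees$salary_list)  # 9830
```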

We can see that the average age of the people in the data set is 45 years and the average salary is 9830 reais.

However, these precise values cannot be passed on, since we want to represent the overall dynamics of the data rather than offer any analysis or mechanism that violates these people’s privacy.

Therefore, we will move on to the creation of a differential privacy mechanism using diffpriv.

The diffpriv library has four mechanisms for Differential Privacy. They are:

  • the Laplace Mechanism (DPMechLaplace);

  • the Gaussian Mechanism (DPMechGaussian);

  • the Exponential Mechanism (DPMechExponential); and

  • the Bernstein Mechanism (DPMechBernstein).

That said, let’s choose the Laplace Mechanism, which will add noise to keep our numerical answers private.

To do this, we will pass our returns_average function to this Laplace mechanism (which we will call privacy_mechanism) as follows:
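
A sketch of this instantiation (the sensitivity will be set later by the sensitivity sampler):

```r
# Laplace mechanism wrapping our non-private target function; its
# sensitivity will be estimated later with sensitivitySampler()
privacy_mechanism <- DPMechLaplace(target = returns_average)
```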

Within the DPMechLaplace mechanism we have the following attributes:

  • sensitivity: This variable determines the amplitude of the perturbation inherent to the function; a perturbation that we will apply to our numeric list. In other words, sensitivity is the magnitude by which the data of a single individual can change the output of the function f in the worst case. Higher sensitivity necessarily increases privacy, but also the uncertainty/noise in the data;

  • target: The non-private function whose result (output) will be privatized; it expects a list of values. The Laplace Mechanism expects target functions that return numeric vectors;

  • gammaSensitivity: Privacy confidence level. High confidence demands greater sensitivity (and a wider noise spectrum, i.e., more uncertainty in the data and less utility), while a lower confidence level narrows the noise spectrum and thereby increases utility (less uncertainty/noise) but reduces the privacy level.

After defining our privacy mechanism, we will create a function choosing a distribution that we will pass as noise. In this case, we will choose the normal distribution (rnorm) to generate the quantity of records n (which in our case will be the number of employees) and we will perform an execution test.
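
A sketch of this step:

```r
# Number of records in our data set
n <- length(employees$salary_list)

# Distribution used to generate random records: normal draws
# rescaled to the salary range
salary_distribution <- function(n) rnorm(n) * 10000

# Execution test
salary_distribution(n)
```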

I multiplied the results by 10,000 just so that the values in this list would be on the same scale as the salaries. The distribution must always be on the same scale as the response variable that will be privatized.

An important piece we have to create is the sample generation mechanism, which will determine the sensitivity used by our privacy_mechanism based on the distribution function that generates the randomized samples.

The intuition behind the sensitivity-based sample generation mechanism (which in diffpriv is called sensitivitySampler()) is that independent and identically distributed random records are drawn from the chosen distribution and run through the target function, in order to estimate how much a single record can change its output. The noise added by the privacy mechanism we chose earlier is then calibrated to this estimated sensitivity (i.e., kept within a certain noise spectrum) at the chosen confidence level.

This is implemented in diffpriv as follows:
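
A sketch of the call; gamma = 0.7 matches the confidence level that appears in the salary output further below:

```r
# Estimate the sensitivity of privacy_mechanism empirically by
# sampling random data sets from salary_distribution
private_response_mechanism <- sensitivitySampler(
  object = privacy_mechanism,
  oracle = salary_distribution,
  n      = n,
  gamma  = 0.7
)
```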

The function sensitivitySampler() has the following parameters:

  • object: An object of the DPMech-class (that is, a privacy mechanism);

  • oracle: An oracle that will be the source of random records. A function that returns a list, matrix, or data.frame;

  • n: Number of records contained in the data set. In this case, it must be exactly consistent with the number of records that will pass through the DP process;

  • m: Integer sample size used by the sensitivity sampler, i.e., the number of random data sets drawn to estimate the sensitivity; and

  • gamma: Privacy confidence level, under the random differential privacy (RDP) relaxation [AN3].

With our private_response_mechanism object created, we can check some of its values by executing the following block of code:
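
For example, diffpriv mechanisms are S4 objects, so their slots can be inspected with the @ operator:

```r
# Sampled sensitivity and stored confidence level
private_response_mechanism@sensitivity
private_response_mechanism@gammaSensitivity
```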

With our private response mechanism created, we can now move on to generating the privatized responses for the data set we created at the beginning of this post.

To do this, let’s call the releaseResponse function as in the snippet below:
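
In general form, something like:

```r
response <- releaseResponse(
  mechanism     = private_response_mechanism,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = employees$salary_list
)
```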

The parameters of releaseResponse are:

  • mechanism: An object of the DPMech-class type, i.e., the sensitivity-sampled mechanism we instantiated earlier, which will be responsible for generating our randomized responses considering the mechanism we chose (Laplace), the sensitivity, the distribution, and the confidence level;

  • privacyParams: Receives the privacy parameters, such as ε (epsilon). The smaller the ε, the more privacy is preserved, but accuracy worsens (due to more noise); a very large ε means less noise is added, so privacy worsens; and

  • X: The data set containing the data that needs to go through DP.

To obtain the private responses relative to the list of salaries we passed (employees$salary_list) we need to execute the following code:
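
A sketch of this step, assuming (as in the package documentation) that releaseResponse returns a list exposing $response and $privacyParams, the latter carrying epsilon, delta, and gamma slots once the sensitivity has been sampled:

```r
salary_response <- releaseResponse(
  mechanism     = private_response_mechanism,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = employees$salary_list
)

# Effective privacy parameters and the two averages, side by side
print(paste("privacy_params - gamma:", salary_response$privacyParams@gamma))
print(paste("privacy_params - delta:", salary_response$privacyParams@delta))
print(paste("privacy_params - epsilon:", salary_response$privacyParams@epsilon))

print(paste("response_without_privacy:", returns_average(employees$salary_list)))
print(paste("response_with_privacy:", round(salary_response$response)))
```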

The results obtained in my execution were:

```
# Salary List
[1] "privacy_params - gamma: 0.7"
[1] "privacy_params - delta: 0"
[1] "privacy_params - epsilon: 1"

[1] "response_without_privacy: 9830"
[1] "response_with_privacy: 9213"
```

In this case, the average salary we calculated earlier from the employee list was 9830; using the differential privacy mechanism, the value came out at 9213. Remember that the goal here is to preserve the overall dynamics of the numbers even with the added noise, rather than to release precise information (which could reveal sensitive details about employees).

The age list can be privatized using the same approach, adjusting the oracle to the age scale, through the script below:
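
A sketch under the same assumptions as above; the age-scale oracle distribution and its parameters are illustrative choices, with gamma = 0.5 matching the output shown next:

```r
# Distribution on the age scale (mean and sd here are illustrative)
age_distribution <- function(n) rnorm(n, mean = 45, sd = 15)

age_mechanism <- sensitivitySampler(
  object = DPMechLaplace(target = returns_average),
  oracle = age_distribution,
  n      = n,
  gamma  = 0.5
)

age_response <- releaseResponse(
  mechanism     = age_mechanism,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = employees$age_list
)

print(paste("privacy_params - gamma:", age_response$privacyParams@gamma))
print(paste("privacy_params - delta:", age_response$privacyParams@delta))
print(paste("privacy_params - epsilon:", age_response$privacyParams@epsilon))

print(paste("response_without_privacy:", returns_average(employees$age_list)))
print(paste("response_with_privacy:", round(age_response$response)))
```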

In this case, I had the following result:

```
# Age List
[1] "privacy_params - gamma: 0.5"
[1] "privacy_params - delta: 0"
[1] "privacy_params - epsilon: 1"

[1] "response_without_privacy: 45"
[1] "response_with_privacy: 43"
```

The values in my execution were very similar: 45 for the original average versus 43 with differential privacy noise; depending on the execution, though, the values may differ, since we did not set a seed before running.

Tradeoff: Utility x Privacy

Since applying Differential Privacy mechanisms to data necessarily implies changing the data, it is natural that rigor decreases; after all, we are injecting uncertainty into the data.

High utility, no privacy; High privacy, no utility. Source: [Nicolas Sartor, Explaining Differential Privacy in 3 Levels of Difficulty](https://aircloak.com/explaining-differential-privacy/)

Since this utility (rigor/consistency) x privacy (uncertainty/obfuscation) tradeoff is inherent to Differential Privacy, there is no definitive rule that dictates one setting or the other; the decision is left to the organizations and professionals involved.

Some architecture models for data analysis systems that take differential privacy into account are already appearing and can serve as inspiration, such as LinkedIn’s PriPeARL, which guarantees privacy by design in the reporting and analytics layer.

High-level design of LinkedIn's differential privacy platform, which sits between the production base and analytics users. Source: [Practical Differential Privacy at LinkedIn with Ryan Rogers](https://twimlai.com/twiml-talk-346-practical-differential-privacy-at-linkedin-with-ryan-rogers/)

Final Considerations

Differential Privacy has numerous use cases, and in the Brazilian context, where regulatory tightening looms on the horizon with the General Personal Data Protection Law (LGPD), it is more than necessary that CTOs/CIOs/CDOs, data analysts, data scientists, machine learning engineers, and other stakeholders be familiar with this type of mechanism, which protects both their customers and their own businesses.

Compliance in an environment that deals with personal data necessarily involves the triad {Differential Privacy / Data Anonymization / Internal Security Systems}; it is naive to think that any one of them alone will guarantee the minimum standard of compliance for all stakeholders.

This post aimed to present the implementation of DP mechanisms in an uncomplicated and practical way. There are countless technical aspects still to consider, such as sample sensitivity, privacy budget, restricted and unrestricted sensitivity, limited-domain algorithms, pay-what-you-get mechanisms, etc.

Compared to the potential risk of putting people in a vulnerable situation through the leakage of non-anonymized personal data, and even the volume of legal liability this can generate, Differential Privacy presents itself as a great tool for minimizing such risks.

Author’s Notes

[AN1] — Adversarial attacks of the “Membership Inference” type are those in which the attacker, by querying the model (which can be an .R file or even a binary), manages to infer whether a specific record was in the training set, using attributes present in the model. In my talk on Security in Machine Learning, I show how the personal information of a 91-year-old lady ended up in a Machine Learning model. The code is on GitHub.

[AN2] — If you are using the macOS operating system, installing gsl (GNU Scientific Library) is mandatory.

[AN3] — Neither the documentation nor the paper is clear about how the confidence level influences the privacy relaxation factor (or not) under RDP, nor about its influence on sensitivity.