Differential Privacy in R using diffpriv
2021 Jan 04

A while ago, on Data Hackers, I posted about the consequences of relying solely on anonymization and how Differential Privacy (DP) can help mitigate adversarial attacks such as Membership Inference [AN1], among other types of attacks/failures.
However, in this post, I will try to bring some practical aspects and, of course, some code to make these concepts a bit more concrete.
For those just arriving at this post and not knowing what Differential Privacy is, I recommend reading the article “What is Differential Privacy and how is it related to security in Machine Learning?” which has a very simple explanation of the subject. Today’s post will be about potential uses, implementation, and trade-offs between rigor and uncertainty when using DP.
Uses of Differential Privacy
Differential Privacy can be applied in several scenarios such as:
- Situations requiring the analysis of sensitive data where privacy cannot be violated (e.g., health information of a group of people, banking information, etc.);
- Increasing the robustness of Machine Learning algorithms with respect to unseen data (e.g., applying DP multiple times to the training set with different epsilon values);
- Minimizing data leakage through the reuse of holdout sets modified with DP mechanisms, an approach known as Adaptive Data Analysis;
- Stepwise regression models during model fitting, where the training set is reused with Differential Privacy to remove sampling bias;
- Releasing synthetic data or microdata (e.g., census information); and
- Generating databases with minimum privacy guarantees for Machine Learning models (and even for Data Augmentation).
These are just a few scenarios illustrating the potential use of Differential Privacy in everyday work.
However, let’s move on to the code part in a sensitive data analysis scenario.
Using diffpriv for Differential Privacy Application
diffpriv is an R library created in 2017 by Benjamin Rubinstein and Francesco Alda for analyzing data that requires the application of differential privacy, for example, when making reports/databases/analyses available to third parties.
diffpriv contains the implementation of generic mechanisms to transform private data. The mechanisms present are: Laplace, Gaussian, Exponential, and the Bernstein Mechanism. The package documentation is very complete and the explanations are very didactic.
In our case, we will conduct an analysis of data related to the salary and age of people in a certain company. The idea is to see the differential privacy mechanism in practice regarding the final result.
First, let’s install the gsl and diffpriv libraries [AN2]:
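The original snippet is not reproduced here, so below is a minimal installation sketch, assuming a standard CRAN setup (on macOS, the GSL system library must be present first; see [AN2]):

```r
# Install dependencies from CRAN (the gsl package requires the GNU
# Scientific Library on the system; see [AN2] for macOS users)
install.packages("gsl")
install.packages("diffpriv")

library(gsl)
library(diffpriv)
```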
With the libraries loaded, let’s create a dataset with the employees of this company:
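The original values are not shown in this excerpt, so here is an illustrative sketch with hypothetical salaries and ages, chosen so the averages match the ones reported below (9830 BRL and 45 years):

```r
# Hypothetical employee data (illustrative values only).
# Salaries in BRL; ages in years.
empregados <- data.frame(
  id             = 1:10,
  lista_salarios = c(5000, 12000, 8300, 15000, 7200,
                     9900, 11000, 6500, 14400, 9000),
  lista_idades   = c(32, 51, 45, 60, 38, 47, 41, 55, 49, 32)
)
```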
Now that we have our dataset in R, let’s create a function to calculate the mean; a function we will use to calculate the average age and salaries of this company:
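A minimal sketch of that function (the name retorna_media is the one used later in the post):

```r
# Non-private target function: the mean of a numeric vector
retorna_media <- function(X) mean(X)

retorna_media(empregados$lista_idades)    # 45
retorna_media(empregados$lista_salarios)  # 9830
```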
We can see that the average age of the people in the dataset is 45 years and the average salary is 9830 BRL.
However, this precise information cannot be passed on as we only want to represent the data dynamics instead of offering an analysis/mechanism that violates the privacy of these people.
Therefore, let’s proceed to create a differential privacy mechanism using diffpriv.
As mentioned above, the diffpriv library implements four noise-addition mechanisms for Differential Privacy. The two we will consider here are:
- Laplace Mechanism: computes the function f() and perturbs each component of its output with noise whose scale is calibrated to the sensitivity of the function, preserving (ε, 0)-differential privacy (a hand-rolled sketch of this idea follows the list); and
- Bernstein Mechanism: based on the mechanism developed in the seminal work by Alda and Rubinstein (2017). It uses Bernstein polynomials to approximate the function being privatized, applying perturbations to the polynomial coefficients instead of perturbing the values themselves. Since the coefficients are the only components of the function that are perturbed, this is enough for privacy preservation.
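To make the Laplace idea concrete, here is a hand-rolled illustration (independent of diffpriv; both function names are mine) of a private mean with noise calibrated as sensitivity/ε:

```r
# Draw Laplace(0, b) noise via inverse-CDF sampling
rlaplace <- function(n, b) {
  u <- runif(n) - 0.5
  -b * sign(u) * log(1 - 2 * abs(u))
}

# Private mean for values known to lie in [0, M]: the sensitivity of the
# mean over n records is M / n, so the noise scale is (M / n) / epsilon
media_privada <- function(x, M, epsilon) {
  sensibilidade <- M / length(x)
  mean(x) + rlaplace(1, sensibilidade / epsilon)
}
```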
That said, let’s choose the Laplace Mechanism which will perform noise addition to keep our numerical answers private.
To do this, we will pass our retorna_media function to this Laplace mechanism (which we will call mecanismo_privacidade) as follows:
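A sketch of that instantiation; the sensitivity slot is left at its default here because we will estimate it with the sensitivity sampler further below, and dims = 1 because retorna_media returns a single number:

```r
# Laplace mechanism wrapping our non-private target function
mecanismo_privacidade <- DPMechLaplace(target = retorna_media, dims = 1)
```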
Within the DPMechLaplace mechanism, we have the following attributes:
- sensitivity: determines the amplitude of the perturbation applied to the function's output, i.e., the magnitude by which the data of a single individual can alter f in the worst case. A higher sensitivity increases privacy but also the uncertainty/noise in the data;
- target: a non-private function whose output will be privatized. The Laplace Mechanism expects functions that return numeric vectors; and
- gammaSensitivity: the privacy confidence level. High confidence demands higher sensitivity (and a wider noise spectrum, i.e., more uncertainty in the data and less utility), while a lower confidence level narrows the noise spectrum, increasing utility (less uncertainty/noise) but reducing the level of privacy.
After defining our privacy mechanism, let's create a function with the distribution that will serve as the source of random records. In this case, we will use the normal distribution (rnorm) to generate n records (in our case, the number of employees) and run a quick execution test.
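A minimal sketch of such a function:

```r
# Oracle: generates n i.i.d. random records from a normal distribution,
# scaled up so the samples live on the same scale as the salaries
funcao_distribuicao <- function(n) rnorm(n) * 10000

# Quick execution test
funcao_distribuicao(5)
```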
I multiplied the results by 10,000 just so the result of this list would be on the same scale as the salaries. The distribution must always be on the same scale as the response variable to be privatized.
An important piece we must create is the sample generation mechanism, which estimates the sensitivity used by mecanismo_privacidade from the distribution function that generates the randomized samples.
The intuition behind this sensitivity sampler (called sensitivitySampler() in diffpriv) is that independent and identically distributed random records are drawn from the oracle and used to probe how much the target function can change, so that the noise is generated according to the estimated sensitivity (i.e., within a certain noise spectrum).
This is implemented in diffpriv as follows:
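A sketch of that call, using gamma = 0.7 to match the value that appears in the salary output further below:

```r
# Probe the sensitivity of retorna_media empirically, drawing random
# records from the oracle; n must equal the size of the real dataset
mecanismo_resposta_privado <- sensitivitySampler(
  object = mecanismo_privacidade,
  oracle = funcao_distribuicao,
  n      = nrow(empregados),
  gamma  = 0.7
)
```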
The sensitivitySampler() function has the following parameters:
- object: an object of class DPMech-class (i.e., a privacy mechanism);
- oracle: the source of random records; a function that returns a list, matrix, or data.frame;
- n: the number of records in the dataset. It must match exactly the number of records that will go through the DP process;
- m: an integer sample size used to estimate the sensitivity; and
- gamma: the privacy confidence level, following the Random Differential Privacy (RDP) relaxation [AN3].
With our mecanismo_resposta_privado object created, we can check some of its values by executing the following code block:
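For example (assuming the S4 slots documented for DPMech-class):

```r
# Estimated sensitivity and the gamma used by the sampler
print(mecanismo_resposta_privado@sensitivity)
print(mecanismo_resposta_privado@gammaSensitivity)
```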
With our private response mechanism created, we can now move to the part where we generate our privatized responses regarding the database we created at the beginning of this post.
To do this, we will call the releaseResponse function as in the snippet below:
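A sketch of the call; dados_a_privatizar is a placeholder for whichever vector we want to privatize:

```r
resposta_privada <- releaseResponse(
  mechanism     = mecanismo_resposta_privado,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = dados_a_privatizar  # placeholder: the vector to privatize
)
```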
The releaseResponse parameters are:
- mechanism: an object of type DPMech-class, i.e., the sensitivity-sampled mechanism we instantiated previously, which will generate our randomized responses considering the chosen mechanism (Laplace), the sensitivity, the distribution, and the degree of confidence;
- privacyParams: the privacy parameters, such as ε (epsilon). The smaller the ε, the more privacy is preserved but the worse the accuracy (due to more noise); a very large ε means less noise is added, so privacy worsens; and
- X: the database containing the data that needs to go through DP.
To obtain private responses relative to the salary list we passed (empregados$lista_salarios), we need to execute the following code:
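A sketch of that execution, with print statements shaped to match the output shown below; reading the privacy parameters from the @gamma/@delta/@epsilon slots of the returned params object is an assumption about its layout:

```r
resposta_salarios <- releaseResponse(
  mechanism     = mecanismo_resposta_privado,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = empregados$lista_salarios
)

print(paste0("privacy_params - gamma: ",   resposta_salarios$privacyParams@gamma))
print(paste0("privacy_params - delta: ",   resposta_salarios$privacyParams@delta))
print(paste0("privacy_params - epsilon: ", resposta_salarios$privacyParams@epsilon))
print(paste0("response_without_privacy: ", retorna_media(empregados$lista_salarios)))
print(paste0("response_with_privacy: ",    round(resposta_salarios$response)))
```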
The results obtained in my execution were:
# Salary List
[1] "privacy_params - gamma: 0.7"
[1] "privacy_params - delta: 0"
[1] "privacy_params - epsilon: 1"
[1] "response_without_privacy: 9830"
[1] "response_with_privacy: 9213"
In this case, the average salary we previously calculated from the employee list was 9830; with the differential privacy mechanism, the released value was 9213. Remember that the goal here is to preserve the overall dynamics of the numbers even with the addition of noise, rather than to release precise information (which could reveal sensitive employee information).
The age list can be calculated using the same parameters through the script below:
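A sketch of that script; note that the oracle must now live on the age scale (the normal parameters below are assumptions), and gamma = 0.5 matches the output shown next:

```r
# Oracle on the same scale as the ages (assumed distribution parameters)
funcao_distribuicao_idades <- function(n) rnorm(n, mean = 45, sd = 10)

mecanismo_idades <- DPMechLaplace(target = retorna_media, dims = 1)
mecanismo_idades <- sensitivitySampler(
  object = mecanismo_idades,
  oracle = funcao_distribuicao_idades,
  n      = nrow(empregados),
  gamma  = 0.5
)

resposta_idades <- releaseResponse(
  mechanism     = mecanismo_idades,
  privacyParams = DPParamsEps(epsilon = 1),
  X             = empregados$lista_idades
)

print(paste0("response_without_privacy: ", retorna_media(empregados$lista_idades)))
print(paste0("response_with_privacy: ",    round(resposta_idades$response)))
```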
In this case, I had the following result:
# Age List
[1] "privacy_params - gamma: 0.5"
[1] "privacy_params - delta: 0"
[1] "privacy_params - epsilon: 1"
[1] "response_without_privacy: 45"
[1] "response_with_privacy: 43"
The values in my execution were quite similar: 45 for the original average against 43 with differential privacy noise; depending on the execution, the values may differ, since we did not set a seed beforehand.
Trade-off: Utility x Privacy
Given that applying Differential Privacy mechanisms to data necessarily implies changing the data, it is natural that rigor is reduced; after all, we are introducing uncertainty into the data.
Since this utility (rigor/consistency) x privacy (uncertainty/obfuscation) trade-off is certain when using Differential Privacy, there is no definitive rule determining the use of one scenario or another, leaving the decision to the organizations and professionals involved.
Some architecture models for data analysis systems considering differential privacy are already appearing and can serve as inspiration, such as LinkedIn’s PriPeARL which guarantees privacy by design in the reporting and analytics section.
Final Considerations
Differential Privacy has numerous use cases, and within the Brazilian context, where regulatory tightening looms on the horizon with the General Personal Data Protection Law (LGPD), it is more than necessary for CTOs/CIOs/CDOs, data analysts, data scientists, machine learning engineers, and other stakeholders to be familiar with this type of mechanism, which protects both their customers and their own businesses.
The compliance of an environment dealing with personal data necessarily passes through the triad {Differential Privacy / Data Anonymization / Internal Security Systems}; and it is naive to think that only one of them will guarantee the minimum standard of compliance among all stakeholders.
This post aimed to present the implementation of DP mechanisms in an uncomplicated and practical way. There are countless technical aspects still to be considered, such as sample sensitivity, privacy budget, restricted and unrestricted sensitivity, limited-domain algorithms, Pay-what-you-get mechanisms, etc.
Compared to the potential risk of putting people in a fragile situation through leaks of non-anonymized personal data, and even the volume of legal liability this can cause, Differential Privacy presents itself as a great tool to minimize such risks.
References
- Practical Differentially Private Top-k Selection with Pay-what-you-get Composition
- The Bernstein Mechanism: Function Release under Differential Privacy
- LinkedIn's Audience Engagements API: A Privacy Preserving Data Analytics System at Scale
- Pain-Free Random Differential Privacy with Sensitivity Sampling
- The reusable holdout: Preserving validity in adaptive data analysis
- PriPeARL: A Framework for Privacy-Preserving Analytics and Reporting at LinkedIn
- Difference between ε-differential privacy and (ε, δ)-differential privacy
Author Notes
[AN1] — Adversarial attacks of the “Membership Inference” type are those where the attacker, by querying the model (which can be an .R file or even a binary), can infer whether a specific record was in the training set based on attributes present in the model. In my talk Security in Machine Learning, I show how personal information of a 91-year-old lady ended up in a Machine Learning model. The code is on GitHub.
[AN2] — If you are using the MacOS operating system, installing the gsl (GNU Scientific Library) is mandatory.
[AN3] — Neither the documentation nor the paper are clear about the influence of the confidence level on the relaxation (or not) of privacy using RDP and its influence on sensitivity.