Differential Privacy and Machine Learning

TL;DR: A handful of privacy and security measures can prevent a data leak or theft, or at least minimize its impact.

This week in Finland, a data theft occurred at a psychotherapy center called Vastaamo.

The target of the theft was the therapy-session records of a little over 40,000 people, extracted by hackers who demanded 450,000 euros not to disclose the information.

Some of these people are already being blackmailed directly: pay the hackers, or have their psychological counseling information leaked (some with many years of therapy documented in this database).

This event is marginally related to two topics I addressed this year: my talk at PyCon Africa in August, where I spoke about security in Machine Learning and adversarial attacks, and the post about latent conditions and active failures in Machine Learning.

Although those topics didn’t speak directly about system intrusion and data leakage, I wanted to use this case as an example to bring attention back to these aspects of algorithmic and data security, and also to talk at a high level about differential privacy.

But first, what happened?

The Data Leak

The company Vastaamo is a psychotherapy center operating in various locations in Finland, functioning as a sort of psychotherapy and psychiatry franchise that handles a wide range of mental health cases.

Vastaamo’s main services include treatment for depression, gambling addiction, anxiety, trauma, alcoholism, and general relationship problems.

Most of the counseling sessions were recorded in writing and stored in a database that was unencrypted and not anonymized and that, according to some reports, linked every record to each patient's address, email, and other personal data.

Given this scenario, it is technically possible for these 40,000 people to have their names, phone numbers, emails, and therapy sessions publicly revealed, which, besides being a heinous crime against people in extremely vulnerable positions, could have consequences that are impossible to measure. Even high-level politicians are reportedly being extorted.

An aggravating factor in all this is that, even in the *press release* about the crisis, Vastaamo disclosed [1] that it retains therapy-session information in its database for 12 years and that Finnish legislation [2] rules out removing these records even upon patient request, which clearly contradicts the European Union's data protection law, also known as the General Data Protection Regulation (GDPR).

And as if that weren't enough, the company apparently has no insurance policy covering data leaks, nor any safeguard against the obvious prospect of indemnity lawsuits from the victims.

The objective of this essay is to show that data professionals (engineers, scientists, and Machine Learning practitioners) can learn from these catastrophic events.

These events provide convincing arguments for implementing countermeasures that reduce the probability, or mitigate the impact, of similar incidents at their own organizations.

Countermeasures against adversarial attacks in Machine Learning

In my talk at PyCon Africa this August, I spoke about “Security in Machine Learning Engineering: Adversarial attacks and countermeasures,” where in one example I open a file with the .RData extension and show some training instances.

The records, besides not being properly encrypted or anonymized, make it possible to infer who is in the database, as in the example below used by Professor Karandeep Singh (@kdpsinghlab):

Translation: A 91-year-old Asian person with a medical history of psychosis and treated for hypertension is hospitalized by the University of California, San Francisco Hospital from 2016 to 2017. Source: [https://twitter.com/kdpsinghlab/status/1181474070006829057](https://twitter.com/kdpsinghlab/status/1181474070006829057)

The full code for the talk can be found in the GitHub repository.
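
To see why a record like the one above is identifying even without a name attached, here is a toy pandas sketch (the data and column names are invented for illustration): any combination of quasi-identifiers shared by exactly one row singles out an individual, which is the intuition behind k-anonymity.

```python
import pandas as pd

# Invented toy data; no real patient information.
df = pd.DataFrame({
    "age":       [91, 45, 33, 91, 45],
    "ethnicity": ["asian", "white", "asian", "white", "white"],
    "diagnosis": ["psychosis", "anxiety", "anxiety", "depression", "anxiety"],
})

# How many people share each combination of quasi-identifiers?
group_sizes = df.groupby(["age", "ethnicity", "diagnosis"]).size()

# Any group of size 1 is a unique, re-identifiable individual
# (k-anonymity asks that every group contain at least k rows).
unique_individuals = group_sizes[group_sizes == 1]
print(unique_individuals)
```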

The main point of the talk is that Machine Learning applications must take security aspects into account at all stages of their lifecycle, from data extraction to model monitoring in production.

If we analyze the types of adversarial attacks, they essentially all involve (i) manipulating data and/or (ii) exploiting failures to preserve data privacy during model building. Well-known classes of attack include data poisoning, evasion (adversarial examples), model inversion, membership inference, and model extraction.
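
As a toy illustration of the data-manipulation class (not an example from the talk itself), the sketch below flips a fraction of training labels, a simple form of data poisoning, and compares the resulting model against one trained on clean data; the dataset and poisoning rate are invented.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Baseline trained on clean labels
clean_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

# Poisoned training set: an attacker flips 20% of the labels
rng = np.random.default_rng(0)
flip = rng.random(len(y_tr)) < 0.20
y_poisoned = np.where(flip, 1 - y_tr, y_tr)
poisoned_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_poisoned).score(X_te, y_te)

print(f"clean accuracy={clean_acc:.3f}, poisoned accuracy={poisoned_acc:.3f}")
```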

Besides these aspects, anonymization can help in the process of removing any link between users and records in the database.

According to the Brazilian General Data Protection Law (LGPD), anonymized data is data that originally related to a person but has gone through steps ensuring it can no longer be linked to that person. If data is anonymized, the LGPD no longer applies to it.

Regarding anonymization, the LGPD also highlights that data is only considered effectively anonymized if it does not allow, by technical or other means, the data subject to be rediscovered. If identification is possible in any way, the data is not in fact anonymized, only pseudonymized, and therefore remains subject to the LGPD.
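
To make the distinction concrete, here is a minimal sketch (record fields invented for illustration): replacing a direct identifier with a hash or random ID is only pseudonymization, since a lookup table or linkage attack can restore the identity, while anonymization requires removing or generalizing identifying attributes so the link cannot be rebuilt.

```python
import hashlib

record = {"name": "Maria Silva", "city": "Helsinki", "diagnosis": "anxiety"}

# Pseudonymization: the direct identifier is replaced, but the mapping
# (or a rainbow-table / linkage attack) can restore it, so the LGPD still applies.
pseudo = dict(record)
pseudo["name"] = hashlib.sha256(record["name"].encode()).hexdigest()[:12]

# A step toward anonymization: drop the identifier entirely and generalize
# quasi-identifiers (here, city -> country). Whether this is enough depends
# on how unique the remaining attributes are within the dataset.
anonymized = {"country": "Finland", "diagnosis": record["diagnosis"]}

print(pseudo)
print(anonymized)
```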

A countermeasure that is rarely discussed but can be used by machine learning engineers is differential privacy, which can be applied during data collection, transformation, and even at the time of model inference.

But first, let’s look at the definition of differential privacy and a rough example of what this mechanism exactly is.

What is Differential Privacy?

Using the definition from Wikipedia, differential privacy is a system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset. The idea behind differential privacy is that if the effect of making a single arbitrary substitution in the database is small enough, the query result cannot be used to infer much about a single individual, and therefore provides privacy.

In other words, a controlled amount of noise is added to the data, or to the answers computed from it, so that individual records cannot be linked back to specific users, while at the same time the aggregate behavior of the data remains intact for analysis or for training Machine Learning models.
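
As a rough, library-agnostic illustration, the sketch below applies the classic Laplace mechanism to a count query: the noise scale is calibrated to the query's sensitivity and a privacy parameter epsilon, so the published figure is useful in aggregate while blurring any single individual's contribution. The count and the epsilon value are invented for the example.

```python
import numpy as np

def laplace_count(true_count, sensitivity=1.0, epsilon=0.5, rng=None):
    """Return a differentially private count using the Laplace mechanism.

    A counting query has sensitivity 1: adding or removing one person changes
    the result by at most 1, so noise drawn from Laplace(0, sensitivity/epsilon)
    is enough to mask any single individual's presence.
    """
    rng = rng or np.random.default_rng()
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# Hypothetical query: how many patients in the database are treated for anxiety?
# We publish a noisy count instead of the exact one.
exact = 1234                       # the real (sensitive) answer
private = laplace_count(exact, epsilon=0.5)
print(f"exact={exact}, published={private:.1f}")
```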

A simple example of differential privacy

To understand roughly what differential privacy is, let’s imagine a scenario where a research institute opens a survey to interview random people.

But the subject of this research isn’t something like asking about voting intentions for a mayoral or presidential candidate. The study aims to find out something much more personal through the following question:

Have you ever cheated on your spouse/partner?

The question itself might seem indiscreet and uncomfortable, but since the object of research requires it, it must be asked.

However, some people might feel uncomfortable giving a sincere response; so a method is proposed by the interviewer:

“I’m going to toss a coin. If it’s Heads, you must tell me the truth. If it’s Tails, you tell me whatever you want, whether it’s the truth or not.”

And here is where the differential privacy aspect enters our survey: what no one besides the researchers knows is that, instead of a fair coin (50% chance for each side), they use a biased coin that lands on Heads 70% of the time.

In practice, this means that out of every 10 interviewees, on average 7 will have to tell the truth.

With the exception of the researchers, no one analyzing this aggregated data knows the probability that any given answer is true. Everyone only knows that an undetermined amount of noise is mixed in with the correct answers, so there is no way to associate a response with a specific interviewee.

This mechanism solves two problems: (i) it preserves the privacy of respondents, since every answer carries plausible deniability, and (ii) it makes linking responses to individuals much harder, whether for whoever is analyzing the data or in the event of a data leak.
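
Here is a minimal numerical sketch of that survey, under the explicit assumption that a "Tails" respondent answers yes or no completely at random: knowing the coin's bias, the researchers can still recover an unbiased estimate of the aggregate rate, even though no individual answer can be trusted.

```python
import numpy as np

rng = np.random.default_rng(42)

n = 10_000                 # number of interviewees (made up for the example)
true_rate = 0.20           # hypothetical real rate of "yes" answers
p_heads = 0.70             # biased coin: 70% chance of having to tell the truth

truth = rng.random(n) < true_rate            # each person's real answer
heads = rng.random(n) < p_heads              # coin toss at each interview
random_answer = rng.random(n) < 0.5          # assumption: on Tails, answer at random

reported = np.where(heads, truth, random_answer)

# Researchers only ever see `reported`, never `truth`. Knowing the coin's bias,
# they can still recover the aggregate rate:
#   P(reported yes) = p_heads * true_rate + (1 - p_heads) * 0.5
observed = reported.mean()
estimated = (observed - (1 - p_heads) * 0.5) / p_heads

print(f"observed={observed:.3f}, estimated true rate={estimated:.3f}, real={true_rate:.3f}")
```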

Packages and solutions for differential privacy

Directly from the excellent EthicalML repository, I've extracted some tools for differential privacy and related privacy-preserving techniques:

  • Google’s Differential Privacy — This is a C++ library of ε-differentially private algorithms, which can be used to produce aggregate statistics on numerical datasets containing private or sensitive information.
  • Microsoft SEAL — Microsoft SEAL is an easy-to-use open-source homomorphic encryption library (licensed under MIT) developed by the Cryptography Research group at Microsoft.
  • PySyft — PySyft decouples private data from model training, using Multi-Party Computation (MPC) in PyTorch.
  • TensorFlow Privacy — A Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy.
  • Uber SQL Differential Privacy — Uber’s library for using differential privacy for SQL queries.
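
As a rough sketch of what using one of these libraries looks like, here is a DP-SGD example loosely based on the TensorFlow Privacy tutorials; note that import paths and compatible TensorFlow/Keras versions change between releases, and all hyperparameters below are placeholders rather than recommendations.

```python
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

# Toy stand-in for a sensitive dataset (shapes and values are made up).
x_train = tf.random.normal((1000, 20))
y_train = tf.cast(tf.random.uniform((1000,)) > 0.5, tf.int32)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2),
])

# DP-SGD: clip each example's gradient and add noise before averaging,
# so no single training record dominates the model update.
optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # maximum L2 norm of per-example gradients
    noise_multiplier=1.1,    # noise std-dev relative to the clipping norm
    num_microbatches=50,     # must evenly divide the batch size
    learning_rate=0.15,
)

# The loss must stay per-example (no reduction) so the optimizer can
# clip and noise gradients at the microbatch level.
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True, reduction="none")

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=250)
```

The privacy guarantee actually obtained (the epsilon) depends on the noise multiplier, the sampling rate, and the number of epochs; TensorFlow Privacy ships accounting utilities to estimate it, which is worth doing before calling any model "differentially private."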

If there’s demand, I’ll create some tutorials here on the blog or on Medium about each of these tools.

Final Considerations

No one yet knows what the consequences will be: neither for the people who stole the data nor, especially, for the victims who had their private lives and psychological counseling exposed.

As I mentioned in my article “Machine Learning and the Swiss cheese model: active failures and latent conditions”, this problem isn't due to a single factor but to multiple active failures and latent conditions that aligned and made this data theft possible, such as storing session notes unencrypted and without anonymization, keeping full identifying details alongside clinical records, retaining the data for many years, and having no insurance or contingency plan for a breach.

For machine learning engineers, the major lesson is that security and privacy are serious matters; these failures show that technology and its processes must be handled as prudently and responsibly as possible.

And for technology colleagues in Finland, it seems that heavier regulation and a firmer hand from the government are on the way.

If you are in a crisis situation, seek specialized help. Life is always worth living. Below are some resources for crisis situations:


Notes

[1] Original text regarding legislation concerning data usage:

Q: Can customer information be deleted upon request according to data protection legislation?

Several requests we received concerned the deletion of customer information. However, the therapeutic patient information processed by Vastaamo belongs to statutory patient documents. Patient documents refer to documents or technical recordings used, prepared, or received in the organization and implementation of patient care, which contain personal health information or other personal details (Patient Status and Rights Act 785/1992). A great deal of special legislation applies to patient data (detailed further in the statement), which excludes or limits several rights under the General Data Protection Regulation, including the right to data deletion. In the healthcare context, a data subject's data cannot be deleted upon request.

We cannot therefore delete data, as healthcare professionals have an obligation to prepare patient document entries for all patient service events, such as visits. These entries must be kept for the time specified in the Ministry of Social Affairs and Health patient document decree (298/2009) and its annex. Furthermore, an individual entry concerning a service event cannot be deleted unless it is a clearly incorrect entry. The assessment is made from the perspective of the understanding prevailing at the time the entry was prepared. If some information proves incorrect and is corrected, both the original and the corrected information must remain readable later. Even in a situation where information unnecessary for patient care is deleted, an entry about the deletion must be made in the patient documents.

Q: How long is patient data kept?

All patient documents must be kept for at least 12 years after the information was entered; however, most data must be kept for the patient's entire lifetime and another 12 years after death (Ministry of Social Affairs and Health Decree on Patient Documents https://finlex.fi/fi/laki/alkup/2009/20090298, see annex at bottom of page). Vastaamo deletes personal data it no longer has a statutory obligation to keep as soon as possible.

[2] Additional leaflet about data legislation information: https://vastaamo.fi/files/potilasrekisteri_tietosuojaseloste.pdf