Data: The New Oil or the New Uranium?
May 22, 2023
Originally posted on April 4, 2021.
![](https://cdn-images-1.medium.com/max/5120/0*K5mYl-TIIp_FkueW.jpg)
I have been thinking lately about how we data and machine learning engineers work, especially regarding the privacy of all the people who generate the data we analyze, mine, build models from, and store within our employers and organizations.
In particular, two events served as turning points in my mind regarding these privacy aspects.
The first was the data mega-leak in Brazil that compromised more than 223 million CPF numbers; curiously, the sources of these data are still unknown, and no one yet knows the extent and chain of consequences.
The second was the theft of therapy session records of approximately 40,000 people in Finland, which, besides being abject in itself, shows that we are more exposed than ever: a psychologist's old notebook, which would have had at most local exposure (if any), has become something that can be spread and sold all over the internet.
Because of this, I began to wonder about the actual utility of storing this data considering numerous perspectives: data analysis, machine learning, and even legal aspects.
The conclusion I reached was that in the vast majority of cases, contrary to what The Economist says, [**data is not the new oil**](https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data), or even the new electricity, as they want us to believe; rather, **data is much closer to being a new uranium**: it has extremely high potential utility, but without the proper infrastructure, the cost of managing it can lead to catastrophic situations.
![](https://cdn-images-1.medium.com/max/5000/0*muGIp3HQCDBTIm0T.jpg)
…
Although these data leak events triggered my reflection, the final inspiration for this essay was undoubtedly Bruce Schneier’s essay [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html)
Schneier lays out in his essay all the risks involved when we store data, and concludes that data is a toxic asset given the risk structure involved in information storage.
He points out that the three main reasons companies keep data as “assets” are:
(i) the fact that we are in the middle of the Big Data hype cycle where companies don’t even know what to do with the data but continue to collect it frantically;
(ii) companies are simply downplaying the risks and consequences of what might happen if a leak occurs, and finally
(iii) the fact that there are companies that understand points (i) and (ii) and still store the data considering a potential business model in which part of this data can be used, even with all the risk structure involved.
I tend to agree with this point of view, but I think that this position, besides being hard to put into practice, does not consider some mechanisms that can reduce the chances of a leak and minimize the consequences of a potential privacy violation.
…
In an ideal world, every company would enjoy “nirvana situations” in which (i) all data would be anonymized and decentralized, such that a potential leak would reveal little or no information about any member of the data set, and (ii) data storage systems would be robust enough to contain any kind of intrusion or leak, thanks to infrastructure, security, and training mechanisms for the people who ingest and consume this data.
And of course, these two “nirvana situations” would not come for free: they would require (i) staff training, (ii) investments in security infrastructure and changes to data collection and storage methods, and (iii) accepting a potential loss of information for companies.
However, since we do not live in an ideal world, and the technological and financial viability of all this can be prohibitive for the vast majority of companies, what can be done is to adopt solutions that minimize the attack surface and its vulnerabilities.
![](https://cdn-images-1.medium.com/max/2000/0*zkNGs5RO6Y9vQzpZ.jpg)
Potential alternatives to minimize a potential data leak
Personally, I don’t know if there is a canonical solution to the problem of data storage regarding privacy issues, considering implementation costs against the costs of a legal dispute due to privacy violation.
However, I think some practical measures can be used to minimize some of the risks in storing data.
[Data Provenance](https://www.artworkarchive.com/blog/provenance-what-is-it-and-why-should-it-matter-to-you): Provenance can be established at the time of data acquisition by identifying the origin of each record, for example with a column containing a timestamp and source metadata. In the art world, most legally sold works carry both a certificate of authenticity and a record of who sold them; for some works, provenance is so strict that the original buyer can be traced back more than 90 years. In the case of systems, provenance can include which application generated the record, the IP address, whether the user was authenticated, and, for external data, the source reference. I had the opportunity to work for some years in the [skip tracing](https://en.wikipedia.org/wiki/Skiptrace) sector for the credit derivatives market in Brazil, and I can state that today it is virtually impossible to know how information providers buy, sell, or curate data. And there is no secret here: no provenance? Do not acquire the data. From the moment your organization or company puts data into its database, it officially becomes the source of that information, and the legally responsible party.
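To make the idea concrete, here is a minimal sketch of a record that carries its own provenance metadata. All field and class names here are hypothetical illustrations, not a reference to any specific system:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ProvenancedRecord:
    """A record that carries origin metadata alongside its payload."""
    payload: dict
    source_app: str                         # application that generated the record
    source_ip: str                          # origin IP address
    authenticated: bool                     # was the user logged in?
    external_source: Optional[str] = None   # source reference, for third-party data
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = ProvenancedRecord(
    payload={"event": "signup"},
    source_app="checkout-service",
    source_ip="203.0.113.7",
    authenticated=True,
)
```

The point is not this particular schema, but that origin metadata travels with every record from ingestion onward, so that "where did this come from?" always has an answer.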
Differential Privacy: I have already covered this point on the Data Hackers blog, in the post “[What is Differential Privacy and how is it related to security in Machine Learning?](https://medium.com/data-hackers/o-que-%C3%A9-a-privacidade-diferencial-e-qual-a-rela%C3%A7%C3%A3o-com-seguran%C3%A7a-em-machine-learning-bcba0d72eba6)” and, with a practical example in R, in “[Differential Privacy in R using diffpriv](https://medium.com/data-hackers/privacidade-diferencial-no-r-usando-o-diffpriv-4478035d4697)”. Differential privacy shines because it is not concerned with “who is in the database” but with “what are the dynamics behind an aggregated behavior, such that no individual can have their behavior revealed”. Here I like to use the analogy between supermarkets and the websites that abuse tracking on the modern web. Most modern websites use excessive tracking to collect data for personalization and consumer goods recommendations of very low or questionable utility, such as news portal banners or “personalized ads”. It is no coincidence that some companies are moving away from this model and betting on other alternatives. In contrast, the layout model of modern supermarkets shows how to capture a dynamic without collecting a truckload of personal data to induce behavior or remove friction from the purchase process. Supermarkets don’t know who people are individually, but they know from the dynamics of the data they collect that it is always …
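For readers who have not seen the earlier posts, the core mechanism is simple to sketch. Below is a minimal illustration of the Laplace mechanism applied to a count query (a count has sensitivity 1, since adding or removing one person changes it by at most 1); this is a didactic sketch, not a production-grade implementation:

```python
import math
import random


def dp_count(records, epsilon):
    """Release the size of `records` with epsilon-differential privacy.

    A count query has sensitivity 1, so Laplace noise with scale
    1/epsilon is enough. Smaller epsilon = more noise = more privacy.
    """
    scale = 1.0 / epsilon
    u = random.random() - 0.5  # uniform in [-0.5, 0.5)
    # Inverse-CDF sampling of the Laplace(0, scale) distribution.
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return len(records) + noise


# An analyst sees an approximate count, never the exact membership.
noisy = dp_count(range(1000), epsilon=0.5)
```

Real deployments compose many queries and must track the total privacy budget; libraries such as diffpriv (covered in the R post above) handle that bookkeeping.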
[Federated Learning](https://ai.googleblog.com/2017/04/federated-learning-collaborative.html): Federated learning is a machine learning technique that trains models across numerous servers or devices in a fully decentralized way, so that no training data is shared between the instances that fit the algorithm. The only information shared relates to training parameters (e.g. gradient updates, convergence information, loss during training, etc.) from each of the places where training occurs. This maximizes data privacy and security, as there is no central repository holding all the data, as in the traditional way of training machine learning algorithms.
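The aggregation step at the heart of the canonical FedAvg algorithm can be sketched in a few lines. This toy version averages client parameter vectors with equal weights (the original algorithm weights by each client's sample count); the client values are invented for illustration:

```python
def federated_average(client_params):
    """FedAvg aggregation step: average the parameter vectors sent by clients.

    Each client trains on its own device and shares only its parameters
    (or gradients); the raw training data never leaves the client.
    """
    n_clients = len(client_params)
    return [sum(weights) / n_clients for weights in zip(*client_params)]


# Three hypothetical clients, each holding a locally trained 2-parameter model.
clients = [[0.2, 1.0], [0.4, 3.0], [0.6, 2.0]]
global_model = federated_average(clients)  # approximately [0.4, 2.0]
```

The server then broadcasts `global_model` back to the clients for the next local training round; at no point does it ever see a training example.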
Anonymization of records: This is a more traditional and simple alternative that, if well done, can help in the vast majority of cases. An interesting case was that of a digital bank that, despite holding users’ personal data due to legal obligation, internally made only anonymized data available to the marketing and data science teams, or data transformed with [Feature Hashing](https://arxiv.org/abs/0902.2206) (on top of prior anonymization) by the data engineering team.
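One common building block for this kind of setup is a keyed hash over personal identifiers: analytics teams keep a stable join key but never see the plaintext identity. A minimal sketch, with a hypothetical key (in practice it would live in a secrets manager and be rotated):

```python
import hashlib
import hmac

# Hypothetical secret key, stored outside the dataset; without it, the
# hashes cannot be linked back to the original identifiers by brute force.
SECRET_KEY = b"rotate-me-regularly"


def pseudonymize(identifier: str) -> str:
    """Keyed SHA-256 hash of a personal identifier: deterministic, so it
    still works as a join key across tables, but reveals no plaintext."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


token = pseudonymize("user@example.com")
```

Strictly speaking this is pseudonymization rather than full anonymization (a distinction the GDPR cares about): whoever holds the key can still re-identify, so the key must be guarded as carefully as the data itself.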
Eliminating the capture of irrelevant data: One thing I have observed in the industry for a few years is the natural impulse to capture as much data as possible, in some situations without any criteria. If I had to hypothesize why capturing useless data has become the standard, I would say:
- (i) With the falling price of storage, the marginal cost of generating new data has practically dropped to zero. In practice, this means that the old problem of storing a high volume of data has been minimized, and the challenge now is to bring in more data as fast as possible;
- (ii) The shift from transaction-based architectures to event-driven architectures. The point is quite simple: if in the past applications hit a database, performed a transaction via stored procedure, and stored only the closed cycle of that transaction, today the name of the game is capturing practically every possible state change, communicating it between systems, and storing it as fast as possible, even if that means implementing a totally expendable technology stack and doing [overarchitecture](https://www.nemil.com/on-software-engineering/beware-engineering-media.html) based on hype. It’s no wonder that all the hype about [Data Mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html) as an architecture proposal sounds almost like a modern counterculture, moving away from the “collect everything” archetype toward “collect the minimum necessary for the domain within a bounded context”;
- (iii) Platforms are increasingly data-hungry: for some reason beyond my knowledge, any platform handling trivial business flows demands an amount of information that not even notary or public services ask for. A simple example is food or beverage delivery sites asking for increasingly personal information, like CPF and date of birth. On a simple trip to a pharmacy in Brazil, the clerk asks for all of this to sell a simple aspirin;
- (iv) Increasingly complex (and increasingly questionable) analytics demands: I recognize that competition in some industries has reached a point where any marginal differentiator can mean success or disruption, and that data analysis plays a fundamental role in that. What I wonder is: do the clients of these analyses have the training or knowledge needed to ask the right questions, the ones that would actually motivate collecting so much data? Let’s think objectively: how many of these clients understand the importance of statistical significance tests for evaluating the effectiveness of an A/B test? How many perceive that causal factors can affect their KPIs, and that therefore part of their product interventions are, at best, a theater of results that confuses association with causation? My point is not against data analysis, but that the wrong question not only yields no business result, it also drags in an entire engineering effort to operationalize this data (more engineering time to collect it, more computational resources allocated, more damage to the user experience from increasingly slow sites, more engineering time to optimize the response times of sites and apps that are artificially slow, etc.);
- (v) Because of (ii) and (iii), and questionably because of (iv), [a bunch of data streaming solutions emerged](https://vicki.substack.com/p/you-dont-need-kafka) to move the largest possible volume of data in the shortest possible time, [whatever the engineering cost of implementing them](https://vicki.substack.com/p/you-dont-need-kafka). And that is without mentioning that, while these tools bring a brutal reduction in data latency and storage time, many business domains simply do not have the volume that justifies implementing these technologies for data capture and storage.
Data Purging: This is probably the simplest solution, and at the same time the hardest to sell, given that it goes against the current dogma that “more data is better”, even when most of this data has almost zero utility in terms of use or of causal relationships that could leverage the company or organization. Just like in offline life, unnecessary stored data is, at the end of the day, trash. Plain and simple. We can rationalize and try to push some utility onto unsuspecting stakeholders, but in the end it is just trash that will enrich cloud storage providers or external HD vendors. The reasoning is simple: Does your company that sells medicine really need the CPF of everyone who doesn’t take controlled medication? Does your food delivery app really need GPS location and access to your users’ phone file systems even outside of a delivery context? Does your affair dating app really need to store its users’ names and addresses in plain text? Does your video app really need 7 TB of totally unencrypted data? Do hotels really need to store passport numbers for years on end? Does your chatbot really need to use users’ private conversation data for algorithm training?
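Operationally, purging usually reduces to a retention policy applied on a schedule. A minimal sketch, with a hypothetical 90-day window and an invented record shape:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # hypothetical policy window


def purge_expired(records, now=None):
    """Keep only records younger than the retention window.

    Whatever is dropped here can no longer leak; that is the whole point.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return [r for r in records if r["created_at"] >= cutoff]


now = datetime(2021, 4, 4, tzinfo=timezone.utc)
records = [
    {"id": 1, "created_at": now - timedelta(days=10)},   # within the window: kept
    {"id": 2, "created_at": now - timedelta(days=400)},  # expired: purged
]
kept = purge_expired(records, now=now)
```

In a real system this would run as a scheduled job against the database (and its backups), with the retention window set per data category according to legal obligations rather than a single hard-coded constant.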
Final Considerations
Deep down, everyone who works with data knows (or should know) that there are inherent risks in storing, accessing, and manipulating this information; however, a broader awareness of privacy is something that still needs to be built.
As I stated earlier, some of these techniques used together can help minimize the impact of a potential leak or mitigate some of the risks inherent in data retention and use activities.
Numerous countries are on a regulatory march regarding the use of user data: Brazil with the General Personal Data Protection Law (LGPD) and Europe with the General Data Protection Regulation (GDPR). Because of this, not only is the room to operate in regulatory gray zones smaller, but there is already a legal basis for heavy fines and penalties. In other words, the game has moved from the childish motto of ‘[Move fast and break things](https://hbr.org/2019/01/the-era-of-move-fast-and-break-things-is-over)’ to “Be professional, know the tradeoffs, take the risk, and deal with the consequences”.
Whether or not to implement some of these techniques depends not only on investment, but also on intellectual capital and a feasibility assessment, and the latter may well compete with business priorities. Fair and understandable.
However, there is a big difference between thinking your company is sitting on an infinite oil field in the Persian Gulf and realizing that it is actually sitting on a pre-Chernobyl reactor.
![](https://cdn-images-1.medium.com/max/5200/0*4AmMq7zMgsOSri9L.jpg)
I want to know what you think about this, if possible leave a comment or contact me on Twitter or at the email on my Medium profile.
[Note 1] — Incidental reference to this tweet by Silvio Meira. I wrote this phrase down somewhere and thought about it for some time, and by pure synchronicity this post came out.