Accountability, Core Machine Learning, and Machine Learning Operations
2020 Mar 29

Anyone following the technology debate through academia, industry, conferences, and the media has noticed that Artificial Intelligence (AI) and its sub-fields are the hottest topics of the moment.
Some companies that have the core of their business in digital systems/platforms (and some non-digital ones) have understood that the use of Machine Learning (ML) has great potential both for cases of optimization in the way the company works and for cases of direct revenue generation.
This can be seen in numerous businesses ranging from the banking sector, through recommendation systems for entertainment, and even reaching medical applications.
This post will attempt to briefly describe how ML is shaping many businesses, provide a short reflection on the march of accountability [1], and finally offer some brief considerations regarding Core Machine Learning teams and the Machine Learning Operations (MLOps) approach.
How is Machine Learning shaping some industries and what is the degree of responsibility for engineering teams?
With the adoption of machine learning by industry, a natural movement began in which machine learning and industry shape each other.
If, on one hand, industry benefits from Machine Learning platforms to obtain predictions, classifications, inferences, and decision-making at scale with near-zero marginal costs; on the other, Machine Learning benefits from industry through access to research and development resources unimaginable in academia, access to resources whose cost would make studies otherwise unfeasible, and an increase in the engineering maturity of its methods.
However, what we are ultimately talking about here is the scale at which decisions are made in industry, and how R&D in Machine Learning is advancing at a very high speed.
That said, we can state that today these systems are no longer in the harmless arena of ideas and proofs of concept; rather, they are active elements in processes of interaction between people and businesses at massive scale.
And due to this scale, a series of new questions that were not as concerning or were hidden in the past now take on greater importance, such as:
- If in the past a white manager refused a loan to a client based on skin color, gender, or disability, little or nothing happened to that manager and/or bank. Today, Machine Learning algorithms without proper analysis and monitoring of their outputs can automate and amplify this type of bias, exposing the bank to unprecedented legal and public relations liabilities. Currently, there are people working in disciplines such as fairness and ethics to minimize these problems.
- If formerly television pushed all kinds of programming, no matter how degrading, today the entertainment industry’s algorithms are forced to take into account a higher standard of quality, under penalty of negative scrutiny by public opinion and the media.
- If in the past less popular artists were hostages of the famous Jabá (i.e. being played on the radio in exchange for payment), in a scenario where only artists with big managers had preference, today streaming platforms not only deal with a more diverse musical market but are starting to weigh relevance for the user while trying to promote equity among artists within the platform.
- In the past, if we needed to perform diabetic retinopathy tests we depended only on the doctor, who is a fallible being like any other; today we have auxiliary systems that can provide a second opinion and offer a chance for review in cases of underdiagnosis.
- If some time ago we depended exclusively on legal systems that could carry numerous biases, where someone’s fate depended on the good mood of state agents, today disciplines like fairness and transparency are already helping to minimize these biases in the legal sector and can provide a fairer, more auditable, and above all faster judgment.
As we can see in these examples, present-day problems such as structural human biases, lack of diversity, structural promotion of injustice, and abuse of authority can be minimized with ML using tools such as fairness, transparency, accountability, and explainability.
And given the points raised above, it goes without saying how important it is, and how much responsibility each ML professional carries, to ensure that an automated decision does not include and/or amplify these systematic biases.
One of the greatest truths in technology is that computer systems, most of the time, work to amplify behaviors and skills. An ML system that does not take structural biases into account is destined not only to perpetuate but also to amplify those same biases at high scale.
And given the enormous authority engineering holds over the implementation of these systems, accountability will automatically come with the same intensity as the degree of impact of these solutions.
Accountability will come voluntarily and/or coercively
Given all the scenarios where ML platforms have a direct impact on industry, and all the potential risks and impacts on society, there is a regulatory march coming from numerous fronts that will place much greater accountability on companies and ML engineers.
This accountability will essentially be related to sensitive aspects that concern society as a whole: ethics, fairness, diversity, privacy, security, the right to explanation of algorithmic decisions (for those under the GDPR), as well as, of course, ML-specific aspects (e.g. reproducibility, model evaluation, etc.).
This, more than ever, places great pressure on all of us (engineers, data scientists, product managers, CTOs, CEOs, and other stakeholders) not only to do our jobs but also to pay attention to all these aspects.
If this scenario sounds distant or out of reality, I invite the most skeptical to honestly answer the following questions regarding their current employer:
- Would your company have the stature to withstand press scrutiny over a supposedly innocent experiment by the product and Data Science teams that intentionally manipulated almost 700,000 users? An experiment that may have broken medical ethics protocols?
- Could your company survive a public scrutiny campaign over a marketing effort in which the Data Science team built a predictive model to recommend items by mail to potentially pregnant women, some of whom did not even know they were pregnant or had never shared this information?
- What would happen if a credit scoring algorithm developed by you placed your company in the position of having to defend itself against accusations of gender bias on social media? Such as this case, which already has more than 28,000 likes on Twitter.
- Would your current employer be prepared to lose 99% of its market value in 6 months, plus a 12-million-dollar fine, due to Glue Code put into production because of technical debt in versioning and code review?
I could list numerous other cases already happening today, but I believe I have made my point. For those who want to know more, I recommend Cathy O’Neil’s book Weapons of Math Destruction, which covers some of these scenarios, or the talk based on the book, “The era of blind faith in Big Data must end”.
Furthermore, if this accountability does not come via the market, it will necessarily come through the coercive route of state regulation; the latter is at this moment being developed by numerous governments worldwide to impute accountability to both companies and individuals.
This can be seen in numerous observatories and think tanks such as The AI4EU Observatory, in some OECD recommendations regarding Artificial Intelligence, and in recent guidelines released by national AI strategies in countries like Estonia, Finland, Germany, China, the United States, and France, as well as by the European Commission itself, which has clearly stated that it will heavily regulate AI from the perspective of risk and transparency.
This ultimately means that errors in a system that interacts directly with human beings will trigger a chain of consequences totally distinct from what we have today.
Given this extremely complex scenario, we can deduce that if the era of the “analyst-with-a-script-on-their-own-machine” has not ended, it is on the way to ending much faster than we can imagine; whether through professionalism and awareness, or through coercion, threats, and/or fiduciary losses.
And do not be fooled by those who say that you are just “a person who must follow orders” and that nothing will happen. The moment your company has any kind of civil, criminal, or public relations problem, you will be co-responsible. There is already precedent of an engineer going to jail because of bad practices within their craft. And here it will not be a question of “if”, but of “when” this reaches software engineering in ML.
The message I want to leave here is not one of despair, nor an inducement to corporate confrontation. What I want to leave as a final message is simply that we must have situational awareness of this march of accountability/responsibility and why it will be inevitable.
In other words: Critical thinking is an intrinsic part of the job, you are responsible for what you do, and the value of this is already embedded in your salary.
Core Machine Learning
The first time I had contact with the Core Machine Learning approach was in mid-2015 at the Strata Data Conference and continuing in 2016 through some talks by Hussein Mehanna. However, it was only in 2017 at Facebook @Scale after contact with people from the industry that I could understand a bit more what this approach was.
Not that there is a formal definition, but basically a Core Machine Learning team would be responsible for developing Machine Learning platforms within the Core Business of organizations; whether embedding algorithms in existing platforms or delivering inference/prediction services via APIs.
Part of this team’s mission would be to deal directly with all machine learning application initiatives within the company’s main activity. This ranges from applied research, adoption of software engineering practices in ML, to the construction of the infrastructure part of these applications.
Thinking about the new economy that is here to stay, in my view, we are in the middle of a transition of product development paradigms.
On one side we have a paradigm that focuses on building static applications that are concerned with business flows. On the other side we have a paradigm that inherits the same characteristics but uses data to leverage these applications.
Obviously there is much hype and much solutionism around ML, but I am talking here about companies that manage to apply ML opportunistically and pragmatically in building these applications.
In other words: the algorithm on the platform becomes the product itself.
Let’s see some examples of platforms where the algorithm is the product:
- Google: PageRank is an algorithm that was originally born from a master’s thesis and subsequently became Google’s core business;
- Spotify: Discover Weekly is a good example of a Machine Learning feature that ended up becoming a product within the platform;
- Netflix: Needless to say that recommendation, personalization, and search are the name of the game for them;
- Uber: The mobility service giant uses a mixture of classic methods and Deep Learning to perform planning and forecasting for its marketplace;
- Facebook: A large part of revenue originates from advertising auctions within the platform. Auction arbitrage uses some Machine Learning, as we can see in the great talk by Eric Sodomka called “On How Machine Learning and Auction Theory Power Facebook Advertising”;
- iFood: The Brazilian food delivery giant uses a hybrid system of rules and machine learning in one of the platform’s most sensitive processes, fighting fraud;
- OLX: The previously innocent classifieds platform now runs part of its business on machine learning, used massively in its recommendation and matching systems;
- Movile: A large part of Movile’s revenue flow in the past was sustained by a machine learning system that did monitoring and alarm work using classification on time series; and
- Quinto Andar: While for some the platform is just a real estate classifieds site, behind the scenes there are machine learning systems that help customers find their next home.
These are some of the most famous public examples of machine learning in companies’ core business, both in Brazil and elsewhere.
A very interesting way to understand how some algorithms helped in leveraging products can be seen in the paper Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective by Facebook:

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
Of course, we know that internally things are not quite so simple, but we can get an idea of how vital aspects for the Facebook product depend essentially on the implementation of Core Machine Learning.
It may seem the same at first, but the main difference between the duties of a Data Science team and a Core Machine Learning team is that while the former generally focuses on analysis and modeling, the latter puts all of this to work in a scalable and automated way within the core of the business.
However, given that Core Machine Learning would ideally be a team/approach that leverages the core business through the application of ML, I will now talk a bit about how all of this is operationalized.
MLOps - Machine Learning Operations
In Software Engineering there is a very high degree of maturity in the way applications are built and in their tooling: excellent IDEs, frameworks that handle inversion of control and dependency injection well, mature and battle-tested development methodologies, deployment tools that greatly simplify the CI/CD process, and observability tools that facilitate application monitoring.
In contrast, machine learning shows an abyss in maturity regarding the adoption of these practices, compounded by the fact that machine learning engineers spend most of their time working with data artifacts such as models and datasets.
Some of these artifacts and concerns (a non-exhaustive list) are:
- Data Science Analyses;
- Data extraction pipelines and feature generation via Data Engineering;
- Versioning of data that generate the models;
- Tracking of model training;
- Tracking of hyperparameters used in experiments;
- Versioning of Machine Learning models;
- Serialization and promotion of models to production;
- Maintenance of privacy of data and the model;
- Training of models considering security countermeasures (e.g. adversarial attacks);
- Monitoring of model performance, given these artifacts’ intrinsic tendency to degrade over time (data/model drift).
One of the consequences of so many distinctions in terms of processes between these areas is that the operationalization of these resources must also be done in a totally distinct way.
In other words, maybe DevOps might not be enough in these cases.
A figure that well summarizes this point is from Luke Marsden’s talk called “MLOps Lifecycle Description” where he places the difference between these two areas as follows:

The idea is that while traditional software engineering deals with functionality and has code as the materialization of its flows, in the Machine Learning Operations (MLOps) approach the same concerns exist, with the addition of many moving parts (data, models, and metrics) and the operationalization of all these aspects [2].
That is, the operationalization of this development and deployment flow requires a new way of delivering these solutions in an end-to-end manner.
For this, a proposal of how an end-to-end application would be considering these operational aspects of ML is presented in the article “Continuous Delivery for Machine Learning: Automating the end-to-end lifecycle of Machine Learning applications” within the continuous delivery perspective:
Continuous Delivery for Machine Learning (CD4ML) is a software engineering approach in which a cross-functional team produces machine learning applications based on code, data, and models in small and safe increments that can be reproduced and reliably released at any time, in short adaptation cycles.
In the same article, there is also a figure of what an end-to-end flow of a machine learning platform would look like:

And with these new layers of complexity, combined with the limited software engineering training of a large portion of data scientists, it becomes clear that the spectrum of potential problems in delivering machine learning applications is much larger.
However, so far we have discussed very high-level aspects such as the impact of ML systems, aspects linked to accountability, Core Machine Learning and its responsibilities, and the MLOps approach.
But I want to deepen the level a bit more and enter into some more specific points where MLOps has a more direct action; that is, shed some light on the dark trail where SysOps, DevOps, Software Engineering, and Data Science generally would not enter.

Source: Christian Collins - shades of mirkwood
Complexity in Machine Learning Systems
In the classic paper Hidden Technical Debt in Machine Learning Systems, there is an image that crystallizes well what a machine learning system really is in relation to complexity and effort for each component of this system:

Hidden Technical Debt in Machine Learning Systems
Even without a direct mention of MLOps, the article has some considerations regarding the specific problems of machine learning systems that would result in technical debts and other problems that would potentially leave these applications more fragile in terms of scalability and maintenance.
I decided to take some of the seven points from the article and give some practical examples. The idea is to show an MLOps approach in some scenarios (hypothetical or not) as we can see below:
The erosion of boundaries due to complex models
- Is there isolation and encapsulation of the code that generated the models? If something changes in one of the components (e.g. data extraction, variable distribution analysis, model training, hyperparameter experimentation, model and API deployment, etc.), what will be the impact on the performance of the production model?
- Imagining that your team built an internal NLP library that uses a specific set of stopwords and stemming rules: can modifications in these components generate a negative cascade effect on the performance of other models?[3]
- In one of your APIs, you noticed an increase in the volume of requests. The reason: another team was hitting the production API with a client application signature to perform an innocent “heartbeat”. Having identified this, how do you separate real data from invalid calls to the API for the subsequent retraining of the model?[3]
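As a sketch of how that last separation might be done, assuming the request logs carry a client signature field (the field and signature names here are hypothetical):

```python
# Signatures of known synthetic callers, e.g. health checks from other teams.
HEARTBEAT_SIGS = {"internal-healthcheck/1.0"}


def split_real_traffic(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate genuine API calls from synthetic heartbeats before retraining.

    Each record is assumed to have a "client_sig" key identifying the caller.
    Returns (real, synthetic) so the synthetic traffic can still be audited.
    """
    real, synthetic = [], []
    for record in records:
        if record.get("client_sig") in HEARTBEAT_SIGS:
            synthetic.append(record)
        else:
            real.append(record)
    return real, synthetic
```

The point is less the filtering itself and more that without a log signature identifying callers, this separation becomes guesswork.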
Data dependencies cost more than code dependencies
- You put into production an ensemble of models in a Local Classifier per Parent Node architecture. Your ensemble depends on a class ontology for a text classification task. What can happen to your model’s performance if the business team changes this ontology?
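One hedge against this kind of silent data dependency is to pin a fingerprint of the ontology at training time and refuse to serve when it drifts. A minimal sketch (the function names are illustrative):

```python
import hashlib
import json


def ontology_fingerprint(ontology: dict) -> str:
    """Stable hash of the class ontology; changes whenever the ontology is edited."""
    canonical = json.dumps(ontology, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:16]


def check_ontology(ontology: dict, pinned: str) -> None:
    """Fail fast if the live ontology differs from the one the model was trained on."""
    fp = ontology_fingerprint(ontology)
    if fp != pinned:
        raise RuntimeError(
            f"Ontology changed (got {fp}, model trained with {pinned}); retrain first."
        )
```

A failed check at deploy time is far cheaper than a silent accuracy drop discovered weeks later.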
Feedback Loops
- The product team asked for an experimentation strategy with Multi-Armed Bandits across n models. How is data from losing strategies being isolated (given that strategies affect present data and future training)? Is there any log signature that identifies these records?[3]
- A recommender system returns a list of items ordered by predicted relevance to the user. However, the model’s nDCG is very low. How long would it take you to discover that the reason is that the Front-End, instead of respecting the ranking received from the recommender system, is re-sorting it alphabetically? What would a test or feedback loop between the recommender system and the Front-End look like in this case?[3]
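A cheap guard for that last scenario is to score the order the Front-End actually displayed against the model's own ranking: if the items get re-sorted alphabetically, the served-order nDCG drops below 1. A sketch, with positional relevances assumed purely for illustration:

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of a list of relevance scores."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def served_order_ndcg(model_ranking, served_order):
    """nDCG of the order actually shown to the user, scored against the
    model's ranking (the model's top item gets the highest assumed relevance)."""
    relevance = {item: len(model_ranking) - i for i, item in enumerate(model_ranking)}
    ideal = dcg([relevance[item] for item in model_ranking])
    served = dcg([relevance[item] for item in served_order])
    return served / ideal
```

Run as a contract test between the two systems, any value below 1.0 on synthetic input means the Front-End is not respecting the ranking.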
Anti-Patterns in Machine Learning Systems
- For some reason, someone in the past forked Tidyverse and placed it in the core of the ML platform. Then you discover that Tidyverse has a GPLv3 license, and now you will have to fill your application with Glue Code to work around this license so as not to create financial liability for your company, or worse: be obliged to open all the source code of that part of the application.
- A pipeline jungle would look something like this: (a) data extraction occurs via SQL inside a bash script stored in a CRON job without any orchestration, monitoring, or version control; (b) the script sends the data to the production S3, where (c) the raw data are processed by a Zeppelin notebook written entirely in Scala (with no version control), which (d) performs preprocessing and (e) finally dumps the data into another S3 directory in the account of… Staging? [3]
- Would the code review process be able to identify an increase in cyclomatic complexity due to the increase in glue code that made it practically impossible to put the model into production? [3]
- Would your code review process be able to catch basic data leakage problems, such as not using Pipelines for Cross-Validation in Scikit-Learn?
- Let’s say the data scientist put Scikit-Learn as a dependency just to calculate RMSE, instead of writing a function in numpy which alone would solve everything. Would your code review catch this abstraction debt?
- In the past the Core Machine Learning team had a very strong preference for Java. This led to a choice of an almost dead Machine Learning library in production. What would the integration of new ML engineers and new technologies look like (e.g. PyTorch, Keras, Scikit-Learn, MLExtend, etc.) given that this old platform carries numerous underlying problems (e.g. Common Smells, cost of knowledge transfer, code rewrite etc.)?[3]
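On the data-leakage point above, the fix in Scikit-Learn is to put preprocessing inside a Pipeline, so that each cross-validation fold fits the scaler only on its own training split. A minimal sketch (synthetic data, a scaler plus logistic regression standing in for a real model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Leaky version: scaling the full dataset first lets every test fold
# influence the scaler's mean and variance.
#   X_scaled = StandardScaler().fit_transform(X)
#   scores = cross_val_score(LogisticRegression(), X_scaled, y, cv=5)

# Correct version: the scaler is re-fit inside every training fold.
model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)  # no leakage across folds
```

This is exactly the kind of one-line difference a code review should be trained to catch.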
Configuration Debt
- Each of your ML microservices has its own local logging configuration, and sending these logs to an ELK stack would mean redoing scripts and deployments in every one of these services.
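One way out is a single shared logging configuration that every microservice loads at startup, so pointing logs at an ELK stack means changing one artifact rather than redeploying ad-hoc scripts everywhere. A hypothetical sketch (the environment variable name and log format are assumptions):

```python
import logging
import logging.config
import os

# Fallback config used when no shared config file is deployed.
DEFAULT_CONFIG = {
    "version": 1,
    "formatters": {
        "jsonish": {
            "format": '{"ts":"%(asctime)s","svc":"%(name)s","lvl":"%(levelname)s","msg":"%(message)s"}'
        }
    },
    "handlers": {
        "stdout": {"class": "logging.StreamHandler", "formatter": "jsonish"}
    },
    "root": {"handlers": ["stdout"], "level": "INFO"},
}


def configure_logging(service_name: str) -> logging.Logger:
    """Load the shared logging config if present, else a sane default."""
    cfg_path = os.environ.get("LOG_CONFIG")  # assumed convention across services
    if cfg_path and os.path.exists(cfg_path):
        logging.config.fileConfig(cfg_path)
    else:
        logging.config.dictConfig(DEFAULT_CONFIG)
    return logging.getLogger(service_name)
```

With every service calling the same entry point, rerouting logs becomes a configuration change instead of n redeployments.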
Dealing with changes in the external world
- The ML system that does anomaly detection/classification for alarms and revenue monitoring is receiving an increased volume of requests. After some time, the system starts firing numerous revenue-drop alerts, paging developers to solve the problem. Then you discover the reason for the alerts: the marketing team ran a one-off campaign that inorganically increased revenue, and the classifier “learned” that those revenue levels were the “new normal”. [3]
- Your recommender system has been offering the same out-of-catalog items for 15 days in a row, not only creating a horrible user experience but also negatively affecting revenue. The reason? There is no monitoring of data (filebeat) or application metrics (metricbeat). [3]
Other areas of technical debt related to Machine Learning include:
- Data testing debt, due to lack of filebeat;
- Reproducibility debt. Example: the use of a library that, by design, does not allow setting a random seed to ensure experiment reproducibility. [3]
- Process debt, such as recurring friction between product and ML teams due to the use of project management frameworks that do not take into account the specificities of data teams (here is a good example of how this can be done well);
- Cultural debt, which happens when sometimes we have conflicts in basic engineering principles, such as reproducibility, pragmatism, simplicity in platform and model development, observability, and controllability. [3]
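On the reproducibility debt above, the minimum bar (when the library allows it at all) is pinning every random seed at the start of an experiment. A sketch using only the standard library; in a real project numpy and any framework seeds (torch, tensorflow) would be set in the same function:

```python
import random


def seed_everything(seed: int = 42) -> None:
    """Pin pseudo-random state so an experiment can be rerun identically.

    In a real project, also call np.random.seed(seed), torch.manual_seed(seed),
    etc., here, so there is a single place that controls all randomness.
    """
    random.seed(seed)


seed_everything(7)
first = [random.random() for _ in range(3)]
seed_everything(7)
second = [random.random() for _ in range(3)]
assert first == second  # same seed, same draws: the run is reproducible
```

A library that makes this impossible by design is exactly the reproducibility debt the bullet describes.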
The points above were some examples of how machine learning systems carry intrinsic complexities that require very specific skill sets and must be taken into account in their operationalization.
FINAL CONSIDERATIONS
We are on the march to have more and more Machine Learning systems involved directly or marginally in companies’ core business.
With the increased impact of these systems on people’s lives, society, and businesses, it is a matter of time before we have accountability protocols if something goes out of control; especially in aspects linked to fairness, transparency, and explainability of these systems and algorithms.
Within this, it becomes increasingly clear that the era of the “analyst-with-a-script-on-their-own-machine” has its days numbered when we talk about platforms that have interactivity with people.
As long as Machine Learning systems lack software engineering’s level of maturity in development, deployment, and operationalization, and given their many specific aspects, there is an avenue for the growth of what is known today as MLOps, or Machine Learning Operations.
The MLOps approach exists not just to deal with aspects of infrastructure or software development; these teams meet a still-latent demand to eliminate or mitigate the problems and debts intrinsic to Machine Learning development.
NOTES
[1] - The terms “Data Scientist”, “Systems/Platforms”, “Product Manager”, “Accountability”, and “Fairness” will be used throughout this text.
[2] - For those interested, Luke Marsden wrote a kind of MLOps Manifesto where some of these ideas are present.
[3] - Events of which I was a witness or that happened to me directly.
LINKS AND REFERENCES
- Facebook @ Scale Conference 2017 - Machine Learning Track
- Ingredients for Successful AI Platforms
- On How Machine Learning and Auction Theory Power Facebook Advertising
- Operationalizing ML at Scale
- Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
- MLOps lifecycle description
- Machine Learning: The High-Interest Credit Card of Technical Debt
- Continuous delivery for machine learning
- Hidden Technical Debt in Machine Learning Systems
- How to deliver on Machine Learning projects
- A developer goes to a DevOps conference
- Hacker News Thread: A developer goes to a DevOps conference
- Data Science is Boring: How I cope with the boring days of deploying Machine Learning
- Data Science Infrastructure and MLOps
- Lessons Learned from Building Scalable Machine Learning Pipelines