Machine Learning and the Swiss Cheese Model: Active Failures and Latent Conditions

TL;DR: Problems will always exist. A reflective, systematic posture, combined with an action plan, has always been and always will be the way to resolve them.

I was writing a post about the importance of post-mortems in machine learning and noticed that this particular part was growing longer than the main point of that post. So I decided to break it out into its own post, with a bit more focus and detail.

Machine Learning (ML) and Artificial Intelligence (AI) applications are advancing into increasingly critical domains such as medicine, aviation, banking, and investments, among others.

These applications make automated decisions daily and at high scale, shaping not only how industries operate but also how people interact with the platforms built on these technologies.

Given that, it is essential that the engineering culture in ML/AI increasingly incorporates and adapts concepts such as reliability and robustness, which are taken for granted in other engineering fields.

One way to achieve this adaptation is to understand the causal factors that can raise the risk of system unavailability.

Before proceeding, I recommend reading the post Accountability, Core Machine Learning, and Machine Learning Operations (or its English version) which discusses ML applications in production and the importance of engineering in building these complex systems.

The idea here is to talk about active failures and latent conditions using the Swiss Cheese Model in a simple way. The goal is to show how these two factors are linked in the chain of events leading to unavailability and/or catastrophes in ML systems.

But before that, let's look at why studying failures can be an alternative path to improving reliability, and at the "success stories" we see every day on the internet.

Survivorship Bias and Learning from Failure

Today on the internet, there is a myriad of information on practically any technical area. With all the hype in ML and its growing adoption, this information materializes in the form of tutorials, blog posts, discussion forums, MOOCs, Twitter, and other sources.

However, an attentive reader might notice a certain pattern in many of these stories: most of the time, they are cases of something that (a) went extremely well, (b) generated revenue for the company, (c) saved X% in efficiency, and/or (d) the new technology solution was one of the greatest technical marvels ever built.

This results in claps on Medium, posts on Hacker News, articles in major technology portals, blog posts that become technical references, paper after paper on arXiv, talks at conferences, etc.

Right away, I want to say that I am a great enthusiast of the idea that “intelligent people learn from their own mistakes, and wise people learn from the mistakes of others.” These resources, especially technical blog posts and conferences, gather a high level of extremely valuable information from people in the technical trenches.

This bazaar of ideas is extremely healthy for the community as a whole. It is burying the old gatekeeping model that some conference consultancies rode for years, spreading misinformation and causing numerous companies to waste mountains of money. It is also helping to end the harmful cult of technology personalities, since now anyone can have a voice.

However, what many of these posts, conference talks, papers, and articles generally don't mention are the things that went, or are going, very wrong during the development of these solutions. This is a problem in itself, since we only see the final result, not how that result was produced or the failures and errors committed along the way.

Doing a simple reflection exercise, it is understandable that very few people share the mistakes they made and the lessons they learned, given that today, especially on social media, the message gets amplified and distorted far more easily.

Admitting mistakes is not easy. Depending on the psychological maturity of the person who made the mistake, a mountain of feelings such as embarrassment, inadequacy, anger, shame, and denial can accompany the error, sometimes to the point where a mental health professional needs to step in to support that person.

From a company’s point of view, the image that might remain in terms of public relations is one of corporate disorganization, bad engineering teams, technical leaders who don’t know what they are doing, etc. This can affect, for example, recruitment efforts.

These points imply that (1) a large share of these problems may be happening at this very moment and simply being suppressed, and (2) there is likely a large survivorship bias in these posts, talks, and papers.

There is nothing wrong with how companies frame their accounts; still, a bit of skepticism and pragmatism is always healthy. For every success story, there will always be countless teams that failed miserably, companies that went bust, people who were fired, and so on.

But what does all this have to do with failures and why understand their contributing factors?

The answer: because your team/solution must first be able to survive catastrophic situations for the success story to exist. With survival as the motivation for increasing the reliability of teams and systems, understanding errors becomes an attractive form of learning.

And when there are scenarios of small violations, suppression of errors, absence of procedures, incompetence, imprudence, or negligence, things go spectacularly wrong.

Of course, these poorly written lines are not an ode to catastrophe or disaster porn.

However, I want to present another point of view: there is always a lesson to be learned from what goes wrong. Companies/teams that maintain an introspective attitude toward the problems that happen, or analyze the factors that may contribute to an incident, not only reinforce a healthy learning culture but also promote an engineering culture more oriented toward reliability.

Moving to the practical point, I will comment on a risk management tool (mental model) called the Swiss Cheese Model, which helps in understanding the causal factors contributing to disasters in complex systems.

The Swiss Cheese Model

If I had to give an example of an industry where reliability is considered a reference, it would certainly be the aviation industry [N2].

For every catastrophic event that occurs, there is a meticulous investigation to understand what happened and then address the contributing and determining factors, so that the same kind of catastrophic event never happens again.

In this way, aviation ensures that, by applying what was learned from each catastrophic event, the entire system becomes more reliable. It is no coincidence that, even with the growth in the number of flights (39 million flights in 2019), the number of fatalities has been falling year after year.

One of the most used tools in air accident investigation for risk analysis and causal aspects is the Swiss Cheese Model.

The model was created by James Reason in the article "The contribution of latent human failures to the breakdown of complex systems," where he laid out its framework (though without using the term directly). It is only in the later paper "Human error: models and management" that the model appears more explicitly.

The justification of the model by the author is made considering a scenario of a complex and dynamic system as follows:

Defenses, barriers, and safeguards occupy a key position in the system approach. High-tech systems have many defensive layers: some are engineered (alarms, physical barriers, automatic shutdowns, etc.), others rely on people (surgeons, anesthetists, pilots, control room operators, etc.), and others depend on procedures and administrative controls. Their function is to protect potential victims and assets against local hazards. Most of the time, these layers do this very effectively, but there are always weaknesses.

In an ideal world, each defensive layer would be intact. In reality, however, they are more like slices of Swiss cheese, with many holes—though, unlike the cheese, these holes are continually opening, closing, and shifting location. The presence of holes in any one “slice” does not normally cause a bad outcome. Generally, this can happen only when the holes in many layers momentarily align to permit a trajectory of accident opportunity—bringing hazards into damaging contact with victims.

Human error: models and management

A visualization of this alignment can be seen in the graph below:

Source: Understanding models of error and how they apply in clinical practice

In this case, each slice of Swiss cheese would be a line of defense with engineered layers (e.g., monitoring, alarms, code push locks in production, etc.) and/or procedural layers involving people (e.g., cultural aspects, committer training and qualification, rollback mechanisms, unit and integration tests, etc.).

In what the author proposed, the holes in each cheese slice arise from two kinds of factors, active failures and latent conditions (a small sketch of how these holes can align follows the definitions below), where:

  • Latent conditions are situations that reside intrinsically within the system; they are consequences of decisions made in design and engineering, by those who wrote the rules and procedures, and even at the highest hierarchical levels of an organization. Latent conditions can produce two types of adverse effect: error-provoking situations and long-lived vulnerabilities. In other words, the solution has a design that raises the probability of high-negative-impact events, acting as a causal or contributing factor.

  • Active failures are unsafe acts or small transgressions committed by people in direct contact with the system—acts such as slips, lapses, distortions, omissions, mistakes, and procedural violations.
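
To make the metaphor a bit more concrete, the sketch below treats each line of defense as having some probability of a "hole" being open at a given moment; an accident trajectory only exists when the holes of all layers align. This is a minimal illustration, and the layer names and probabilities are invented for the example, not taken from any real system.

```python
import random

# Hypothetical defense layers and the probability that each one has a
# "hole" open at a given moment (numbers are made up for illustration).
LAYERS = {
    "code_review": 0.10,        # an unreviewed change slips through
    "automated_tests": 0.20,    # the test suite misses the defect
    "monitoring_alerts": 0.30,  # the anomaly is not detected in time
    "on_call_operator": 0.05,   # the human safeguard fails to act
}

def accident_occurs() -> bool:
    """An accident trajectory exists only when every layer has an open hole."""
    return all(random.random() < p for p in LAYERS.values())

def estimate_accident_probability(trials: int = 100_000) -> float:
    """Monte Carlo estimate of how often the holes align across all layers."""
    return sum(accident_occurs() for _ in range(trials)) / trials

if __name__ == "__main__":
    # With independent layers this is roughly 0.10 * 0.20 * 0.30 * 0.05 = 0.0003:
    # individually imperfect layers still combine into a strong defense.
    print(f"Estimated accident probability: {estimate_accident_probability():.5f}")
```

In reality, of course, the holes are neither independent nor static: as the quote above notes, they keep opening, closing, and shifting, and latent conditions tend to open holes in several layers at once, which is exactly why no single barrier can be trusted in isolation.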

If latent conditions are linked to engineering and product aspects, active failures are much more related to human factors. A great framework for analyzing human factors is the Human Factors Analysis and Classification System (HFACS).

The HFACS posits that human failures in complex socio-technical systems happen at four different levels, as seen in the image below:

Source: Human Factors Analysis and Classification System (HFACS)

The idea here is not to discuss these concepts in depth but to draw a parallel with machine learning, where some of these aspects will be addressed. For those who want to know more, I recommend the HFACS material for an in-depth study of the framework.

Now that we have clear concepts of active failures and latent conditions, let's do a reflection exercise using some examples from ML.

Managing Active Failures and Latent Conditions in Machine Learning

To transpose these factors to the ML arena in a more concrete way, I will use, for educational purposes, some examples of what I have seen happen, what has happened to me, and some points from the excellent article by Sculley et al., "Hidden technical debt in machine learning systems."

In general, these sets of factors (non-exhaustive) would be represented as follows:

Latent Conditions

  • Absence of a Code Review culture, where analysis/report code, model training, and/or APIs go to production without even a second critical look to evaluate and approve what is being deployed (e.g., model training code involving variables like gender and race, unintelligible code or code without a linter, model training code that is difficult to maintain, or failure to detect serious logical problems as in the London Whale and Knight Capital episodes).

  • Culture of improvised technical arrangements (workarounds): Using improvised technical arrangements (hacks) is genuinely necessary in some situations. However, a culture oriented toward workarounds [N3] in a field with intrinsic complexities like ML tends to embed potential weaknesses into ML systems and makes identifying and correcting errors much slower.

  • Absence of monitoring and alerting: In ML platforms, some factors need specific monitoring, such as data drift (a change in the distribution of the input data relative to training), model drift (degradation of the model's performance on new data), and adversarial monitoring, which checks whether the model is being probed for information extraction or subjected to adversarial attacks (a minimal sketch of a drift check appears right after this list).

  • Resume-Driven Development (RDD): when engineers or teams put a tool into production just to have it on their resume, prospecting a future employer. RDD's main characteristic is creating unnecessary difficulty in order to sell an ease that would not be needed if the right thing had been done in the first place.

  • Democracy-style decisions by less informed people instead of consensus among experts and risk-takers: The point here is simple: key decisions should only be made by (a) those directly involved in building and operating the systems, (b) those financing and/or bearing the risk, and (c) those with the technical skill to weigh the pros and cons of each aspect of the decision. The reason is that these people have skin in the game, or at least know the strengths and weaknesses of what is being decided. Fabio Akita has made an interesting argument along these lines, showing how bad it can be when poorly informed people without skin in the game make the decisions. Democracy among practicing professionals does not exist; this collectivist corporate neo-democracy has no face, and therefore no accountability when something goes wrong. Democracy in technical matters, in the terms described above, is a latent condition. Something wrong will never become right just because a majority decided it.
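
Regarding the monitoring and alerting bullet above, here is a minimal sketch of what a data drift check could look like, using a two-sample Kolmogorov-Smirnov test between the training distribution and the live distribution of a single feature. The feature name, the threshold, and the alerting hook are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Illustrative threshold; in practice it should be tuned per feature and use case.
DRIFT_P_VALUE_THRESHOLD = 0.01

def feature_has_drifted(train_values: np.ndarray, live_values: np.ndarray) -> bool:
    """Two-sample Kolmogorov-Smirnov test between training data and live data."""
    result = stats.ks_2samp(train_values, live_values)
    return result.pvalue < DRIFT_P_VALUE_THRESHOLD

# Hypothetical usage: the 'income' feature drifted upward relative to training.
rng = np.random.default_rng(42)
train_income = rng.normal(loc=5_000, scale=1_200, size=10_000)
live_income = rng.normal(loc=6_500, scale=1_200, size=10_000)

if feature_has_drifted(train_income, live_income):
    # In a real platform this would fire an alert, not just print.
    print("ALERT: data drift detected for feature 'income'")
```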

Active Failures

  • Unreviewed code going to production: Unlike good traditional software engineering, where a code review layer ensures everything meets quality standards, in ML this is a practice that still has much maturing to do, given that many Data Scientists do not have a background in programming and version control. Another complicating factor is that many of the tools in the data scientist's workflow make code review nearly impossible (e.g., knitr/R Markdown for R and Jupyter Notebooks for Python).

  • Data Leakage in model training: Mixing test/validation samples into training is one of the most common active failures, especially when using cross-validation. In this post, Devin Soni explains other sources of leakage, such as duplicate data and implicit leakage with temporal data; a practical example of how to combat data leakage can be seen here, in an excellent Kaggle tutorial (a minimal sketch covering this point and the reproducibility one below appears after this list).

  • Lack of reproducibility/replicability: This can range from not setting a random seed before splitting the training/test sets and training the model, to solutions that, by design, do not allow even a certain degree of reproducibility.

  • Glue code: In this category I place the code written during prototyping and the MVP that goes to production exactly as it was created. One thing I have seen happen a lot here is applications depending on numerous packages that require plenty of glue code for even a minimal "integration". The code becomes so fragile that a change in a dependency (e.g., a simple source code update) breaks practically the entire API in production.
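
Connecting the data leakage and reproducibility bullets above, here is a minimal sketch using scikit-learn. Keeping the preprocessing inside a Pipeline means that each cross-validation fold fits the scaler only on its own training portion (instead of leaking test-fold statistics), and fixing random_state makes the splits reproducible. The dataset and parameters are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic dataset; fixed random_state for reproducibility.
X, y = make_classification(n_samples=1_000, n_features=20, random_state=42)

# Leaky pattern (avoid): fitting the scaler on the full dataset before splitting,
# which lets test-fold statistics leak into training.
# Safe pattern: put preprocessing inside the pipeline, so each fold refits it
# only on that fold's training data.
model = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1_000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)
print(f"Accuracy per fold: {np.round(scores, 3)}")
```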

An Unavailability Scenario in an ML System

Let’s imagine that a fictitious financial company called “Leyman Brothers” had an outage where its stock trading platform was unavailable for 6 hours, causing massive losses for some investors.

After building a proper Post-Mortem, the team reached the following narrative regarding the determining and contributing factors in the outage:

The outage was caused by an out-of-memory error stemming from a bug in the ML library.

The error is known to the library's developers, and there has been an open ticket about the problem since 2017, but it has not been fixed to date (Latent Condition).

It was also found that the time to respond and resolve was excessively long because the ML platform had no alerting, heartbeat, or monitoring mechanisms. Without diagnostic information, the problem took longer than necessary to correct (Latent Condition).
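
As an aside, a heartbeat does not need to be sophisticated: a periodic liveness signal that also reports a couple of health metrics would already have surfaced the memory pressure before the outage. The sketch below is a minimal, hypothetical example; the psutil dependency, the threshold, and the alert destination are assumptions.

```python
import time

import psutil  # assumed to be available; any host/process metrics library would do

MEMORY_ALERT_THRESHOLD = 85.0    # percent of system memory in use; illustrative value
HEARTBEAT_INTERVAL_SECONDS = 60  # illustrative value

def heartbeat() -> None:
    """Periodically emit a liveness signal and alert on high memory usage."""
    while True:
        memory_percent = psutil.virtual_memory().percent
        # In a real platform this would go to a monitoring system, not stdout.
        print(f"heartbeat ok, system memory at {memory_percent:.1f}%")
        if memory_percent > MEMORY_ALERT_THRESHOLD:
            print("ALERT: memory usage above threshold, page the on-call")
        time.sleep(HEARTBEAT_INTERVAL_SECONDS)
```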

During debugging, it was found that the developer responsible for the code segment where the error originated knew of alternative fixes but did not apply them, because the fix would involve adopting another library in a programming language he had not mastered, even though that language was already used in other parts of the technology stack (Active Failure).

Finally, it was also found that the code had gone straight to production without any kind of review. The GitHub project has no "locks" (such as branch protection rules) to prevent unreviewed code from reaching production (Active Failure arising from a Latent Condition).

Transposing the event from the narrative to the Swiss Cheese model, we would visually have the following image:

Source: Adapted from Understanding models of error and how they apply in clinical practice

In our Swiss Cheese, each slice would be a layer or line of defense where we have aspects such as system architecture and engineering, the technology stack, specific development procedures, the company’s engineering culture, and finally, people as the last safeguard.

The holes, in turn, would be the failing elements in each of these defense layers, which can be active failures (e.g., committing directly to master because there is no code review) or latent conditions (e.g., the buggy ML library, the lack of monitoring and alerting).

In an ideal situation, after an unavailability event, all latent conditions and active failures would be addressed, and there would be an action plan for solving the problems so that the same event never happened again in the future.

Despite the high-level narrative, the main point is that outages in complex and dynamic systems never happen due to an isolated factor, but rather due to the conjunction and synchronization of latent conditions and active failures.

FINAL CONSIDERATIONS

Of course, there is no panacea for risk management: some risks and problems can be tolerated, and often the time and resources needed to apply fixes simply do not exist.

However, when we talk about mission-critical systems that use ML, it is clear that a myriad of ML-specific problems can occur on top of the usual engineering problems.

The Swiss Cheese Model is a risk management tool widely used in aviation that offers a simple way to map the latent conditions and active failures behind events that can lead to catastrophic failures.

Understanding the contributing and determining factors in failure events can help eliminate or minimize potential risks and, consequently, reduce the impact along the chain of consequences of those events.

NOTES

[N1] - The objective of this post is exclusively to communicate with Machine Learning Engineering, Data Science, Data Product Management, and other areas that truly have a culture of improvement and continuous feedback. If you and/or your company understand that concepts of quality, robustness, reliability, and learning are important, this post is dedicated especially to you.

[N2] As this article was being reviewed, a story appeared about the Boeing 787, which must be powered down every 51 days because its core system cannot flush obsolete data from some critical systems that affect airworthiness. That's right: a Boeing airliner needs the same kind of "have you tried turning it off and on again?" reboot to prevent a catastrophic event. But this also shows that, even with a latent condition present, it is possible to operate a complex system safely.

[N3] Hack Culture + eXtreme Go Horse (XGH) + Jenga-Oriented Architecture = Outage Powerhouse.

[N4] - Special thanks to Captain Ronald Van Der Put from the Teaching for Free channel for the kindness of providing me with materials related to aviation safety and accident prevention.

REFERENCES

Reason, James. “The contribution of latent human failures to the breakdown of complex systems.” Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327.1241 (1990): 475-484.

Reason, J. “Human error: models and management.” BMJ (Clinical research ed.) vol. 320,7237 (2000): 768-70. doi:10.1136/bmj.320.7237.768

Morgenthaler, J. David, et al. “Searching for build debt: Experiences managing technical debt at Google.” 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 2012.

Alahdab, Mohannad, and Gül Çalıklı. “Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software.” International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2019.

Perneger, Thomas V. “The Swiss cheese model of safety incidents: are there holes in the metaphor?.” BMC health services research vol. 5 71. 9 Nov. 2005, doi:10.1186/1472-6963-5-71

“Hot cheese: a processed Swiss cheese model.” JR Coll Physicians Edinb 44 (2014): 116-21.

Sculley, D., et al. "Hidden technical debt in machine learning systems." Advances in Neural Information Processing Systems 28 (2015).

Breck, Eric, et al. "What's your ML Test Score? A rubric for ML production systems." (2016).

SEC Charges Knight Capital With Violations of Market Access Rule

Blog da Qualidade - Modelo Queijo Suíço para analisar riscos e falhas.

Machine Learning Goes Production! Engineering, Maintenance Cost, Technical Debt, Applied Data Analysis Lab Seminar

Nassim Taleb - Lectures on Fat Tails, (Anti)Fragility, Precaution, and Asymmetric Exposures

Skybrary - Human Factors Analysis and Classification System (HFACS)

CEFA Aviation – Swiss Cheese Model

A List of Post-mortems

Richard Cook – How Complex Systems Fail

Airbus - Hull Losses

Number of flights performed by the global airline industry from 2004 to 2020