livenessProbe and Deadlocks in Machine Learning APIs on Kubernetes
2022 Jan 01
Weeks ago, while investigating an outage in one of our Machine Learning APIs, which serves a text classification model to other applications (i.e., model serving), we noticed that after the containers stalled they were not being destroyed and recreated.
As a result, the API entered a state of total and terminal unavailability, triggering a storm of alerts and messages from other client applications that depend on this service.
The application error itself and the downtime were not even the biggest problems; the real issue was that the container reconstruction did not happen automatically within Kubernetes.
This is even stranger given the fact that the company where I work has a specific CLI with numerous templates for Kubernetes precisely to simplify development in this type of stack and prevent this kind of scenario from happening with any application.
Knowing that this state of deadlock and/or terminal failure without pod initialization could be a configuration problem (this will become clearer in the next section), we decided to investigate what happened and how to provide a permanent solution to this problem.
But first, we had to know what caused this “hang” in our API that led to the container failure.
The Trigger and the Real Problem
The API crash was triggered by a client application managed by another development team which, due to a configuration error, initiated a sort of “involuntary stress test” on our service.
That application sent a volume of requests exceeding our current capacity by more than 1000%; this initially increased request latency in the pods and eventually caused the containers to crash.
In a way, we always expect some (few) application errors for any number of reasons, but what we did not expect was the absence of remediation through the automatic destruction and replacement of containers, as we expect with Kubernetes. That was our real problem.
During troubleshooting, a team member realized that our application did not have a livenessProbe configured. In other words, due to the lack of this parameter in the production configuration template, we were not using this Kubernetes mechanism for monitoring the application’s health (service health).
What is a livenessProbe?
Essentially, a livenessProbe performs a “proof of life” on a container to know when it needs to be restarted.
As the documentation states, “[…] the livenessProbe can help capture a deadlock situation of a running application […] and that restarting a container in such a state can help increase application availability, despite bugs”.
Without this configuration, if the container enters a state of unavailability due to the application (i.e., bugs), K8s will not attempt to bring the container back to an active state (i.e., live) via a container restart. In other words, the application can remain in a state of terminal failure or in zombie mode.
The Fix
The configuration itself is straightforward to do in Kubernetes. In our case, we used the following values:
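The values below are a sketch rather than our exact production manifest; the /health path, port, and timings are illustrative:

```yaml
# Illustrative probe configuration for a model-serving container.
# Container name, image, path, port, and timings are examples only.
containers:
  - name: text-classification-api
    image: registry.example.com/text-classification-api:latest
    ports:
      - containerPort: 8000
    livenessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 30    # wait for the model to load before checking liveness
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3        # restart only after repeated failures
    readinessProbe:
      httpGet:
        path: /health
        port: 8000
      initialDelaySeconds: 45    # only route traffic after the liveness window
      periodSeconds: 10
      timeoutSeconds: 5
      failureThreshold: 3
```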
In the application specifications, we parameterized the livenessProbe as the mechanism to perform the container’s “proof of life”; and we also used the readinessProbe, which would be something like a “proof of readiness”.[2]
The documentation defines the readinessProbe as a mechanism for verifying the readiness of containers in situations where (i) an application is temporarily unable to receive traffic for any number of reasons and (ii), at the same time, we do not want the container to be unnecessarily destroyed.
In other words, these are situations where the container is active but not ready, and because of this, we don’t want it to be destroyed due to other factors (e.g., an external situation).
In Machine Learning APIs that serve one or more models, there are situations in which the container should neither receive traffic nor be destroyed, for reasons such as:

- Model loading into memory taking longer than expected. For example, depending on the size of the corpora, FastText can generate models of up to 3GB;
- Delay in downloading weights and/or models that may live in another repository (e.g., S3 or a model registry, or models like VGG whose weights can reach 500+ MB);
- Unavailability in the Feature Store (e.g., if part of the real-time data depends on a Cassandra node that is unavailable for some reason, this can cause additional latency); and
- Dependency on applications within the same K8s node that are unavailable due to a CrashLoopBackOff caused by a setup error after deployment.
The idea here is to illustrate that despite being somewhat simpler than standard software engineering applications, these Machine Learning APIs can have a myriad of reasons for being active but not ready.
For a deeper understanding of these configuration parameters, I suggest reading the links in the references at the end of this post.
Configuration Values and Other Considerations…
Far from establishing any kind of “best practices”, given that each ML API is unique and has its own specifics and context, I wanted to share some of the reasoning behind part of our configuration choices as a reference and show what worked and what did not in our case:
Removing the use of tcpSocket on an established port in livenessProbe and readinessProbe
I personally use tcpSocket only in cases where I need guaranteed execution in the millisecond range. The advantage is that this is a cheap test from a performance standpoint, simpler to implement, and has no coupling with a specific endpoint since the check is only on the open socket, depending only on the availability of a port.
Even with these advantages, I use tcpSocket only in cases where the health-check endpoints and similar are not yet established in their final versions.
I say this because despite its simplicity, this form of probing only informs that the application has the socket available and responsive, but not whether a specific endpoint is returning what it should.
I admit that this is a poor way (anti-pattern) to couple application behavior into a probe that should be only environmental. However, most of the time I am much more interested in the triad {hardware + environment + application} than just the first two.
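For reference, a tcpSocket probe is a minimal sketch along these lines (the port is an assumption):

```yaml
# tcpSocket probes only verify that the port accepts connections;
# they say nothing about what a specific endpoint actually returns.
livenessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 10
readinessProbe:
  tcpSocket:
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 10
```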
readinessProbe with HTTP request
This is the configuration I like to use for most standard Machine Learning service applications, and in my view, it provides a result that may not be the most performant from a latency standpoint, but at least provides something much more informative regarding the application’s health.
I always use this configuration along with a health-check endpoint (I like to keep the name always as /health). This depends on each implementation and need, but at least in the cases where I have had the opportunity to work, it worked well: I always have the same place to check across all APIs (which greatly reduces my cognitive load thanks to standardization) and, furthermore, I can condition it on a response that exercises a flow emulating the application’s behavior, provided it has no external dependency whatsoever [3].
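A minimal sketch of such a /health endpoint; here I assume FastAPI and a placeholder model loader purely for illustration:

```python
from fastapi import FastAPI, Response

app = FastAPI()

def load_model():
    """Placeholder for the real model loader (e.g., FastText, scikit-learn)."""
    class _DummyModel:
        def predict(self, texts):
            return ["ok" for _ in texts]
    return _DummyModel()

model = load_model()  # loaded once at startup

@app.get("/health")
def health():
    try:
        # Exercise the real prediction path with a fixed, dependency-free input,
        # instead of just returning 200 unconditionally.
        model.predict(["health-check sample text"])
        return Response(status_code=200)
    except Exception:
        # Any failure in the prediction path reports the container as unhealthy.
        return Response(status_code=503)
```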
Permissive initialDelaySeconds in case of rolling deployment strategies
Among the various deployment strategies such as Canary Deployment, Blue-Green, Feature Flags, and others, for the applications I am currently working on I am satisfied with the rolling deployment strategy [1].
This strategy, despite providing fast feedback from production, carries a significant risk, since a bad deployment can put the entire application into an unavailable state.
In our case, even with extremely fast rolling deployment (and rollback), I am generally a bit more permissive with the initialDelaySeconds values in the livenessProbe and also in the readinessProbe.
This is because, if either probe starts failing before the application is active or ready, the livenessProbe will restart the (production) container and the readinessProbe will keep it from receiving traffic.
This can lead to a potential post-deployment situation where the previous container is being destroyed (i.e., the current functional production one) but at the same time the new container (i.e., new code) is undergoing an early restart—meaning the application could potentially get stuck in a state of total unavailability.
One mechanism to avoid this type of situation is to use the RollingUpdate strategy in the configuration (Docs), something like the following configuration:
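The exact values depend on the number of replicas, but the shape of that configuration is roughly the following (the surge/unavailable values here are illustrative):

```yaml
# Illustrative RollingUpdate settings for a Deployment.
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # create at most one extra pod during the rollout
      maxUnavailable: 0    # never remove an old pod before a new one is Ready
```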
The official documentation is very interesting in this regard and worth reading for a complete understanding.
readinessProbe > livenessProbe
I don’t know if this is a general K8s best practice or not, but empirically what I’ve seen is that the ideal configuration is for the readinessProbe to have an initialDelaySeconds value that is at least equal to or greater than that of the livenessProbe.
This is because, in situations of traffic unavailability, dependency on other services, or model loading in the API where this buffer time is needed, a readinessProbe that starts checking before the livenessProbe might fail while the application is still starting.
In other words, the application might be active but not ready to receive requests due to the reasons already mentioned regarding the readinessProbe.
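Concretely, under this heuristic the two delays end up related roughly like this (values are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30   # first confirm the process is alive
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 45   # equal to or greater than the liveness delay
```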
If there is a readinessProbe, I include a livenessProbe
In some applications where the readinessProbe needed a high timeout (e.g., downloading a model from S3 into the API, loading pre-trained model weights into memory, accessing information from another external API, etc.) but no livenessProbe was configured, we ended up with containers that had already died or become unavailable and were never restarted (the terminal failure simply remained). Because of the long timeout, this masked the unavailability.
One way to avoid this was to use the livenessProbe to check if the container was alive first, and only after the container is live, then let the readinessProbe enter and analyze readiness. This is even one of the tips in this blog post called Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot, where the author uses this same heuristic to prevent pods from being unnecessarily destroyed.
Avoid the use of external applications or complex logic in the readinessProbe
Both livenessProbe and readinessProbe are configurations whose absence can cause the problems I mentioned earlier, but whose incorrect use combined with external dependencies can cause a myriad of errors.
A classic example of how external dependencies can be harmful in this type of configuration was presented by Henning Jacobs in his post called Kubernetes livenessProbes are dangerous: if a probe depends on a database that is experiencing latency higher than the probe’s timeout, this can cause the livenessProbe to restart every pod that shares that dependency, triggering cascading failures in other systems that depend on those pods.
Another anti-example [4] would be an application that has a readinessProbe checking the integrity of the model in production via MD5 for compliance reasons.
The biggest problem with using external dependencies in the readinessProbe is that, due to convenience and ease, many people may choose to include application logic as a way to establish a service health monitoring mechanism, and we often forget that this process runs not only at pod initialization but throughout its entire lifecycle; that is, this check will occur as long as the pod is up within the specified interval. The complicating factor is that every pod restart of a production application becomes at the mercy of applications that are not managed and are subject to all the vagaries of the computer gods and Murphy’s Law.
When in doubt, do not put any external service in the readinessProbe. If it is indispensable, make sure the code itself has ways to remedy the problem rather than relying on the restart, such as circuit breakers, error handling, etc.
Final Considerations
Although the problem was triggered by an external factor and the root cause was a configuration problem due to the lack of standardization of these settings, it was an interesting journey to discover the problem, and I learned a lot during this troubleshooting. As a future direction, I’ve already included these configurations in our “ML service chassis” standard, which is an idea completely copied from Chris Richardson in his “Pattern: Microservice chassis” and will become a post for another time.
References
- Kubernetes Liveness and Readiness Probes: How to Avoid Shooting Yourself in the Foot
- Eliminate DB dependency from liveness & readiness endpoints in Kuma
Notes
[1] Simply put, a rolling deployment is a slow and gradual replacement of an application; in my case, each of the pods will be stopped, destroyed, and replaced by new pods with the new version of the production code. Despite the disadvantages of slowness for small deployments (because the entire application has to be deployed even if only a small piece of code changes) and the obvious increase in risk, I take into account the tests we always perform and the simplicity of our rollback process in case of any problem.
In our specific case, having production feedback as quickly as possible and the simplicity of rollback and implementation makes us satisfied with the choice. And since we basically only implement memory artifacts within the endpoint, we don’t have all the complication of other applications with dependency structures on other classes or components like traditional software engineering.
I gave this introduction to our deployment strategy because, since we use rolling deployment, the initialDelaySeconds affects how quickly the application becomes available. For example: if I know a rolling deployment will be quick, I can keep it at 10 seconds, meaning the liveness check is performed only after 10 seconds. If I know my deployment will take longer (e.g., due to downloading an artifact from S3, validating the artifact’s SHA in S3 according to the API version, etc.), I leave this value higher.
[2] The literal translation would be something like “sonda de vida” for livenessProbe and “sonda de prontidão” for readinessProbe in Portuguese. I kept the word “proof” just to simplify the language.
[3] Again, this is not a recommended practice from an API design point of view, but at least in our use cases, it’s not enough for the API to return “something” but to return this something “in working order”. An analogy would be like a doctor who wants to check respiratory function and instead of asking the patient if they are breathing, he uses a stethoscope, places it on the patient’s chest, and verifies it through the sound received from the device.
[4] In cases where the Machine Learning API that performs the serving has its model registry decoupled from the experiment/training manager (e.g., AWS SageMaker, Algorithmia, MLflow, and the like), ideally the MD5 of the artifact is extracted at the moment the production model is persisted in the registry, and this hash is persisted in the application (usually in a setup.py) for reference.
This value can be used in unit tests in a CI/CD pipeline for integrity checking as well as for readinessProbe to know if the artifact in the registry is the same as the one being served in the API.
A proposal for how to perform this check could be something like the following:
In a file named check_integrity.py:
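A sketch of what such a script might contain; the artifact path and the expected hash below are hypothetical, and in practice the expected value comes from wherever the application pins it (e.g., setup.py):

```python
# check_integrity.py: compares the MD5 of the served model artifact
# against the hash pinned in the application at registration time.
import hashlib
import sys

MODEL_PATH = "/app/models/text_classifier.bin"    # hypothetical artifact path
EXPECTED_MD5 = "d41d8cd98f00b204e9800998ecf8427e"  # hypothetical pinned hash

def md5_of(path: str) -> str:
    """Stream the file so large model artifacts do not blow up memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Exit code 0 marks the probe as successful; anything else as a failure.
    sys.exit(0 if md5_of(MODEL_PATH) == EXPECTED_MD5 else 1)
```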
In this case, the readinessProbe would be something like:
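A possible shape for it, using an exec probe that runs the script above (timings again illustrative):

```yaml
# Exec-based readinessProbe running the integrity check
# (shown only to illustrate the anti-example discussed above).
readinessProbe:
  exec:
    command: ["python", "/app/check_integrity.py"]
  initialDelaySeconds: 60
  periodSeconds: 30
  timeoutSeconds: 10
```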
Once again, using external checks in the readinessProbe is strongly discouraged because of all the problems already described.