Logging and Structured Events in Machine Learning APIs

Working with the operationalization of Machine Learning APIs, one of the biggest issues I encounter, regardless of whether it is a greenfield or brownfield project, is the lack of comprehensive logging or some kind of structured events for further persistence and analysis.

One of the consequences of lacking those mechanisms is that these projects have no way to debug or inspect any abnormal internal behavior.

In the cases I have worked on (or seen) where such mechanisms do exist, most of them rely on poor implementations that generate a lot of noise and force Data Scientists and/or Analytics Engineers into what I call “Conformity Hell”: having to {fetch > parse > conform > join} barely structured logs and database tables from other platforms.

Despite all the effort to make this reliable, the reality is that it takes a lot of manual work to produce quite brittle, error-prone results.

Problem: Poor response information

One of the biggest problems I see right away when I join an existing project is the impossibility of doing error/diagnostic analysis on the model and/or the API service, which blocks any attempt to retrain the model and makes those services quite opaque from an operational perspective. And this is a universal complaint from Data Scientists and Machine Learning Engineers.

However, the first thing I investigate when I get one of those complaints is how the API response is built. Imagining a scenario of a Text Classification API, what I find most of the time is a response more or less like this:
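
Something along these lines, where the field names are illustrative but the (lack of) detail is representative of what I usually find:

```json
{
  "prediction": "negative",
  "status": "success",
  "elapsed_time_ms": 13
}
```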

For any MLE who wants to implement any kind of continuous training pipeline, or even perform an error/diagnostic analysis, this API response is a real nightmare, since the scarcity of detail makes any subsequent analysis impossible.

Based on this response, some fundamental questions will remain unanswered, such as:

  • What was the input sent to the model?
  • What did the model answer in terms of probabilities?
  • Was there pre-processing before the data was sent to the model?
  • Was there some sort of A/B test variant or bandit experiment involved?
  • What is the id of the client platform that sent the request, so we can cross-reference it with its respective database?
  • What is the behavior of the model with long sentences?

I could go on with at least 40 different questions, but the point I want to make here is that ML application response payloads cannot be treated like the metal logs (e.g. CPU, memory usage, disk space) that form the main axis of what we call observability, because machine learning models are, in a rough simplification, functions that primarily denote the behaviors of trained models wrapped in a system.

In other words, we have the code, the application behavior that carries its own intricacies and complexities, and the model itself, which embeds behaviors represented in the data it was trained on along with an entire myriad of issues carried by that data.

But first, let’s take a small step back and talk briefly about logging and why it matters.

Logging in SWE

I hope this does not sound controversial, but I strongly believe it is a common understanding that logging in Software Engineering is important and critical for real-world live applications, whether it is used for debugging or for operations.

To bring more points in favor of logging, the excellent book Logging in Action by Phil Wilkins lists some reasons why we should produce logs, such as:

  • Debugging
  • Unexpected data values or abnormal conditions occurring
  • Audit and security
  • Root cause analysis
  • Determining the cause of performance issues
  • Anomaly detection
  • Operational effectiveness and troubleshooting
  • Determining when to trigger a subsequent action

However, to increase the signal-to-noise ratio of our logs, we need to follow a small set of heuristics to make them comprehensible. Some heuristics are presented in the blog post Logging is important by Giovanni Collazo, where he brings a comprehensive list of what a good log line should have (see the small sketch after the list):

  • Log Level
  • Message
  • Group
  • Context
  • Timestamp
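
As a minimal sketch of a log line carrying those five elements, here I map the Group to the logger name and pass the Context via the extra argument of Python's standard logging module; this mapping, and values like request_id and model_version, are my own assumptions for illustration:

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonLineFormatter(logging.Formatter):
    """Render each record as a single JSON line with the five elements above."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,                            # Log Level
            "message": record.getMessage(),                       # Message
            "group": record.name,                                 # Group (logger name)
            "context": getattr(record, "context", {}),            # Context (via extra=)
            "timestamp": datetime.now(timezone.utc).isoformat(),  # Timestamp
        })


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonLineFormatter())
logger = logging.getLogger("text-classification-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "prediction served",
    extra={"context": {"request_id": "abc-123", "model_version": "1.4.2"}},
)
```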

And if we expand those heuristics with some semantics regarding their characteristics, we can borrow the definitions that Gui Latrova brought in the piece Logging in Python like a PRO, where he made the following statement regarding the 3 main characteristics of a good log:

They’re descriptive in the sense that they give you a piece of information, they’re contextual because they give you an overview of the state of things at the moment, and finally, they’re reactive because they allow you to take action only after something happened (even though your logs are sent/consumed real-time, there’s not really something you can do to change what just happened).

He even describes his rule of thumb about when to log: (i) at the start of operations and flows, (ii) during the progress of any operation, and (iii) at the end of such an operation or flow. The image below illustrates this rule of thumb:

[Image: when-to-log rule of thumb]
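
A minimal sketch of that rule of thumb in Python, assuming a hypothetical predict_flow function wrapping a scikit-learn-style model:

```python
import logging

logger = logging.getLogger("text-classification-api")


def predict_flow(texts, model):
    # (i) Log at the start of the operation/flow
    logger.info("prediction flow started: %d documents received", len(texts))

    predictions = []
    for i, text in enumerate(texts, start=1):
        predictions.append(model.predict([text])[0])
        # (ii) Log the progress during the operation
        if i % 100 == 0:
            logger.info("prediction flow progress: %d/%d documents scored", i, len(texts))

    # (iii) Log at the end of the operation/flow
    logger.info("prediction flow finished: %d predictions produced", len(predictions))
    return predictions
```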

With that said, I think that for anyone who runs systems in production, those aspects are crucial to ensure that, when something happens, logging is the first mechanism to check what is going on internally in the API.

For whoever is interested in further practitioner, on-the-trenches heuristics for logging, I recommend some blog posts from Henrik Warne: Good Logging, which brings some useful ideas; Great Programmers Write Debuggable Code, where he offers some insights into why programmers do not write debuggable code; Finding Bugs: Debugger versus Logging; and Session-based Logging.

But as with everything in life, there’s no panacea and I think logging has some limitations regarding its transposability to ML applications in general.

Logging and its transposability to ML

I really like logs and I strongly believe in their general net benefit during development and in daily operations work, but after some time in the industry I realized that transposing them as-is to ML applications is not so practical, for several reasons:

  • The uncertainty generated by the model (artifact) output cannot be easily transposed into the semantics of traditional logging levels. For example, should a text classification model that receives out-of-vocabulary (OOV) words and delivers a wrong classification be logged at the ERROR, WARNING, or CRITICAL level?

  • Related to the previous point, another challenge is using internal states to evaluate errors during runtime. I did that in the past, and it is just challenging to think about all the potential scenarios that can happen when real-world data hits an ML model. On top of that, if you have internal consumer applications, it is hard to sync with every back-end/front-end team that your old state 1004 - Model Unreliable changed to the status 2002 - Ensemble uncalibrated. Models unreliable due to some new exotic implementation.

  • Differentiating between application (API) logging, metal/machine logging, and ML-specific logging is possible, but it will cause log fatigue very soon due to the deluge of information. At least for me, it greatly increases my cognitive load because of the mixed contexts.

  • The current commercial solutions for ML logging are great in vanilla scenarios, i.e. a single (artifact) scikit-learn/PyTorch/TensorFlow model, but if you have several models combined in ensembles, or some hierarchical classification with local classifiers, their usefulness is quite limited.

  • Diagnostic logging is a blessing that can turn into a curse if used indiscriminately. Will Sargent discussed how diagnostic logging is great for making systems more visible internally, but it tends to blow up the logging budget (i.e. you will increase your S3/GCS costs) and exposes your system to a myriad of second-order effects (silent risks), for example non-linear performance issues due to IO/networking.

  • Getting the right cardinality for logging is hard, and not everyone does it right. I want the most granular log line that can help me investigate an API returning the same recommendations for 15 days in a row, but at the same time I have been in scenarios where the company spent thousands of dollars on useless storage because high logging cardinality just created noise.

  • Sebastian Daschner also made a great case for how logging can bring a lot of second-order problems in a piece called The Case Against Logging, where he highlights issues like performance, disk IO, upstream dependencies, and levels.

Do not get me wrong, I love logs, and logging every aspect of an ML API is useful for debugging and following what is going on during runtime; but in reality, after some point in time, I discovered that simple journaling via structured events was more useful.

Using Structured Events for Journaling ML API outputs

As usual, instead of trying to define something off the top of my head, I would like to bring in the definition of Structured Events from Honeycomb:

An event is a collection of information about what it took to complete a unit of work. For example, an event might be the process of accepting an HTTP request, doing the required work, and then passing back a response. An event is structured because each field of its data is labelled. Fundamentally, any data that comes into Honeycomb enters as a structured event.

As I will show a little further in this blog post, events represent a photograph of a request made within a space-time frame, which I can enrich with all the necessary details at once, without having to worry about aggregations, levels, or flow transitions.

On top of that, I can query, search, and aggregate those events, and they give me a preemptive information set with almost everything I need to infer and understand the application/system/model behavior, and even to learn what I do not know that I don’t know.

In other words, I do not want to have a voyeuristic view of my system, watching logs and thinking they will tell me what is going on, while in the meantime those logs are actively trying to make me mentally drunk with verbosity, noise, and triviality during some incident.

Actually, the mf’er will tell me from the inside what is happening, in a rich and dynamic manner, providing maximum context for any potential troubleshooting. If possible, the system will already give me the clues to diagnose the issue before I move a single inch to do anything.

Charity Majors made a great breakdown of why structured events matter in a post called Logs vs Structured Events.

What does an ideal Structured Event look like for me?

Since this blog post is already quite long, I’ll break down each dimension at some point in the future, but a non-exhaustive Structured Event for an ML system behind a Text Classification API would be something like this:
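
All values below are made up, and the field names are just the ones I tend to use:

```json
{
  "event_id": "c1a2b3d4-5e6f-7a8b-9c0d-1e2f3a4b5c6d",
  "event_timestamp": "2022-09-22T14:03:07.412Z",
  "request_id": "req-8f14e45f",
  "client_platform_id": "mobile-app-android",
  "api_service_name": "text-classification-api",
  "api_service_version": "2.3.1",
  "model_artifact_name": "sentiment-classifier",
  "model_artifact_version": "1.4.2",
  "model_training_dataset_reference": "s3://ml-artifacts/datasets/sentiment/2022-08-31",
  "experiment_variant": "ab-test-champion",
  "raw_input_text": "the delivery was late and the package arrived damaged",
  "preprocessing_steps_applied": ["lowercase", "strip_punctuation", "tokenize"],
  "preprocessed_input_text": "the delivery was late and the package arrived damaged",
  "input_token_count": 9,
  "model_output_probabilities": {"negative": 0.87, "neutral": 0.09, "positive": 0.04},
  "predicted_label": "negative",
  "decision_threshold": 0.5,
  "response_http_status": 200,
  "total_latency_ms": 42
}
```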

Some of the fields are self-describing, but here are the main characteristics of what an ideal structured event looks like for me:

  • Plain JSON blob: Gui Latrova just nailed what I think a good JSON log should bring: human readability, easy parsing, virtually no limits to include textual information, filtering, querying, and easy unpacking into SQLite (see the sketch after this list)
  • Long names in the dimensions
  • Explicit context and descriptions
  • Request identifiers
  • Client identifiers
  • High Cardinality
  • Request context
  • ML model artifacts info (even experimental setups)
  • Inputs, transformations and outputs info
  • No internal states
  • Artifacts lineage (if it’s possible)
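
For the SQLite point above, a small sketch of slicing those events once they are unpacked; it assumes the events were exported as one JSON document per line and that your SQLite build ships the JSON1 functions (modern builds do), and it reuses the made-up field names from the event above:

```python
import sqlite3

# Load one JSON event per line into an in-memory table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (body TEXT)")

with open("events.jsonl") as f:
    conn.executemany(
        "INSERT INTO events (body) VALUES (?)",
        ((line,) for line in f if line.strip()),
    )

# Example question from earlier: how does the model behave with long sentences?
rows = conn.execute(
    """
    SELECT json_extract(body, '$.predicted_label')                          AS label,
           AVG(json_extract(body, '$.model_output_probabilities.negative')) AS avg_p_negative
    FROM events
    WHERE json_extract(body, '$.input_token_count') > 100
    GROUP BY label
    """
).fetchall()
print(rows)
```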

One important note: in most scenarios where we are dealing with ML models in stateless APIs, each event should be generated once and only once, and it must be saved somewhere external to the local environment!!! Please send those events to some filesystem in the cloud (S3/GCS), an event broker, or some timeseries database that supports JSON blobs; but for the sake of your future self, PLEASE DO NOT STORE THESE EVENTS LOCALLY in some events_20220922_02_34.json.
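
To make that concrete, here is a minimal sketch of shipping each event straight to object storage with boto3; the bucket name, the key prefix, and the emit_structured_event helper are made up for illustration:

```python
import json
import uuid
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")


def emit_structured_event(event: dict, bucket: str = "my-ml-events") -> None:
    """Serialize the event once and ship it to external storage (S3 here),
    instead of appending it to a local events_*.json file."""
    event_id = event.get("event_id", str(uuid.uuid4()))
    # Partition keys by service and date so the events stay easy to query later.
    key = (
        "text-classification-api/"
        f"{datetime.now(timezone.utc):%Y/%m/%d}/"
        f"{event_id}.json"
    )
    s3.put_object(Bucket=bucket, Key=key, Body=json.dumps(event).encode("utf-8"))
```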

Closing Thoughts

Structured Events in ML APIs not only ensure that you can capture ML-specific information from your API service, but they also help with model diagnostics and are the first step toward a desirable state of observability. Those Structured Events can be used as feedback training data in a Continuous Training Pipeline, or even in a scenario where you need to roll out several experiments at the same time and have each request tagged with all the information needed to understand how the real world is treating the ML models behind your API.

References and further reading