Productionizing Machine Learning Models and taking care of the neighbors

At Movile we have a Machine Learning Squad composed of the following members:

  • 1 Tech Lead (mixed engineering and computational background)
  • 2 Core ML engineers (production side)
  • 1 Data Scientist (with statistical background) - (data analysis and prototyping side)
  • 1 Data Scientist (with computational background) - (data analysis and prototyping side)

As we can see, the team brings together different backgrounds, and to make the whole workflow as productive and smooth as possible we need good fences (a.k.a. a crystal-clear vision of each role) to keep everyone motivated and productive. This article, written by Jhonatan Morra, offers a good perspective on this and on how we deal with it at Movile. Here are some quotes:

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.
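That separation of concerns can be made concrete with a small interface sketch: the scoring team depends only on a prediction contract, so the data science team can swap in a new model without any downstream changes. This is an illustrative sketch, not Movile's actual code; the class and function names (`ChurnModel`, `MajorityBaseline`, `score`) are hypothetical.

```python
from abc import ABC, abstractmethod

class ChurnModel(ABC):
    """The contract the scoring team depends on: raw record in, prediction out.

    Everything behind this interface (features, algorithm, hyperparameters)
    belongs to the data science team and can change freely.
    """
    @abstractmethod
    def predict(self, record: dict) -> float:
        ...

class MajorityBaseline(ChurnModel):
    """A trivial stand-in model. The data science team can later replace it
    with any trained model without touching the scoring service."""
    def __init__(self, base_rate: float):
        self.base_rate = base_rate

    def predict(self, record: dict) -> float:
        # Ignores the record entirely; a real model would featurize it here.
        return self.base_rate

def score(model: ChurnModel, record: dict) -> float:
    # The scoring side sees only the interface, never the model internals.
    return model.predict(record)
```

With this shape, upgrading from `MajorityBaseline` to a trained model is invisible upstream: the pipeline keeps calling `score(model, record)`.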

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.
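The churn example above, where featurization and the random forest stay hidden from the scoring team, can be sketched with scikit-learn's `Pipeline`, which bundles both steps into a single object exposing one prediction call. The user fields and training data below are made up purely for illustration.

```python
# Hypothetical churn sketch: DictVectorizer (featurization) and a
# RandomForestClassifier are wrapped in one Pipeline, so the scoring
# side calls a single predict_proba() and never sees the internals.
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Tiny synthetic training set; field names are invented for this example.
users = [
    {"days_since_last_login": 1,  "monthly_spend": 30.0, "plan": "premium"},
    {"days_since_last_login": 45, "monthly_spend": 0.0,  "plan": "free"},
    {"days_since_last_login": 3,  "monthly_spend": 12.5, "plan": "basic"},
    {"days_since_last_login": 60, "monthly_spend": 0.0,  "plan": "free"},
]
churned = [0, 1, 0, 1]  # 1 = user churned

model = Pipeline([
    ("featurize", DictVectorizer()),   # raw user dict -> numeric feature vector
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])
model.fit(users, churned)

# The scoring team's entire view of the model: one call, one probability.
new_user = {"days_since_last_login": 50, "monthly_spend": 0.0, "plan": "free"}
p_churn = model.predict_proba([new_user])[0][1]
```

Serializing the fitted `Pipeline` as a single artifact is one common way to hand the model to the scoring side without exposing the featurization step.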