My Personal Holy Trinity for Machine Learning Reproducibility

Short and direct:

MLflow
Why do I use it? (a.k.a. What was my pain?)
One of the most painful situations I faced was spending a huge amount of time writing code for hyperparameter search and for tracking the whole experimental setup. With MLflow, the only thing I need to invest time in is pre-processing the data and choosing the algorithm to train; model serialization, data serialization, and packaging are all handled by MLflow. Another great advantage is that the best model can easily be deployed behind a REST API instead of using a custom Flask script.

https://www.youtube.com/watch?v=ek4mJnDw8eE
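
To make the pain concrete, here is a minimal sketch of the bookkeeping I used to write by hand for every hyperparameter search. It uses only the Python standard library and is not MLflow’s API: `train_and_score`, the grid values, and the `runs.jsonl` file name are all made up for illustration. With MLflow, the logging part is replaced by calls such as `mlflow.start_run()`, `mlflow.log_params()`, and `mlflow.log_metric()`.

```python
import itertools
import json
import tempfile
from pathlib import Path


def train_and_score(learning_rate, depth):
    """Hypothetical stand-in for a real training run; returns a fake score."""
    return round(1.0 - abs(learning_rate - 0.1) - 0.01 * abs(depth - 4), 4)


def run_search(grid, log_path):
    """Try every combination in the grid and log params plus score per run."""
    names = sorted(grid)
    runs = []
    with open(log_path, "w") as log:
        combos = itertools.product(*(grid[name] for name in names))
        for run_id, values in enumerate(combos):
            params = dict(zip(names, values))
            record = {
                "run_id": run_id,
                "params": params,
                "score": train_and_score(**params),
            }
            # One JSON line per run: the hand-rolled "experiment tracker".
            log.write(json.dumps(record) + "\n")
            runs.append(record)
    # The best run is the one with the highest score.
    return max(runs, key=lambda run: run["score"])


# Illustrative grid; the values are assumptions, not recommendations.
grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4, 8]}
log_path = Path(tempfile.mkdtemp()) / "runs.jsonl"
best = run_search(grid, log_path)
print(best["params"], best["score"])
```

Every project ended up with its own slightly different version of this loop and log format, which is exactly the part I no longer want to maintain.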

Caveats: I really love Databricks, but I think they sometimes move so fast in their development (sic) that it can cause problems, especially if you rely on a very stable version: a sudden migration (e.g. RDD to DataFrame) can cost you a lot of work because you have to rewrite things.

Pachyderm
Why do I use it? (a.k.a. What was my pain?)
Data pre-processing can be very annoying, and there are a lot of new tools that overpromise to solve it but in reality are just over-engineered stuff with good marketing (see this classic talk by Daniel Molnar, at minute 15:48, to understand what I’m talking about).

https://www.youtube.com/watch?v=LTJNnlBBzuw

My main wish over the last 5 years has been to package all my dirty SQL scripts in a single place, run them with decent version control using Kubernetes and Docker, and throw all the ETLs built in Jenkins in the trash (a.k.a. embrace the dirty, cold, and complex reality of ETL). Nothing less, nothing more.

So, with Pachyderm I can do that.
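
As a rough sketch of what that looks like, the snippet below builds a Pachyderm-style pipeline spec as a Python dict and prints it as JSON. The field layout (`pipeline.name`, `transform.image`, `transform.cmd`, `input.pfs.repo`, `input.pfs.glob`) follows my understanding of the pipeline spec; the pipeline name, repo name, Docker image, and script path are all hypothetical, so check the spec for your Pachyderm version before relying on it.

```python
import json


def sql_pipeline_spec(name, input_repo, image, script):
    """Build a pipeline spec that runs one packaged ETL script in a container.

    All argument values passed below are hypothetical placeholders.
    """
    return {
        "pipeline": {"name": name},
        # Input files from the repo are mounted under /pfs/<repo> in the container.
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        # The image is expected to contain the SQL scripts and a small runner.
        "transform": {"image": image, "cmd": ["sh", script]},
    }


spec = sql_pipeline_spec(
    name="daily-etl",                      # hypothetical pipeline name
    input_repo="raw-events",               # hypothetical input repo
    image="example/sql-etl:1.0",           # hypothetical Docker image
    script="/scripts/run_etl.sh",          # hypothetical script inside the image
)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The JSON that comes out is the kind of file you would hand to `pachctl create pipeline -f`; the SQL itself lives in the Docker image, so it is versioned together with the code that runs it.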

Caveats: You’ll need to know Docker and embrace all the problems that come with it, and the bug list can be a little frightening.

DVC
Why do I use it? (a.k.a. What was my pain?)
MLflow can serialize data and models, but DVC takes this reproducibility to another level. With fewer than 15 git-like bash commands you can easily serialize and version your data, code, and models. You can put the entire ML pipeline in a single place and roll back to any point in time. In terms of reproducibility, I think this is the best all-round tool.

https://www.youtube.com/watch?v=4h6I9_xeYA4
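
The idea behind that kind of rollback can be sketched in a few lines of standard-library Python. This is a conceptual toy, not DVC’s actual implementation: the `snapshot` and `restore` helpers, the `.pointer` file suffix, and the choice of SHA-256 are my own assumptions for illustration. In DVC itself the equivalent steps are `dvc add` and `dvc checkout`, with git versioning the small pointer file.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def snapshot(data_file, cache_dir):
    """Copy the file into a content-addressed cache and write a pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The tiny pointer file is what you would commit to git.
    pointer = data_file.with_name(data_file.name + ".pointer")
    pointer.write_text(digest + "\n")
    return digest


def restore(data_file, cache_dir, digest):
    """Roll the data file back to the version identified by the digest."""
    shutil.copy2(cache_dir / digest, data_file)


workspace = Path(tempfile.mkdtemp())
cache_dir = workspace / "cache"
cache_dir.mkdir()
data_file = workspace / "train.csv"  # hypothetical dataset name

data_file.write_text("id,label\n1,0\n")
v1 = snapshot(data_file, cache_dir)

data_file.write_text("id,label\n1,0\n2,1\n")
v2 = snapshot(data_file, cache_dir)

# Roll back to the first version of the data.
restore(data_file, cache_dir, v1)
restored = data_file.read_text()
print(restored)
```

Because the large file is stored by its hash and only the pointer goes into git, checking out an old commit is enough to know exactly which data version belongs to it.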

Caveats: Compared with MLflow, navigating the experiments here is a little tricky and takes some time to get used to.