Docker Environment for Data Analysis
2020 Oct 27
TL;DR: Reproducibility not only brings more transparency; it is an essential tool for any “Science” work.
…
Throughout my career, I have been fighting the good fight for reproducibility in data science within the corporate world: whether talking about the tools I use daily, pointing out that even the mighty scientific journal Nature approves research that wouldn’t pass a senior thesis, or showing how reproducibility helps detect methodological problems.
Finally, I think that reproducibility not only ends the culture of opinion but also unmasks a lot of people with many opinions and little data, given that the data and the code speak for themselves.
In this post, I will share what I use daily for data analysis, prototyping, and the like, and I hope it helps anyone moving toward greater reproducibility in their analyses.
Much of what I use is directly linked to Danny Janz’s post, “Containerize your whole Data Science Environment (or anything you want) with Docker-Compose”. For those who want more details regarding Docker images, I highly recommend that post.
File Structure
I use the following directory structure to generate the environment for each analysis:
analysis
|- src
|  |- data
|- Dockerfile
|- .dockerignore
|- docker-compose.yml
|- README.md
|- requirements.txt
There are many great project structure templates like the excellent Cookiecutter Data Science, but in my case, I prefer starting with as few folders as possible for personal organization. Of course, depending on the project, there might be a need for something more structured from the start.
Still regarding directories, I usually put the data inside the src folder.
I know this is an anti-pattern, but besides making navigation through sub-directories easier, I have the advantage of using DVC for data versioning (the data itself stays in AWS S3 and DVC records only the versioning pointers), and the DVC files stay organized in a single directory.
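For context, a minimal sketch of that DVC workflow, assuming an S3 remote; the bucket and file names are illustrative, not the ones I actually use:

# one-time setup: initialize DVC and point it at an S3 remote
dvc init
dvc remote add -d s3remote s3://my-bucket/dvc-storage

# track a dataset: the file itself goes to S3 on push,
# only the small .dvc pointer file gets committed to Git
dvc add src/data/raw_events.csv
git add src/data/raw_events.csv.dvc src/data/.gitignore
dvc push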
But let’s get to the other files.
Dockerfile
In the Dockerfile, I use python:3.6-buster as the base image.
My choice of Buster was due to (i) the distribution’s stability, (ii) the fact that Python 3.6 will still have long-term support, and (iii) even though the image is no masterpiece in terms of disk space when compared with the dreadful Alpine, at least it doesn’t have the absurd image size of the ubuntu:18+ versions.
Since I run some of these pipelines in GitLab CI/GitHub Actions and Amazon ECS, I have some performance criteria for deployment (i.e., the deployment cannot take too long), and python:3.6-buster delivers an interesting balance between the distribution and image-build speed.
The article by Itamar Turner-Trauring called “The best Docker base image for your Python application (April 2020)” helped a lot in my choice. However, since I had some issues in the past using slim images, I ended up going with the full buster image.
(By the way, his blog is sensational for anyone working with Python + Docker and has an interesting benchmark on these images.)
The rest of the Dockerfile is nothing special, except that for local development, I run Jupyter Notebook with the default options:
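A minimal sketch of what such a Dockerfile can look like; the working directory and Jupyter flags are illustrative assumptions, not necessarily the exact ones I use:

FROM python:3.6-buster

# illustrative project directory inside the container
WORKDIR /analysis

# install pinned dependencies first to take advantage of layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# copy the rest of the project
COPY . .

# local development: Jupyter Notebook with (mostly) default options
EXPOSE 8888
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--allow-root"]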
.dockerignore
Alexei Ledenev wrote a post about the dangers and problems of not using .dockerignore.
Essentially, he points out that not using .dockerignore not only makes the build slow but its absence can potentially cause environment variable and credential leaks.
This is the .dockerignore I use:
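A typical .dockerignore along those lines; the exact entries are illustrative and should be adjusted to the repository:

# version control and Docker metadata
.git
.gitignore
.dockerignore
# local environment files with credentials
.env
# Python debris
**/__pycache__
**/*.pyc
# large local data tracked by DVC, not needed inside the build context
src/data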
requirements.txt
In requirements.txt, I maintain a canonical set of things I generally use for exploratory data analysis like jupyter, matplotlib, pandas, and numpy.
If I need to train some baseline algorithm, I include scikit-learn.
Since part of my data is in AWS S3 and MySQL databases, I include the boto3 and s3fs packages by default to access data in S3, and PyMySQL if I need to extract data via MySQL queries.
Two points about requirements.txt that I like to highlight: (i) maintaining development and production versions and (ii) the importance of explicitly passing the pip package version.
I always keep the package versions in the Docker image identical to what is in production, no matter how large the packages are.
This guarantees, from minute zero of the analysis, environment consistency (operating system + libraries) and prevents errors like the one committed by the University of Hawaii researchers regarding the use of the glob library.
In that case, glob returns files in an order that depends on the operating system (and, consequently, changes the sequence in which the files are read).
Because the researchers reused, without paying close attention, a script from the original study written six years earlier (when the operating system happened to return the files already sorted and the script did not sort them itself), the published version started producing different results and invalidated part of the experiments.
Final result: More than 100 articles had to be corrected.
Lesson: Always use the version explicitly in requirements.txt.
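As an aside, the glob pitfall itself is easy to neutralize in code; a minimal illustration (the path is hypothetical):

import glob

# glob.glob() makes no ordering guarantee: the result depends on the OS/filesystem
files = glob.glob("src/data/*.csv")

# sorting explicitly makes the processing order deterministic across machines
files = sorted(files)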
The other point about requirements.txt: since I use packages that pull in some “obscure” dependencies (for lack of a better name), I am always watching the vulnerability reports from Snyk, Safety, and Bandit.
For those who don’t have the budget for some of these tools, GitHub offers the same type of service.
Maintaining separate development and production versions takes a bit more work, because any change that goes into production also has to be carried over to the development environment.
At least for me, it works well, since I keep a clear mental picture of everything that needs to be checked to ensure environment consistency.
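A sketch of the kind of requirements.txt that results, with every version pinned; the packages match what I described above, but the exact pins are illustrative (roughly what was current for Python 3.6 at the time of writing):

boto3==1.15.16
jupyter==1.0.0
matplotlib==3.3.2
numpy==1.19.2
pandas==1.1.3
PyMySQL==0.10.1
s3fs==0.4.2
scikit-learn==0.23.2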
docker-compose.yml
The first question that might appear is: “Why use docker-compose?”.
In my case, it came from the need to prototype with tools outside the data science spectrum, such as Redis for testing caching of offline predictions/scores, SQLite when I needed to test data engineering ideas around an embedded database, or H2O.ai for a specific machine learning training run.
In these cases, I only needed to choose an image from these platforms (or create my own) to have all services isolated, each in its own container. Goodbye dependency hell.
One last thing I want to highlight regarding configurations and credentials in environment variables: avoid using environment variables in Docker as much as possible, given the clear security issues described in Diogo Mónica’s post, “Why you shouldn’t use ENV variables for secret data”.
I recommend using Docker Secrets for credential management. Personally, only when I need to pass something that isn’t worth storing in Docker Secrets do I pass it in the form shown in the file below.
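A minimal sketch of a docker-compose.yml in that spirit; the service names, the Redis sidecar, and the environment variables are illustrative assumptions, not my exact file:

version: "3.7"
services:
  analysis:
    build: .
    ports:
      - "8888:8888"          # Jupyter Notebook
    volumes:
      - ./src:/analysis/src  # keep notebooks and data in sync with the host
    environment:
      # non-sensitive configuration only; real secrets belong in Docker Secrets
      - ENV_VAR_1=${ENV_VAR_1}
      - ENV_VAR_2=${ENV_VAR_2}
  redis:
    image: redis:6
    ports:
      - "6379:6379"          # prototyping cache for offline predictions/scores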
README.md
Finally, in README.md I describe what the solution is, what the data is, the Jira ticket, how to build the image, and how to start the containers with Docker Compose, for example:
export ENV_VAR_1='***********' && \
export ENV_VAR_2='***********' && \
docker build -t data_science_analysis . && \
docker-compose up
Lastly, I go to Jupyter Notebook at http://localhost:8888/ with the password of my choice. In this case, I set the Jupyter Notebook password to root just for demonstration purposes.
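For reference, one way to set that password, assuming the service name from the compose sketch above (an assumption on my part):

# prompts for a new password and stores a hashed copy in Jupyter's config;
# restart the notebook server afterwards for it to take effect
docker-compose exec analysis jupyter notebook password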
Final Considerations
Personally, I think of Docker as a tool that revolutionized the way of doing data science concerning aspects like reproducibility and environment management.
The configuration above is not definitive, but it’s the way I found to move faster in prototyping and analysis, and I hope it helps anyone starting to incorporate Docker into their data analyses.
References
- Pedro Paulo dos Santos, UM DOCKERFILE!!! — Imagens Docker multi-stage para múltiplas arquiteturas
- Érick Barbosa, Ambiente de desenvolvimento com Docker e Jupyter Notebook
- Christian Costa, Docker: Do Zero a Marinheiro — Entendendo o que são containers e primeiros comandos Docker!
- Hamel Husain, How Docker Can Help You Become A More Effective Data Scientist
- Danny Janz, Containerize your whole Data Science Environment (or anything you want) with Docker-Compose