Reproducibility with Docker

Learn how to set up your own Jupyter Notebook server using Docker to capture common environment information and to improve the reproducibility of your analyses within your team.

Introduction

One of the most frustrating issues in the life of a data scientist, especially for those working as part of a larger data team, is dealing with different operating systems, missing dependencies, inconsistent package versions, and so on, all of which make reproducing and collaborating on each other's notebook analyses extremely difficult.

This is where Docker comes in. Docker allows us to get a ‘ready to go’ Jupyter ‘Data Science’ notebook stack up and running in no time at all in what’s known as a container. Containers virtualise at the operating-system level rather than emulating hardware, which makes them extremely portable and efficient. They keep your development environment consistent across machines and can ship common data science libraries out of the box.

In this tutorial, we’re going to show you how to set up your own Jupyter Notebook server using Docker: how to capture your environment information in a Docker image and how to run your notebooks as a Docker container. We won't delve into the basics; this tutorial assumes a working knowledge of Jupyter and Docker, and that you:

  1. Already have a Docker Hub account, which is free. Go to https://hub.docker.com/ and register.

  2. Have downloaded Docker from the Docker Store and installed it on your machine.

  3. Are logged in with your Docker Hub credentials.

If you are unfamiliar with Docker, the tutorial linked below is great for quickly learning the basics and getting Docker up and running on your operating system.
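Once everything is installed, you can sanity-check your setup from a terminal. hello-world is Docker's standard test image, so this is a quick way to confirm that the daemon is running and that you can reach Docker Hub:

docker --version       # confirm the Docker client is installed
docker run hello-world # pull a tiny test image and run it in a container
docker login           # authenticate with your Docker Hub credentials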

Why use Docker with Jupyter?

Projects and notebooks within your team's scope often:

  • Contain non-standard or custom packages

  • Have OS-level dependencies

  • Are run on machines in the cloud

  • Contain environment variables specified in the notebooks

Note that these are especially pertinent points if you are regularly reproducing and collaborating on each other's work.

Docker terminology

There are some basic building blocks of the docker ecosystem that we should clarify before we proceed.

  • Image: This is the executable package that contains the operating system and all files and installed packages within the environment.

  • Container: A running instance of a Docker image. The container is what actually runs the application.

  • Docker Hub: A registry to store docker images. Think of this as a directory of all available Docker images.

  • Docker Daemon: The background service that runs on your host operating system and manages the building, running and distributing of Docker containers. It is the process that Docker clients talk to.

  • Docker Client: The CLI that allows us to interact with the daemon.

  • Dockerfile: A text file containing the ordered list of instructions that Docker executes to build an image.
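To make these terms concrete, here is how they map onto everyday commands (jupyter/base-notebook is one of Project Jupyter's stock images; my-notebook is a placeholder name):

docker pull jupyter/base-notebook # the client asks the daemon to fetch an image from the Docker Hub registry
docker run jupyter/base-notebook  # the daemon starts a container from that image
docker build -t my-notebook .     # the daemon builds a new image from the Dockerfile in the current directory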

Dockerizing your team's workflow

Project Jupyter offers various Docker images in its GitHub repository, each of which is composed of different libraries and Jupyter kernels. Today, however, we will be working with Kyso's custom Docker image for data science and machine learning as an example.

Let's say you have a local directory with your notebooks. Below, we'll go through how to run this directory as a docker container.

There are 2 workflows to consider:

1. Simply running our existing image

Kyso's Docker image is available on Docker Hub here.

To run Kyso's Docker image, execute the following in the directory containing your notebooks:

docker run --rm -it -v "$(pwd):/home/jovyan" -p 8888:8888 kyso/jupyterlab
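What each part of that command does (these are standard docker run flags; /home/jovyan is the default notebook home directory in Jupyter Docker stacks):

# --rm                      remove the container when it exits
# -it                       attach an interactive terminal session
# -v "$(pwd):/home/jovyan"  mount the current directory as the notebook home inside the container
# -p 8888:8888              publish the container's Jupyter port 8888 on host port 8888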

If you want to be able to use sudo inside the container (for installing libraries, for example), use:

docker run --rm -it --user root -v "$(pwd):/home/jovyan" -p 8888:8888 kyso/jupyterlab

To avoid permission-denied errors when writing files inside the container, run the image with the environment variable CHOWN_HOME=yes, passed with the -e flag:

docker run --rm -it --user root -v "$(pwd):/home/jovyan" -p 8888:8888 -e CHOWN_HOME=yes kyso/jupyterlab

Note that if you plan on running the image locally frequently, it is handy to create an alias in your ~/.bash_profile:

alias kyso="docker run --rm -it --user root -v "$(pwd):/home/jovyan" -p 8888:8888 kyso/jupyterlab"

2. Creating your own image

First, we need to create our Dockerfile. Check out the full list of instructions you can include here.

The commands we will use in this brief guide are:

FROM initialises a new build stage and sets the base image for subsequent instructions.

RUN executes commands in a new layer on top of the current image and commits the result.

  • Create your Dockerfile with touch Dockerfile at the root of the directory containing your notebooks.

  • Open the file in your editor of choice. Now we can install or remove packages with specified versions, set environment variables, etc. We are going to pull Kyso's image as our base image and add another package we want to work with.

For example, let's say we want to extend the image with altair, a declarative statistical visualisation library for Python based on Vega and Vega-Lite. Our Dockerfile will look like:

FROM kyso/jupyterlab
RUN pip install altair && \
    jupyter labextension install @jupyterlab/vega3-extension

This Dockerfile states that the base image is Kyso's. If it doesn't exist locally, it will be downloaded from Docker Hub. The RUN command executes pip install altair to install the plotting library and jupyter labextension install @jupyterlab/vega3-extension to add the vega3 JupyterLab extension. Notice the backslash used to continue the RUN command over two lines.
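For stricter reproducibility, you can pin exact package versions and bake environment variables into the same Dockerfile. Here is a minimal sketch; the version number and the DATA_DIR variable are illustrative, so substitute your project's own:

FROM kyso/jupyterlab
# Pin exact versions so every build produces the same environment
RUN pip install altair==2.2.2 && \
    jupyter labextension install @jupyterlab/vega3-extension
# Environment variables set here are available to every notebook in the container
ENV DATA_DIR=/home/jovyan/data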

Now there are a few more steps to take:

  • Build the Docker image:

docker build . -t image_name

The -t image_name flag names the new image, and the dot (.) tells Docker to look for the Dockerfile in the current directory. Notice that new layers are created and removed as the lines of the Dockerfile are interpreted.

The new image now exists in Docker's local cache. You can list your images with:

docker images

  • A container from this image can be created like so:

docker run -it --name kyso_container image_name

  • But for now, let's just run the image. You can start the Jupyter server with:

docker run --rm -it --user root -v "$(pwd):/home/jovyan" -p 8888:8888 image_name

JupyterLab launches automatically from the image you have just built, and the command binds the container's Jupyter port 8888 to port 8888 of the host machine.

You can access your notebooks at localhost:8888 and enter the token provided.

The images you create exist only in Docker's local cache. To share them with other Docker hosts or teammates, or to make them public, you need to publish them to a Docker registry, such as the cloud-based Docker Hub, which is the default if you don't explicitly specify a registry. First create a free Docker ID, then log in so you can push and pull images from Docker Hub:

docker login

The notation for associating a local image with a repository on a registry is username/repository:tag. The tag is optional, but recommended, since it is the mechanism that registries use to give Docker images a version.

docker tag image_name username/repository:latest

This associates your local image with the chosen repository and tags it as latest; nothing is uploaded until you push.

Run docker image ls to see your newly tagged image.

Then upload your tagged image to the repository:

docker push username/repository:latest

From now on, you can run your notebooks on any machine with Docker installed using:

docker run --rm -it --user root -v "$(pwd):/home/jovyan" -p 8888:8888 username/repository:latest

Conclusion

In this article, we’ve covered how to extend and run existing images and customise them to your needs using a Dockerfile. We’ve also seen how to publish the images to a Docker registry.

How does your team benefit from your newly dockerized Jupyter projects?

Your team members can now reproduce your analyses and results with no issues when you include the Dockerfile alongside your Jupyter notebooks.
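For instance, a teammate could reproduce your exact environment in a few commands (the repository URL and image name here are placeholders):

git clone https://github.com/your-team/your-project.git # contains the notebooks and the Dockerfile
cd your-project
docker build -t your-project .                          # rebuild the exact environment
docker run --rm -it -v "$(pwd):/home/jovyan" -p 8888:8888 your-project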

As a best practice, always commit these Dockerfiles along with your notebooks to version control, e.g. on GitHub. You can learn how to sync your GitHub account with Kyso at the link below:

That’s all! If you find the tutorial useful, do check out Kyso for publishing and collaborating on your Jupyter notebooks.