How to build your own Data Science environment with Docker and Anaconda3

Container Image

Anaconda is an amazing all-in-one distribution for scientific computing, data science, machine learning applications, large-scale data processing, predictive analytics, and more. It ships ready to use with most of the major data science packages and runs on Windows, Linux, and macOS. This article walks through building your own Anaconda Docker image and pushing it to Docker Hub for your own use; by the end you will be able to spin up a container and open Jupyter Notebook in your browser. A basic understanding of Linux and Docker is assumed, and it helps if you already know some Linux and docker commands.

Why Anaconda?

Anaconda lets you avoid much of the tedious work of spinning up a data science environment. It bundles more than 1,500 packages for Python and R and supports isolated environments through the conda command. Preparing the required packages and isolating environments used to be time-consuming: you would virtualize your project, install everything in place, hit dependency hell, search the web for a fix, break something else, start from scratch, and run into yet another issue. With the Anaconda distribution that cost is now close to zero. If your focus is not data science, Pipenv is also a great tool for package management and interpreter switching; it generates the all-important Pipfile.lock, which is used to produce deterministic builds, much like package-lock.json in Node.js. But if you need a data science and machine learning environment with minimum effort, the Anaconda distribution is definitely the first choice.
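For example, spinning up an isolated environment with conda takes just a couple of commands (the environment name and package list below are only illustrative):

    # create an isolated environment with its own Python and packages
    conda create -n ds-sandbox python=3.9 numpy pandas scikit-learn
    # switch into the new environment
    conda activate ds-sandbox
    # confirm which environments exist
    conda env list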

Then Why Docker?

With the Anaconda distribution, the cost of preparing a data science environment has already dropped dramatically. So where does Docker come in? Docker is a containerization technology, and in a DevOps context it brings plenty of benefits for improving application agility and modernizing traditional infrastructure. But does that matter for our use case? The reasons I chose Docker were the idempotence of image builds and the portability of containers. A Docker image build is idempotent thanks to the Dockerfile, so the containerized image keeps working even as you add extra layers on top of it. Portability matters too: once you upload your image to a repository, you can work anywhere simply by pulling that image and running it.

  • Portability between different platforms and clouds
  • No hypervisor layer is needed to run, unlike VMs
  • Image layers and writable container layers are efficient
  • Integration with container orchestration (Kubernetes)
  • Efficiency for continuous integration and continuous delivery
  • The Dockerfile and the image are shareable on GitHub or Docker Hub

In general there are a bunch of benefits of containerization by using Docker. This is a good reference.

The Benefits of Containerization and What It Means for You

Compared to VMs, containerization is still quite different. You might think that a VM with Anaconda3 would be fine. But with a Docker image of Anaconda3 you gain capabilities such as integration with other containers through docker-compose, portability by publishing your image to Docker Hub, and customizability by adding layers to the image. These are powers that containerization unlocks and that VMs do not offer.

On the other hand, these capabilities have pulled us into a modernized application and infrastructure world that is hard to grasp fully. With containerization, people can enhance application delivery efficiently through continuous integration and continuous delivery. A container becomes a disposable layer for delivering frequent and reliable application code changes.

In this article I use Anaconda3 purely for data science, not application delivery, so the goal is neither to run the image on a container orchestration service such as AWS EKS or Kubernetes, nor to plug it into a continuous integration and continuous delivery pipeline. Here a container does not have to be disposable: you can keep modifying and customizing it.

Anaconda3 Official image

If you already have Docker installed on your computer or server, it is quite straightforward to pull the official Anaconda3 image and run Anaconda3 yourself. Here's the official page: continuumio/anaconda3
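Pulling the official image and opening an interactive shell in it looks like this:

    docker pull continuumio/anaconda3
    docker run -it continuumio/anaconda3 /bin/bash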

The base environment is already activated when you run a container from the Docker Hub image. You will need to run conda update / conda install commands to bring the packages you need, such as NumPy, SciPy, pandas, matplotlib, and scikit-learn, up to the required versions.
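Inside the container that boils down to something like the following (the package list is just an example):

    # update conda itself in the base environment
    conda update -n base conda
    # bring the core data science packages up to date
    conda install numpy scipy pandas matplotlib scikit-learn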

The --allow-root option was required to open a Jupyter notebook without errors in my case.
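A typical launch command with that option looks like this (--ip=0.0.0.0 makes the server reachable from outside the container; the exact flags may vary with your setup):

    jupyter notebook --notebook-dir=/opt/notebooks \
        --ip=0.0.0.0 --port=8888 --no-browser --allow-root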

Since Notebook version 4.3, token-based authentication is enabled, and a password can be used as an alternative. You can issue the jupyter notebook list command in the running container to get the accessible URL including your token. If you set a password, the hashed password is stored in the jupyter_notebook_config.json file.
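Both can be checked from a shell in the running container (the jupyter notebook password command is available in recent Notebook versions):

    # show the running servers together with their token URLs
    jupyter notebook list
    # optionally set a password; the hash is written to jupyter_notebook_config.json
    jupyter notebook password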

Security in the Jupyter notebook server

Customize Dockerfile

A Dockerfile is not strictly required for a Docker container; you can simply add layers on top of the existing Anaconda3 image and commit the result as a new image without any harm. However, it is beneficial to use a Dockerfile so you can manage your image as a file in a repository (for example on GitHub) and link that repo to Docker Hub to automate image builds; Docker Hub calls this automated builds.

In this section I'd like to focus on the following procedures to create your own data science environment.

  1. Include the Jupyter Notebook installation and open a notebook
  2. Install the libraries you need in the image

I created one for myself based on the official Dockerfile; this image adds the Python libraries Ta-Lib, backtesting, pandas-highcharts, mpl_finance, pandas-datareader, optuna and py-xgboost. Here is the example (the full Dockerfile is in the repository linked at the end of this article).
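Its general shape, on top of the continuumio/anaconda3 base image, looks roughly like the sketch below; the real file also builds the TA-Lib C library, which is left out here for brevity:

    FROM continuumio/anaconda3

    # build tools that some Python libraries need to compile native extensions
    RUN apt-get update && apt-get install -y build-essential \
        && rm -rf /var/lib/apt/lists/*

    # extra libraries on top of the base Anaconda environment
    RUN pip install backtesting pandas-highcharts mpl_finance pandas-datareader optuna
    RUN conda install -y py-xgboost

    # directory the notebook server will use, and the port it listens on
    RUN mkdir /opt/notebooks
    EXPOSE 8888

    # script that starts Jupyter Notebook when the container starts
    COPY docker-entrypoint.sh /usr/local/bin/
    RUN chmod +x /usr/local/bin/docker-entrypoint.sh
    ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]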

I created docker-entrypoint.sh to open a notebook on the default port 8888 and set it as the ENTRYPOINT in the Dockerfile above. Please create this file in the same directory as the Dockerfile.
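A minimal version of that script might look like this (the /opt/notebooks directory matches the volume mount used later):

    #!/bin/bash
    # start Jupyter Notebook on the default port, listening on all interfaces
    # so the published port is reachable from the host browser
    exec jupyter notebook --notebook-dir=/opt/notebooks \
        --ip=0.0.0.0 --port=8888 --no-browser --allow-root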

How do you add the libraries you need? Simply replace the libraries on the pip install and conda install lines (lines 24 and 25 of the Dockerfile in the repository) with the ones you want.

Build and push your image

Now we're ready to build an image from the Dockerfile and docker-entrypoint.sh we created. Place these two files in the same directory and build your image with the following commands.
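Replace <your-dockerhub-id> below with your own Docker Hub account name:

    # build the image from the Dockerfile in the current directory
    docker build -t <your-dockerhub-id>/anaconda3:latest .
    # run it and publish the notebook port to the host
    docker run -p 8888:8888 <your-dockerhub-id>/anaconda3:latest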

You can also mount a local directory to /opt/notebooks, which is where the Jupyter Notebook server looks for .ipynb files. Once the volume is mounted, all files in that local directory are visible in Jupyter Notebook with read and write access.
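For instance, mounting a local notebooks directory (the local path is just an example):

    docker run -p 8888:8888 \
        -v $(pwd)/notebooks:/opt/notebooks \
        <your-dockerhub-id>/anaconda3:latest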

Finally, you can push your image to Docker Hub and make it usable anytime and anywhere if you want.
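Pushing follows the usual Docker Hub workflow:

    docker login
    docker push <your-dockerhub-id>/anaconda3:latest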

Anaconda is a useful, de facto standard tool for data science projects nowadays. Combined with the portability of containerization, you get your own data science environment wherever you are: just pull your image from Docker Hub, and the container comes with exactly the libraries you installed for your needs.

Here’s the github repository for your reference.

https://github.com/yuyasugano/anaconda3
