This past weekend I was at the data science hack day and had a great time. The team was clearly intelligent, and I really enjoy working and learning in that kind of environment.


We went over Docker and its utility in the data science field. While I understood the Docker abstraction (and the basic motivation) previously, I still didn't feel like I "got it". This hands-on experience with Docker really opened my eyes to how the tool is used and the benefit companies are getting from it. It's always nice to hear use cases from real companies solving real problems.

While Docker's workflow isn't quite a push to a server (the way git's is), the reproducibility it brings to modeling in a data science context is powerful. That is, there's no more dealing with one machine getting results that other machines don't, or having to set up Chef or bash scripts to install a ton of dependencies every time you start a server. Docker abstracts all of that away.

You can package up a model in a Docker container, have it run on some data, and get results back quickly. If you change the model, you know that other people will be able to replicate the results because the model is containerized.
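As a sketch of what that looks like (the image name ds-model and the script train.py are hypothetical stand-ins for your own), a containerized run might be:

```shell
# Mount local data read-only, run the model, and write results back out.
# --rm removes the container once the run finishes, so every run starts clean.
docker run --rm \
  -v "$(pwd)/data":/workspace/data:ro \
  -v "$(pwd)/results":/workspace/results \
  ds-model \
  python train.py --input /workspace/data --output /workspace/results
```

Anyone with the same image and the same data gets the same environment, which is what makes the results reproducible.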

On top of that, running automated tests or trying different model parameters in a parallel way is powerful.
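For instance (again a sketch, reusing the hypothetical ds-model image and an --alpha parameter it would accept), a parameter sweep can fan out as detached containers:

```shell
# Launch one detached container per parameter value.
for alpha in 0.01 0.1 1.0; do
  docker run -d --name "sweep-alpha-$alpha" \
    -v "$(pwd)/data":/workspace/data:ro \
    ds-model \
    python train.py --alpha "$alpha"
done

# Collect output from each run once they finish.
for alpha in 0.01 0.1 1.0; do
  docker logs "sweep-alpha-$alpha"
done
```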

I was able to get my own Docker image up and running from the wiseio base, and I'll be maintaining it on my GitHub. The primary difference is that I added lxml for simple web scraping with pandas.
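A Dockerfile for a change like that is tiny; a sketch (the base image name here is illustrative, not the exact wiseio tag):

```dockerfile
# Start from the wiseio data science base image (name/tag illustrative)
FROM wiseio/datascience-base

# Add lxml so pandas.read_html can parse scraped pages
RUN pip install lxml
```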

Now, I'm using a Mac, so I have to use boot2docker, which seems to work pretty well.

Once I had it all installed, I ran:

boot2docker up

You can get started with the image I created by cloning it locally. Once you've installed Docker, simply run make in the GitHub directory and it will build the Docker image and save it locally for you. You can see which images you have on your machine with the docker images command.

Now whenever you want to create a repeatable data science experiment, you just navigate to that directory and run:

alias do-ds='docker run -d -p 80:8888 -v `pwd`:/workspace/ -v `pwd`/data:/workspace/data -e "PASSWORD=$IPYTHON_PASSWORD" ds-base ; echo "Now go to your browser: http://$(boot2docker ip). The password is $IPYTHON_PASSWORD" '

This sets your IPython password from the $IPYTHON_PASSWORD environment variable (DATASCI in my case) and gives you the do-ds alias, letting you start everything up in the current directory with all the necessary parameters. I just put it in my .zshrc file for future use.

Please note that you've got to use backticks when setting this alias in zsh, but not in bash, if I'm not mistaken. I had to learn a little from this SO post.

Finally, run:

do-ds

You'll see a printout similar to:

Now go to your browser: http://<boot2docker ip>. The password is DATASCI

This gives you the container ID, then the URL and password for accessing the IPython notebook running in that container. It will also have created a data directory inside your current directory, where you can add files that you want to access from the Docker container. Now you can write your code and perform your analysis, and those files will automatically be saved to your local machine for later work.

What's nice is you get a good, clean build every time you want to study or model something new.

Here's another example: say I've got a model that works on some data in a local directory, and I want to see whether it operates differently inside Docker. Obviously this isn't a real instance of such an example, but you can start to see the power of the Docker abstraction.

cat ~/Desktop/temp/model.py  # simple print statement, but this could be a whole model...
print "Testing data..."

docker run -t -i -v ~/Desktop/temp/:/temp ubuntu:14.04 /bin/sh -c "apt-get install -y python; python /temp/model.py"

Our instructions basically mount the directory into the container, install Python, then run the script (model.py here is just whatever file holds your model):

Reading package lists... Done
Building dependency tree
After this operation, 16.0 MB of additional disk space will be used.
Get:1 trusty/main libpython2.7-minimal amd64 2.7.6-8 [307 kB]
Fetched 3734 kB in 9s (409 kB/s)
Testing data...

The container automatically shuts down once the model has run.

Finally, you can see all the containers you are running with docker ps, and shut them down with docker stop <container-id>, or just by using the name you'll see at the far right.
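A quick cleanup session might look like this (the container ID and name below are illustrative, not real output):

```shell
# List running containers; the NAMES column is on the far right
docker ps

# Stop by ID, or by the auto-generated name
docker stop 3f6c0a9d2b1e
docker stop condescending_pike
```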

Now, there's obviously a lot more to Docker than what's above. Integrations like AWS Container Service and AWS EBS seem extremely powerful for helping you scale your Docker applications. I will definitely continue to use Docker in my data science projects, and I'd encourage anyone who's interested to give it a try in their own workflow as well.