Docker for Data Science Projects: A Comprehensive Guide
Introduction to Docker
Docker is an open-source containerization platform that packages applications together with their dependencies into lightweight, portable containers. For data science projects, Docker makes it easy to create reproducible, scalable, and shareable environments without worrying about software inconsistencies across systems.
Why Use Docker for Data Science?
✅ Environment Consistency – Ensures the same setup across machines.
✅ Dependency Management – Eliminates dependency conflicts.
✅ Portability – Can be deployed anywhere (local machine, cloud, or Kubernetes).
✅ Scalability – Supports distributed computing and cloud deployment.
✅ Collaboration – Share projects easily with team members.
1. Understanding Docker Concepts
Before diving into Docker for Data Science, let’s understand its key components:
| Component | Description |
|---|---|
| Docker Image | A template that contains everything needed to run an application. |
| Docker Container | A running instance of a Docker image. |
| Dockerfile | A script that defines how an image is built. |
| Docker Hub | A cloud repository to store and share images. |
| Volumes | Persistent storage for data inside containers. |
| Networks | Enables communication between containers. |
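Each of these objects can be listed with the Docker CLI, which is a quick way to see the concepts in action:
docker images        # local images
docker ps -a         # containers, running and stopped
docker volume ls     # volumes
docker network ls    # networks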
2. Installing Docker
Step 1: Install Docker on Your System
- Windows/Mac: Download and install Docker Desktop from Docker’s official website.
- Linux (Ubuntu Example):
sudo apt update
sudo apt install docker.io -y
sudo systemctl start docker
sudo systemctl enable docker
- Verify Installation:
docker --version
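To confirm the Docker daemon is working end to end, you can also run Docker's official test image:
docker run hello-world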
3. Creating a Docker Container for Data Science
Step 1: Pull a Pre-built Data Science Image
Instead of setting up everything manually, you can use a ready-made image such as jupyter/scipy-notebook:
docker pull jupyter/scipy-notebook
Step 2: Run a Jupyter Notebook Container
docker run -p 8888:8888 jupyter/scipy-notebook
- Open a browser and go to http://localhost:8888 (use the tokenized URL printed in the container's startup logs) to access Jupyter Notebook.
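If you also want your notebooks saved on the host instead of inside the container, you can bind-mount the current directory into the image's work folder. A minimal sketch, assuming the Jupyter Docker Stacks default notebook directory /home/jovyan/work:
docker run -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/scipy-notebook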
4. Building a Custom Docker Image for Data Science
Instead of using a pre-built image, let’s create a custom environment with Python, Jupyter, Pandas, NumPy, and Scikit-learn.
Step 1: Create a Dockerfile
A Dockerfile is a script that defines the environment inside the container.
Create a Dockerfile:
# Use an official Python base image
FROM python:3.9
# Set the working directory
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Expose Jupyter Notebook Port
EXPOSE 8888
# Run Jupyter Notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
Step 2: Create a requirements.txt File
For simplicity the packages are listed unpinned here; pinning exact versions (package==X.Y.Z) makes the build more reproducible.
numpy
pandas
matplotlib
scikit-learn
jupyter
Step 3: Build the Docker Image
Run the following command to build an image:
docker build -t my-data-science-env .
Step 4: Run a Container from the Image
docker run -p 8888:8888 my-data-science-env
Now, open http://localhost:8888 in your browser.
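If you prefer the notebook server running in the background, start the container detached and read the tokenized URL from its logs (the container name ds-env is just an example):
docker run -d --name ds-env -p 8888:8888 my-data-science-env
docker logs ds-env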
5. Managing Data in Docker Containers
By default, data written inside a container is lost when the container is removed. You can use volumes for persistent storage.
Using a Named Volume
docker run -v my_data:/data -p 8888:8888 my-data-science-env
- The -v my_data:/data option mounts a persistent named volume at /data inside the container.
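Named volumes are managed by Docker itself; the standard volume commands show what exists and where it is stored:
docker volume ls
docker volume inspect my_data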
Mounting a Local Directory
If you want to access local files from inside the container, bind-mount a host directory:
docker run -v $(pwd)/data:/app/data -p 8888:8888 my-data-science-env
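The $(pwd) substitution is specific to Unix-style shells; on Windows PowerShell the equivalent bind mount would look like this (a sketch using the same image and paths as above):
docker run -v ${PWD}/data:/app/data -p 8888:8888 my-data-science-env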
6. Running Machine Learning Models in Docker
Example: Running a Scikit-Learn Model inside Docker
Create a Python script train_model.py:
import numpy as np
from sklearn.linear_model import LinearRegression
# Generate sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Train model
model = LinearRegression()
model.fit(X, y)
# Print model coefficients
print(f"Model Coefficient: {model.coef_}, Intercept: {model.intercept_}")
Modify the Dockerfile to copy and run this script (only the last CMD in a Dockerfile takes effect, so this replaces the Jupyter command):
COPY train_model.py /app/
CMD ["python", "/app/train_model.py"]
Build and run the container:
docker build -t ml-training .
docker run ml-training
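The container's filesystem disappears when the container is removed, so if you extend train_model.py to serialize the trained model (for example with joblib), mount a host directory for the output; the /app/models path below is only an illustrative convention:
docker run -v $(pwd)/models:/app/models ml-training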
7. Deploying a Machine Learning API with Flask in Docker
To deploy an ML model as an API, we can use Flask. Add flask to requirements.txt so it is installed in the image.
Step 1: Create an app.py File
from flask import Flask, request, jsonify
import numpy as np
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

# Train a simple model
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X, y)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict(np.array(data['features']).reshape(-1, 1))
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Step 2: Modify the Dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
Step 3: Run the API
docker build -t flask-ml-api .
docker run -p 5000:5000 flask-ml-api
Test the API using Postman or curl; since the model learns y = 2x, a request with feature 6 should return a prediction close to 12:
curl -X POST "http://localhost:5000/predict" -H "Content-Type: application/json" -d '{"features": [6]}'
8. Deploying Docker Containers to the Cloud
You can deploy your Docker containers on:
- AWS (ECS, EKS, Lambda)
- Google Cloud Run
- Azure Container Instances
- Kubernetes for orchestration
Example: Deploying to AWS Elastic Container Service (ECS)
aws ecr create-repository --repository-name my-docker-repo
docker tag flask-ml-api:latest aws_account_id.dkr.ecr.us-east-1.amazonaws.com/my-docker-repo:latest
docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/my-docker-repo:latest
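Note that docker push only succeeds after Docker has authenticated with ECR; a typical login step, assuming the us-east-1 region and the aws_account_id placeholder used above, looks like this:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.us-east-1.amazonaws.com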