Docker for Data Science Projects: A Comprehensive Guide
Introduction to Docker
Docker is an open-source containerization platform that packages applications together with their dependencies into lightweight, portable containers. For data science projects, Docker makes it easy to create reproducible, scalable, and shareable environments without worrying about software inconsistencies across systems.
Why Use Docker for Data Science?
✅ Environment Consistency – Ensures the same setup across machines.
✅ Dependency Management – Eliminates dependency conflicts.
✅ Portability – Can be deployed anywhere (local machine, cloud, or Kubernetes).
✅ Scalability – Supports distributed computing and cloud deployment.
✅ Collaboration – Share projects easily with team members.
1. Understanding Docker Concepts
Before diving into Docker for Data Science, let’s understand its key components:
| Component | Description |
|---|---|
| Docker Image | A template that contains everything needed to run an application. |
| Docker Container | A running instance of a Docker image. |
| Dockerfile | A script that defines how an image is built. |
| Docker Hub | A cloud repository to store and share images. |
| Volumes | Persistent storage for data inside containers. |
| Networks | Enables communication between containers. |
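Each of these objects can be listed with the Docker CLI, which is a quick way to see the concepts in action:
docker images        # local images
docker ps -a         # containers, running and stopped
docker volume ls     # volumes
docker network ls    # networks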
2. Installing Docker
Step 1: Install Docker on Your System
- Windows/Mac: Download and install Docker Desktop from Docker’s official website.
- Linux (Ubuntu Example):
sudo apt update
sudo apt install docker.io -y
sudo systemctl start docker
sudo systemctl enable docker
- Verify Installation:
docker --version
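To confirm the Docker daemon is working end to end, you can also run Docker's official test image:
docker run hello-world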
3. Creating a Docker Container for Data Science
Step 1: Pull a Pre-built Data Science Image
Instead of setting up everything manually, you can use a ready-made image such as jupyter/scipy-notebook:
docker pull jupyter/scipy-notebook
Step 2: Run a Jupyter Notebook Container
docker run -p 8888:8888 jupyter/scipy-notebook
- Open a browser and go to http://localhost:8888 (use the tokenized URL printed in the container's startup logs) to access Jupyter Notebook.
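If you also want your notebooks saved on the host instead of inside the container, you can bind-mount the current directory into the image's work folder. A minimal sketch, assuming the Jupyter Docker Stacks default notebook directory /home/jovyan/work:
docker run -p 8888:8888 -v "$(pwd)":/home/jovyan/work jupyter/scipy-notebook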
4. Building a Custom Docker Image for Data Science
Instead of using a pre-built image, let’s create a custom environment with Python, Jupyter, Pandas, NumPy, and Scikit-learn.
Step 1: Create a Dockerfile
A Dockerfile is a script that defines the environment inside the container.
Create a Dockerfile:
# Use an official Python base image
FROM python:3.9
# Set the working directory
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Expose Jupyter Notebook Port
EXPOSE 8888
# Run Jupyter Notebook
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
Step 2: Create a requirements.txt File
For simplicity the packages are listed unpinned here; pinning exact versions (package==X.Y.Z) makes the build more reproducible.
numpy
pandas
matplotlib
scikit-learn
jupyter
Step 3: Build the Docker Image
Run the following command to build an image:
docker build -t my-data-science-env .
Step 4: Run a Container from the Image
docker run -p 8888:8888 my-data-science-env
Now, open http://localhost:8888 in your browser.
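If you prefer the notebook server running in the background, start the container detached and read the tokenized URL from its logs (the container name ds-env is just an example):
docker run -d --name ds-env -p 8888:8888 my-data-science-env
docker logs ds-env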
5. Managing Data in Docker Containers
By default, data written inside a container is lost when the container is removed. You can use volumes for persistent storage.
Using a Named Volume
docker run -v my_data:/data -p 8888:8888 my-data-science-env
- The -v my_data:/data option mounts a persistent named volume at /data inside the container.
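Named volumes are managed by Docker itself; the standard volume commands show what exists and where it is stored:
docker volume ls
docker volume inspect my_data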
Mounting a Local Directory
If you want to access local files from inside the container, bind-mount a host directory:
docker run -v $(pwd)/data:/app/data -p 8888:8888 my-data-science-env
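The $(pwd) substitution is specific to Unix-style shells; on Windows PowerShell the equivalent bind mount would look like this (a sketch using the same image and paths as above):
docker run -v ${PWD}/data:/app/data -p 8888:8888 my-data-science-env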
6. Running Machine Learning Models in Docker
Example: Running a Scikit-Learn Model inside Docker
Create a Python script train_model.py:
import numpy as np
from sklearn.linear_model import LinearRegression
# Generate sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Train model
model = LinearRegression()
model.fit(X, y)
# Print model coefficients
print(f"Model Coefficient: {model.coef_}, Intercept: {model.intercept_}")
Modify the Dockerfile to copy and run this script (only the last CMD in a Dockerfile takes effect, so this replaces the Jupyter command):
COPY train_model.py /app/
CMD ["python", "/app/train_model.py"]
Build and run the container:
docker build -t ml-training .
docker run ml-training
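The container's filesystem disappears when the container is removed, so if you extend train_model.py to serialize the trained model (for example with joblib), mount a host directory for the output; the /app/models path below is only an illustrative convention:
docker run -v $(pwd)/models:/app/models ml-training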
7. Deploying a Machine Learning API with Flask in Docker
To deploy an ML model as an API, we can use Flask. Add flask to requirements.txt so it is installed in the image.
Step 1: Create an app.py File
from flask import Flask, request, jsonify
import numpy as np
from sklearn.linear_model import LinearRegression

app = Flask(__name__)

# Train a simple model
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
model = LinearRegression()
model.fit(X, y)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    prediction = model.predict(np.array(data['features']).reshape(-1, 1))
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Step 2: Modify the Dockerfile
FROM python:3.9
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
CMD ["python", "app.py"]
Step 3: Run the API
docker build -t flask-ml-api .
docker run -p 5000:5000 flask-ml-api
Test the API using Postman or curl; since the model learns y = 2x, a request with feature 6 should return a prediction close to 12:
curl -X POST "http://localhost:5000/predict" -H "Content-Type: application/json" -d '{"features": [6]}'
8. Deploying Docker Containers to the Cloud
You can deploy your Docker containers on:
- AWS (ECS, EKS, Lambda)
- Google Cloud Run
- Azure Container Instances
- Kubernetes for orchestration
Example: Deploying to AWS Elastic Container Service (ECS)
aws ecr create-repository --repository-name my-docker-repo
docker tag flask-ml-api:latest aws_account_id.dkr.ecr.us-east-1.amazonaws.com/my-docker-repo:latest
docker push aws_account_id.dkr.ecr.us-east-1.amazonaws.com/my-docker-repo:latest
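Note that docker push only succeeds after Docker has authenticated with ECR; a typical login step, assuming the us-east-1 region and the aws_account_id placeholder used above, looks like this:
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin aws_account_id.dkr.ecr.us-east-1.amazonaws.com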