Building a Churn Prediction ML Pipeline with Kubeflow

This guide walks you through creating and deploying a machine learning pipeline for churn prediction using Kubeflow Pipelines. We will write our components in Python, package them in Docker containers, deploy them to a Kubernetes cluster, and run and monitor the pipeline using the command line.

Step 1: Define the Project Structure

project_root/
├── components/
│   ├── preprocess/
│   │   ├── preprocess.py
│   │   └── Dockerfile
│   ├── train/
│   │   ├── train.py
│   │   └── Dockerfile
│   └── evaluate/
│       ├── evaluate.py
│       └── Dockerfile
└── pipeline.py

Each component directory holds its own script and Dockerfile; pipeline.py at the root wires the components together.

Step 2: Write Component Scripts

preprocess.py

def preprocess():
    # Placeholder logic: replace with real data cleaning and feature engineering.
    print("Preprocessing data...")
    with open("/data/processed_data.txt", "w") as f:
        f.write("cleaned_data")

if __name__ == '__main__':
    preprocess()

train.py

def train():
    # Placeholder logic: replace with real model training.
    print("Training model...")
    with open("/model/model.txt", "w") as f:
        f.write("trained_model")

if __name__ == '__main__':
    train()

evaluate.py

def evaluate():
    # Placeholder logic: replace with real model evaluation.
    print("Evaluating model...")
    with open("/metrics/metrics.txt", "w") as f:
        f.write("accuracy: 95%")

if __name__ == '__main__':
    evaluate()

Step 3: Dockerize Each Component

# Dockerfile for preprocess component
FROM python:3.9-slim
WORKDIR /app
COPY preprocess.py .
RUN mkdir /data
ENTRYPOINT ["python", "preprocess.py"]
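
The other components follow the same pattern; for example, a Dockerfile for the train component (evaluate is analogous, creating /metrics instead):

# Dockerfile for train component
FROM python:3.9-slim
WORKDIR /app
COPY train.py .
RUN mkdir /model
ENTRYPOINT ["python", "train.py"]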

Build and push each image:

docker build -t gcr.io/YOUR_PROJECT/preprocess:latest .
docker push gcr.io/YOUR_PROJECT/preprocess:latest
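
Repeat for the other two images; from the project root, the build context is each component's directory:

docker build -t gcr.io/YOUR_PROJECT/train:latest components/train
docker push gcr.io/YOUR_PROJECT/train:latest
docker build -t gcr.io/YOUR_PROJECT/evaluate:latest components/evaluate
docker push gcr.io/YOUR_PROJECT/evaluate:latest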

Step 4: Define Your Pipeline in pipeline.py

This example uses the KFP v1 SDK's dsl.ContainerOp API. Note the @dsl.pipeline decorator, which gives the compiled pipeline its name and description.

import kfp
import kfp.dsl as dsl

@dsl.pipeline(
    name='Churn Prediction Pipeline',
    description='Preprocess data, train a churn model, and evaluate it.'
)
def churn_pipeline():
    preprocess = dsl.ContainerOp(
        name='Preprocess',
        image='gcr.io/YOUR_PROJECT/preprocess:latest',
        file_outputs={'processed_data': '/data/processed_data.txt'}
    )

    train = dsl.ContainerOp(
        name='Train',
        image='gcr.io/YOUR_PROJECT/train:latest',
        file_outputs={'model': '/model/model.txt'}
    ).after(preprocess)

    evaluate = dsl.ContainerOp(
        name='Evaluate',
        image='gcr.io/YOUR_PROJECT/evaluate:latest',
        file_outputs={'metrics': '/metrics/metrics.txt'}
    ).after(train)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(churn_pipeline, 'churn_pipeline.yaml')
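
As written, the steps are ordered only through .after(). The KFP v1 SDK also lets you pass a step's declared file output into a downstream step as an argument, which creates the dependency implicitly. A minimal sketch, assuming train.py were updated to accept an --input flag:

train = dsl.ContainerOp(
    name='Train',
    image='gcr.io/YOUR_PROJECT/train:latest',
    # preprocess.outputs['processed_data'] resolves to the contents of the
    # file declared in file_outputs and implies the Preprocess -> Train edge.
    arguments=['--input', preprocess.outputs['processed_data']],
    file_outputs={'model': '/model/model.txt'}
)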

Step 5: Deploy to Kubernetes Cluster

Make sure Kubeflow is installed and running on your Kubernetes cluster, and that the Docker images from Step 3 have been pushed to a container registry the cluster can pull from (e.g., Google Container Registry).
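
To sanity-check the installation before submitting anything, you can list the Kubeflow pods (this assumes Kubeflow was installed into the default kubeflow namespace):

kubectl get pods -n kubeflow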

Step 6: Upload and Trigger Pipeline via CLI

  1. Install the Kubeflow Pipelines SDK:

     pip install kfp

  2. Compile the pipeline to YAML:

     python pipeline.py

  3. Connect to your Kubeflow Pipelines endpoint from Python:

     import kfp
     client = kfp.Client(host='http://YOUR_KUBEFLOW_ENDPOINT')

  4. Create an experiment and run the pipeline:

     experiment = client.create_experiment(name='Churn-Prediction')

     run = client.run_pipeline(
         experiment_id=experiment.id,
         job_name='churn-pipeline-run',
         pipeline_package_path='churn_pipeline.yaml'
     )
     print(f"Pipeline run ID: {run.id}")

Step 7: Monitor and Track the Pipeline

  • Open your Kubeflow UI in the browser.
  • Navigate to the Pipelines section.
  • Click on the experiment Churn-Prediction and select the run named churn-pipeline-run.
  • Track the status of each component (Preprocess, Train, Evaluate).
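
If you prefer to stay on the command line, the kfp client from Step 6 can poll the run as well. A minimal sketch using the KFP v1 client (timeout is in seconds):

# Check the current status of the run.
status = client.get_run(run.id).run.status
print(f"Run status: {status}")

# Or block until the run reaches a terminal state.
result = client.wait_for_run_completion(run.id, timeout=3600)
print(f"Final status: {result.run.status}")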

Step 8: Check Output Artifacts and Results

  • From the UI, click each component step to view logs and output paths.
  • You can download files such as:
    • /data/processed_data.txt – cleaned data
    • /model/model.txt – serialized model
    • /metrics/metrics.txt – model evaluation result
  • These files are available in the output artifacts panel of each component run.

Next Steps

  • Automate triggering with Argo Events or scheduled recurring runs.
  • Persist outputs with PVC volumes (see the sketch below).
  • Add monitoring, dashboards, or notifications.
  • Extend pipeline with model deployment or explainability.
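
For the PVC item above, a minimal sketch using the KFP v1 volume helpers (the volume name, size, and mount path are illustrative):

import kfp.dsl as dsl

# Create a 1Gi ReadWriteOnce PVC and mount it at /data, so the
# preprocess output outlives the step's pod.
vop = dsl.VolumeOp(
    name='create-churn-volume',
    resource_name='churn-data-pvc',
    size='1Gi',
    modes=dsl.VOLUME_MODE_RWO
)

preprocess = dsl.ContainerOp(
    name='Preprocess',
    image='gcr.io/YOUR_PROJECT/preprocess:latest',
    file_outputs={'processed_data': '/data/processed_data.txt'}
).add_pvolumes({'/data': vop.volume})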
