Building a Churn Prediction ML Pipeline with Kubeflow
This guide walks you through creating and deploying a machine learning pipeline for churn prediction using Kubeflow Pipelines. We will write our components in Python, package them in Docker containers, deploy them to a Kubernetes cluster, and run and monitor the pipeline using the command line.
Step 1: Define the Project Structure
project_root/
├── components/
│   ├── preprocess/
│   │   ├── preprocess.py
│   │   └── Dockerfile
│   ├── train/
│   │   ├── train.py
│   │   └── Dockerfile
│   └── evaluate/
│       ├── evaluate.py
│       └── Dockerfile
└── pipeline.py
Step 2: Write Component Scripts
preprocess.py
def preprocess():
    print("Preprocessing data...")
    with open("/data/processed_data.txt", "w") as f:
        f.write("cleaned_data")

if __name__ == '__main__':
    preprocess()
train.py
def train():
    print("Training model...")
    with open("/model/model.txt", "w") as f:
        f.write("trained_model")

if __name__ == '__main__':
    train()
evaluate.py
def evaluate():
    print("Evaluating model...")
    with open("/metrics/metrics.txt", "w") as f:
        f.write("accuracy: 95%")

if __name__ == '__main__':
    evaluate()
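These scripts are deliberately minimal placeholders that just write marker files. In a real churn project, preprocess.py would load and clean customer data. The sketch below is one hedged example of what that might look like, assuming a CSV at /data/raw/customers.csv with a churn label column; the path and column names are placeholders, and if you change the output filename you would update file_outputs in pipeline.py to match.

# A more realistic preprocess step (sketch). Assumes a CSV of customer records
# at /data/raw/customers.csv with a "churn" label column; both are placeholders.
import pandas as pd

def preprocess():
    df = pd.read_csv("/data/raw/customers.csv")

    # Drop rows with missing values and obvious identifier columns.
    df = df.dropna()
    df = df.drop(columns=["customer_id"], errors="ignore")

    # One-hot encode categorical features, keeping the label column intact.
    label = df.pop("churn")
    features = pd.get_dummies(df)
    features["churn"] = label

    # Write the cleaned dataset where the downstream steps expect it.
    features.to_csv("/data/processed_data.csv", index=False)
    print(f"Wrote {len(features)} cleaned rows")

if __name__ == "__main__":
    preprocess()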
Step 3: Dockerize Each Component
# Dockerfile for preprocess component
FROM python:3.9-slim
WORKDIR /app
COPY preprocess.py .
RUN mkdir /data
ENTRYPOINT ["python", "preprocess.py"]
Build and push the image for each component (shown here for preprocess; repeat for train and evaluate):
docker build -t gcr.io/YOUR_PROJECT/preprocess:latest .
docker push gcr.io/YOUR_PROJECT/preprocess:latest
Step 4: Define Your Pipeline in pipeline.py
import kfp
import kfp.dsl as dsl

# Note: dsl.ContainerOp is part of the KFP v1 SDK (kfp < 2.0).
@dsl.pipeline(
    name='Churn Prediction Pipeline',
    description='Preprocess data, train a churn model, and evaluate it.'
)
def churn_pipeline():
    preprocess = dsl.ContainerOp(
        name='Preprocess',
        image='gcr.io/YOUR_PROJECT/preprocess:latest',
        file_outputs={'processed_data': '/data/processed_data.txt'}
    )

    train = dsl.ContainerOp(
        name='Train',
        image='gcr.io/YOUR_PROJECT/train:latest',
        file_outputs={'model': '/model/model.txt'}
    ).after(preprocess)

    evaluate = dsl.ContainerOp(
        name='Evaluate',
        image='gcr.io/YOUR_PROJECT/evaluate:latest',
        file_outputs={'metrics': '/metrics/metrics.txt'}
    ).after(train)

if __name__ == '__main__':
    kfp.compiler.Compiler().compile(churn_pipeline, 'churn_pipeline.yaml')
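Note that the pipeline above only orders the steps with .after(); each container writes to its own local filesystem, so the train step never actually receives the preprocess output. One hedged way to wire data between steps with the KFP v1 SDK is sketched below; the --data flag is an assumption about how train.py would be extended, not part of the original scripts.

# Hedged sketch (KFP v1 SDK): wire the preprocess output into the train step
# by passing it as a command-line argument. Assumes train.py parses a --data flag.
import kfp.dsl as dsl

@dsl.pipeline(name='Churn Prediction Pipeline (wired)')
def churn_pipeline_wired():
    preprocess = dsl.ContainerOp(
        name='Preprocess',
        image='gcr.io/YOUR_PROJECT/preprocess:latest',
        file_outputs={'processed_data': '/data/processed_data.txt'}
    )
    # Referencing preprocess.outputs creates the dependency, so .after() is not
    # needed here. file_outputs passes the file *contents* as a small string
    # parameter; larger datasets usually go through a shared volume or object
    # storage instead.
    train = dsl.ContainerOp(
        name='Train',
        image='gcr.io/YOUR_PROJECT/train:latest',
        arguments=['--data', preprocess.outputs['processed_data']],
        file_outputs={'model': '/model/model.txt'}
    )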
Step 5: Deploy to Kubernetes Cluster
Make sure you have a running Kubernetes cluster with Kubeflow Pipelines installed, and that your Docker images have been pushed to a container registry the cluster can pull from (e.g., Google Container Registry).
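Before uploading anything, it can help to confirm that the Kubeflow Pipelines API is reachable from your machine. A minimal sketch, assuming the endpoint is exposed to you (for example through an ingress or kubectl port-forward); the host value is a placeholder:

# Quick connectivity check against the Kubeflow Pipelines API.
import kfp

client = kfp.Client(host='http://YOUR_KUBEFLOW_ENDPOINT')
print(client.list_experiments())  # returns without raising if the API is reachable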
Step 6: Upload and Trigger Pipeline via CLI
- Install the Kubeflow Pipelines SDK:

pip install kfp

- Compile the pipeline to YAML:

python pipeline.py

- Connect to your Kubeflow Pipelines endpoint:

import kfp

client = kfp.Client(host='http://YOUR_KUBEFLOW_ENDPOINT')

- Create an experiment and run the pipeline:

experiment = client.create_experiment(name='Churn-Prediction')
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name='churn-pipeline-run',
    pipeline_package_path='churn_pipeline.yaml'
)
print(f"Pipeline run ID: {run.id}")
Step 7: Monitor and Track the Pipeline
- Open your Kubeflow UI in the browser.
- Navigate to the Pipelines section.
- Click on the Churn-Prediction experiment and select the run named churn-pipeline-run.
- Track the status of each component (Preprocess, Train, Evaluate).
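If you prefer to check on a run from the command line instead of the UI, the run ID printed in Step 6 can be queried with the same client. A minimal sketch, assuming the same endpoint as before:

# Check the current status of a run by its ID (e.g. from a separate session).
import kfp

client = kfp.Client(host='http://YOUR_KUBEFLOW_ENDPOINT')
run_detail = client.get_run('RUN_ID_FROM_STEP_6')
print(run_detail.run.status)  # 'Running', 'Succeeded', 'Failed', ...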
Step 8: Check Output Artifacts and Results
- From the UI, click each component step to view logs and output paths.
- You can download files such as:
  - /data/processed_data.txt – cleaned data
  - /model/model.txt – serialized model
  - /metrics/metrics.txt – model evaluation result
- These files are available in the output artifacts panel of each component run.
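The accuracy value above is only a raw text artifact. If you want it to appear as a proper metric in the Kubeflow Pipelines UI, KFP v1 recognizes a JSON file written to /mlpipeline-metrics.json. A hedged sketch of how evaluate.py could emit it (the 0.95 value mirrors the placeholder output; depending on your KFP version you may also need to declare the file as an output artifact of the step):

# Write evaluation metrics in the JSON format the Kubeflow Pipelines UI understands.
import json

metrics = {
    "metrics": [
        {"name": "accuracy", "numberValue": 0.95, "format": "PERCENTAGE"}
    ]
}
with open("/mlpipeline-metrics.json", "w") as f:
    json.dump(metrics, f)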
Next Steps
- Automate triggering with Argo Events or scheduled (recurring) runs (see the sketch below).
- Persist outputs with persistent volume claims (PVCs).
- Add monitoring, dashboards, or notifications.
- Extend the pipeline with model deployment or explainability steps.
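For the scheduled runs mentioned in the first item, the KFP v1 client can create a recurring run directly from the compiled YAML. A minimal sketch; Kubeflow's cron expressions use six fields (seconds first), and the schedule and job name here are placeholders:

# Schedule the compiled pipeline to run every day at 02:00 (cluster time).
import kfp

client = kfp.Client(host='http://YOUR_KUBEFLOW_ENDPOINT')
experiment = client.create_experiment(name='Churn-Prediction')
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name='churn-pipeline-nightly',
    cron_expression='0 0 2 * * *',
    pipeline_package_path='churn_pipeline.yaml'
)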