Building AutoML Pipelines: Cloud and Local Solutions

Automated Machine Learning (AutoML) streamlines model development by automating tasks like data preprocessing, feature engineering, model selection, and hyperparameter tuning. This tutorial covers two approaches: a cloud-based pipeline using Google Cloud AutoML and a local pipeline using AutoKeras. We’ll also clarify why Google Cloud AutoML cannot be run entirely locally and how AutoKeras fills that gap.

Why Google Cloud AutoML Requires the Cloud

Google Cloud AutoML is a powerful managed service but relies on cloud infrastructure:

  • Architecture: It uses Google Cloud Storage for data and cloud servers for training and model management.
  • APIs: The AutoML client communicates with Google’s servers, not local resources.
  • Limitations: While you can preprocess data or run predictions locally, training and model management require a Google Cloud Platform (GCP) account and internet connection.

For local AutoML, AutoKeras (built on TensorFlow and Keras) runs entirely on your machine, making it ideal for prototyping or cost-free experimentation.

Option 1: Cloud-Based AutoML with Google Cloud AutoML

This section builds an end-to-end pipeline with Google Cloud AutoML for a tabular dataset (e.g., customer churn prediction).

Prerequisites

  • Basic knowledge of Python and machine learning concepts.
  • A Google Cloud Platform (GCP) account.
  • Installed tools: Python 3.7+, Google Cloud SDK, and Python libraries (google-cloud-automl, pandas).

Step 1: Set Up Your Environment

  1. Create a Google Cloud Project:
    • Go to the Google Cloud Console.
    • Create a project (e.g., automl-pipeline-demo).
    • Enable the Cloud AutoML API and Cloud Storage API.
  2. Install Google Cloud SDK: Download and install the SDK for your operating system, then run gcloud init to authenticate and select your project.
  3. Install Python Libraries:
    pip install google-cloud-automl pandas
  4. Set Up Authentication (a quick verification sketch follows this list):
    • Generate a service account key (IAM & Admin > Service Accounts).
    • Download the JSON key and set the environment variable:
      export GOOGLE_APPLICATION_CREDENTIALS="path/to/your-service-account-key.json"
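
Before moving on, it helps to confirm that the client library can actually reach the service. Below is a minimal check, assuming GOOGLE_APPLICATION_CREDENTIALS is set as above; the project ID and region are placeholders.

# AutoML Tables is exposed through the v1beta1 API surface of the client library.
from google.cloud import automl_v1beta1 as automl

# Placeholders; replace with your own project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

client = automl.AutoMlClient()
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

# Listing datasets is a cheap call; if it succeeds, credentials and API access are working.
for dataset in client.list_datasets(parent=parent):
    print(dataset.display_name)
print("AutoML client is authenticated and ready.")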

Step 2: Prepare Your Dataset

  1. Choose a Dataset:
    • Example: A customer churn dataset with columns age, tenure, monthly_charges, and churn (target).
    • Store it as churn_data.csv.
  2. Upload to Google Cloud Storage:
    • Create a bucket:
      gsutil mb gs://your-bucket-name
    • Upload the dataset:
      gsutil cp churn_data.csv gs://your-bucket-name/churn_data.csv
  3. Data Formatting (a quick validation sketch follows this list):
    • Ensure the CSV has a header row.
    • The target column (churn) should contain categorical values.
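
A quick pandas check before uploading can catch header or label problems that would otherwise only surface as import errors later. A minimal sketch using the column names from this tutorial:

import pandas as pd

# Local sanity check; assumes churn_data.csv sits in the working directory.
df = pd.read_csv("churn_data.csv")

# Confirm the header matches what the pipeline expects.
expected_columns = {"age", "tenure", "monthly_charges", "churn"}
missing = expected_columns - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Report missing values and the label distribution of the target column.
print(df.isnull().sum())
print(df["churn"].value_counts())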

Step 3: Build the AutoML Pipeline

Create a Python script to automate data ingestion, model training, and evaluation.

# AutoML Tables is exposed through the v1beta1 API surface.
from google.cloud import automl_v1beta1 as automl

# Project and dataset settings
PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-bucket-name"
DATASET_NAME = "churn_prediction"
MODEL_NAME = "churn_model"
REGION = "us-central1"

# Initialize AutoML client
client = automl.AutoMlClient()

# Step 1: Create a dataset
def create_dataset():
    dataset_display_name = DATASET_NAME
    dataset = {
        "display_name": dataset_display_name,
        "tables_dataset_metadata": {}
    }
    project_location = f"projects/{PROJECT_ID}/locations/{REGION}"
    response = client.create_dataset(parent=project_location, dataset=dataset)
    print(f"Dataset created: {response.name}")
    return response.name

# Step 2: Import data
def import_data(dataset_name, gcs_uri):
    input_config = {
        "gcs_source": {"input_uris": [gcs_uri]}
    }
    response = client.import_data(name=dataset_name, input_config=input_config)
    print("Data import started...")
    response.result()
    print("Data imported successfully")

# Step 3: Train the model
def train_model(dataset_name):
    model_display_name = MODEL_NAME
    model = {
        "display_name": model_display_name,
        "dataset_id": dataset_name.split("/")[-1],
        "tables_model_metadata": {
            "target_column_spec_name": "",
            "train_budget_milli_node_hours": 1000
        }
    }
    response = client.create_model(parent=f"projects/{PROJECT_ID}/locations/{REGION}", model=model)
    print("Training started...")
    response.result()
    print(f"Model trained: {response.name}")
    return response.name

# Step 4: Evaluate the model
def evaluate_model(model_name):
    model = client.get_model(name=model_name)
    print(f"Model: {model.display_name}")
    print(f"Training time: {model.create_time}")
    evaluations = client.list_model_evaluations(parent=model_name)
    for model_evaluation in evaluations:
        print(f"Evaluation metrics: {model_evaluation.metrics}")

# Main pipeline
def main():
    dataset_name = create_dataset()
    gcs_uri = f"gs://{BUCKET_NAME}/churn_data.csv"
    import_data(dataset_name, gcs_uri)
    model_name = train_model(dataset_name)
    evaluate_model(model_name)

if __name__ == "__main__":
    main()

Step 4: Run the Pipeline

  1. Update the Script: Replace your-project-id and your-bucket-name.
  2. Execute:
    python automl_pipeline.py
  3. Monitor: Training may take 1-2 hours based on dataset size and budget.

Step 5: Deploy and Predict

  1. Deploy the Model:
    • In the Google Cloud Console, go to AutoML Tables > Models.
    • Select churn_model and click Deploy (deployment typically takes 10-20 minutes). A programmatic alternative is sketched after this list.
  2. Make Predictions:
    from google.cloud import automl_v1beta1 as automl
    
    def predict(model_name, input_data):
        prediction_client = automl.PredictionServiceClient()
        payload = {
            "row": {
                "values": [
                    {"number_value": input_data["age"]},
                    {"number_value": input_data["tenure"]},
                    {"number_value": input_data["monthly_charges"]}
                ]
            }
        }
        response = prediction_client.predict(name=model_name, payload=payload)
        for result in response.payload:
            print(f"Predicted class: {result.tables.value.string_value}")
            print(f"Confidence: {result.tables.score}")
    
    # Example usage
    model_name = f"projects/your-project-id/locations/us-central1/models/your-model-id"
    input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
    predict(model_name, input_data)
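
If you prefer to skip the console, deployment can also be triggered from Python with the same client library. A minimal sketch; the model resource name below is a placeholder:

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Placeholder; use the model resource name returned by train_model() earlier.
model_name = "projects/your-project-id/locations/us-central1/models/your-model-id"

# deploy_model returns a long-running operation; result() blocks until deployment finishes.
operation = client.deploy_model(name=model_name)
operation.result()
print("Model deployed and ready to serve online predictions.")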
    

Step 6: Best Practices for Google Cloud AutoML

  • Data Quality: Handle missing values and outliers before importing your data.
  • Feature Engineering: AutoML handles basics, but manual features can boost performance.
  • Hyperparameter Tuning: Adjust train_budget_milli_node_hours.
  • Monitoring: Use Google Cloud Monitoring for performance and costs.
  • Version Control: Store scripts and datasets in a repository (e.g., GitHub).

Option 2: Local AutoML with AutoKeras

For local execution, AutoKeras automates model architecture search and hyperparameter tuning on your machine.

Step 1: Set Up Your Environment

  1. Install Dependencies:
    pip install autokeras pandas tensorflow scikit-learn
  2. Hardware: A reasonably fast CPU or a GPU and at least 8 GB of RAM; the snippet below checks whether TensorFlow can see a GPU.
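
A quick way to confirm the environment before training; if no GPU is listed, AutoKeras simply trains on the CPU:

import tensorflow as tf

# Report the TensorFlow version and any visible GPUs.
print("TensorFlow version:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
print("GPUs detected:", gpus if gpus else "none (training will run on the CPU)")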

Step 2: Prepare Your Dataset

  1. Create or Download Dataset: Use churn_data.csv with columns age, tenure, monthly_charges, and churn, or generate a synthetic file with the sketch after this list.
  2. Example CSV:
    age,tenure,monthly_charges,churn
    30,12,50.0,No
    45,24,80.0,Yes
    ...
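
If you don't have a churn dataset at hand, the sketch below writes a small synthetic churn_data.csv just for exercising the pipeline; the values are randomly generated and purely illustrative.

import numpy as np
import pandas as pd

# Generate a small synthetic churn dataset purely for testing the pipeline.
rng = np.random.default_rng(42)
n = 1000

df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "tenure": rng.integers(1, 72, size=n),
    "monthly_charges": rng.uniform(20.0, 120.0, size=n).round(2),
})
# A simple rule plus noise so the classifier has a signal to learn.
churn_prob = 0.7 * (df["monthly_charges"] > 80) + 0.2 * (df["tenure"] < 12)
df["churn"] = np.where(rng.uniform(size=n) < churn_prob, "Yes", "No")

df.to_csv("churn_data.csv", index=False)
print(df.head())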

Step 3: Build the AutoML Pipeline

import pandas as pd
import autokeras as ak
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def load_data(file_path):
    df = pd.read_csv(file_path)
    df['churn'] = df['churn'].map({'Yes': 1, 'No': 0})
    X = df[['age', 'tenure', 'monthly_charges']]
    y = df['churn']
    return X, y

def split_data(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

def train_automl_model(X_train, y_train, max_trials=10, epochs=20):
    clf = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2)
    return clf

def evaluate_model(clf, X_test, y_test):
    y_pred = clf.predict(X_test).flatten().astype(int)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    return y_pred

def save_model(clf, model_path="automl_model"):
    model = clf.export_model()
    model.save(model_path)
    print(f"Model saved to {model_path}")

def main():
    file_path = "churn_data.csv"
    X, y = load_data(file_path)
    X_train, X_test, y_train, y_test = split_data(X, y)
    clf = train_automl_model(X_train, y_train, max_trials=10, epochs=20)
    evaluate_model(clf, X_test, y_test)
    save_model(clf, "churn_automl_model")

if __name__ == "__main__":
    main()

Step 4: Run the Pipeline

  1. Ensure Dataset: Place churn_data.csv in the script’s directory.
  2. Execute:
    python autokeras_pipeline.py
  3. Output: Shows accuracy and classification report; saves the model.

Step 5: Make Predictions Locally

import autokeras as ak
import pandas as pd
import tensorflow as tf

def load_model(model_path="churn_automl_model"):
    # AutoKeras models include custom layers, so pass ak.CUSTOM_OBJECTS when loading.
    model = tf.keras.models.load_model(model_path, custom_objects=ak.CUSTOM_OBJECTS)
    return model

def predict(model, input_data):
    input_df = pd.DataFrame([input_data], columns=['age', 'tenure', 'monthly_charges'])
    # Exported AutoKeras structured-data models typically expect string inputs;
    # if your version exports numeric inputs, drop the astype(str) cast.
    prediction = model.predict(input_df.to_numpy().astype(str))
    return "Yes" if prediction[0][0] > 0.5 else "No"

model = load_model("churn_automl_model")
input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
prediction = predict(model, input_data)
print(f"Predicted churn: {prediction}")

Run:

python predict_autokeras.py

Step 6: Best Practices for AutoKeras

  • Data Preprocessing: Handle missing values and categorical encoding.
  • Hyperparameter Tuning: Adjust max_trials and epochs; Keras callbacks such as early stopping can also be passed to fit() (sketched after this list).
  • Hardware: Use a GPU for faster training.
  • Model Export: Save in TensorFlow SavedModel format.
  • Monitoring: Track resource usage to avoid crashes.
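
For the hyperparameter tuning point above, one practical refinement is to stop unpromising trials early. AutoKeras accepts standard Keras callbacks in fit(); below is a sketch of a drop-in replacement for train_automl_model from the pipeline above, with a larger search budget:

import autokeras as ak
import tensorflow as tf

def train_automl_model(X_train, y_train, max_trials=20, epochs=50):
    # EarlyStopping halts a trial once validation loss stops improving;
    # AutoKeras forwards the callbacks argument to the underlying Keras training loop.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
    clf = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2, callbacks=[early_stop])
    return clf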

Comparison: Google Cloud AutoML vs. AutoKeras

Feature        | Google Cloud AutoML            | AutoKeras (Local)
Execution      | Cloud-based                    | Local machine
Cost           | Pay-per-use (GCP charges)      | Free (open-source)
Scalability    | High (cloud infrastructure)    | Limited by local hardware
Ease of Use    | Beginner-friendly, managed     | Requires some ML knowledge
Customization  | Limited to Google’s framework  | Highly customizable
Use Case       | Large datasets, production     | Prototyping, small datasets

Conclusion

Google Cloud AutoML offers a robust cloud-based solution for large-scale projects, while AutoKeras enables local AutoML pipelines for cost-free prototyping. This tutorial provides complete pipelines for both, ensuring you can choose the right tool for your needs. Explore other local tools like H2O AutoML or TPOT for more options, and share your AutoML journey in the comments!
