Building AutoML Pipelines: Cloud and Local Solutions

Automated Machine Learning (AutoML) streamlines model development by automating tasks like data preprocessing, feature engineering, model selection, and hyperparameter tuning. This tutorial covers two approaches: a cloud-based pipeline using Google Cloud AutoML and a local pipeline using AutoKeras. We’ll also clarify why Google Cloud AutoML cannot be run entirely locally and how AutoKeras fills that gap.

Why Google Cloud AutoML Requires the Cloud

Google Cloud AutoML is a powerful managed service but relies on cloud infrastructure:

  • Architecture: It uses Google Cloud Storage for data and cloud servers for training and model management.
  • APIs: The AutoML client communicates with Google’s servers, not local resources.
  • Limitations: While you can preprocess data or run predictions locally, training and model management require a Google Cloud Platform (GCP) account and internet connection.

For local AutoML, AutoKeras (built on TensorFlow and Keras) runs entirely on your machine, making it ideal for prototyping or cost-free experimentation.

Option 1: Cloud-Based AutoML with Google Cloud AutoML

This section builds an end-to-end pipeline with Google Cloud AutoML for a tabular dataset (e.g., customer churn prediction).

Prerequisites

  • Basic knowledge of Python and machine learning concepts.
  • A Google Cloud Platform (GCP) account.
  • Installed tools: Python 3.7+, Google Cloud SDK, and Python libraries (google-cloud-automl, pandas).

Step 1: Set Up Your Environment

  1. Create a Google Cloud Project:
    • Go to the Google Cloud Console.
    • Create a project (e.g., automl-pipeline-demo).
    • Enable the Cloud AutoML API and Cloud Storage API.
  2. Install Google Cloud SDK: Download and install the SDK for your operating system, then run gcloud init to authenticate and select your project.
  3. Install Python Libraries:
    pip install google-cloud-automl pandas
  4. Set Up Authentication (a quick verification sketch follows this list):
    • Generate a service account key (IAM & Admin > Service Accounts).
    • Download the JSON key and set the environment variable:
      export GOOGLE_APPLICATION_CREDENTIALS="path/to/your-service-account-key.json"
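
Before moving on, it helps to confirm that the client library can actually reach the service. Below is a minimal check, assuming GOOGLE_APPLICATION_CREDENTIALS is set as above; the project ID and region are placeholders.

# AutoML Tables is exposed through the v1beta1 API surface of the client library.
from google.cloud import automl_v1beta1 as automl

# Placeholders; replace with your own project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

client = automl.AutoMlClient()
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

# Listing datasets is a cheap call; if it succeeds, credentials and API access are working.
for dataset in client.list_datasets(parent=parent):
    print(dataset.display_name)
print("AutoML client is authenticated and ready.")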

Step 2: Prepare Your Dataset

  1. Choose a Dataset:
    • Example: A customer churn dataset with columns age, tenure, monthly_charges, and churn (target).
    • Store it as churn_data.csv.
  2. Upload to Google Cloud Storage:
    • Create a bucket:
      gsutil mb gs://your-bucket-name
    • Upload the dataset:
      gsutil cp churn_data.csv gs://your-bucket-name/churn_data.csv
  3. Data Formatting (a quick validation sketch follows this list):
    • Ensure the CSV has a header row.
    • The target column (churn) should contain categorical values.
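
A quick pandas check before uploading can catch header or label problems that would otherwise only surface as import errors later. A minimal sketch using the column names from this tutorial:

import pandas as pd

# Local sanity check; assumes churn_data.csv sits in the working directory.
df = pd.read_csv("churn_data.csv")

# Confirm the header matches what the pipeline expects.
expected_columns = {"age", "tenure", "monthly_charges", "churn"}
missing = expected_columns - set(df.columns)
if missing:
    raise ValueError(f"Missing columns: {missing}")

# Report missing values and the label distribution of the target column.
print(df.isnull().sum())
print(df["churn"].value_counts())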

Step 3: Build the AutoML Pipeline

Create a Python script to automate data ingestion, model training, and evaluation.

# AutoML Tables is exposed through the v1beta1 API surface.
from google.cloud import automl_v1beta1 as automl

# Project and dataset settings
PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-bucket-name"
DATASET_NAME = "churn_prediction"
MODEL_NAME = "churn_model"
REGION = "us-central1"

# Initialize AutoML client
client = automl.AutoMlClient()

# Step 1: Create a dataset
def create_dataset():
    dataset_display_name = DATASET_NAME
    dataset = {
        "display_name": dataset_display_name,
        "tables_dataset_metadata": {}
    }
    project_location = f"projects/{PROJECT_ID}/locations/{REGION}"
    response = client.create_dataset(parent=project_location, dataset=dataset)
    print(f"Dataset created: {response.name}")
    return response.name

# Step 2: Import data
def import_data(dataset_name, gcs_uri):
    input_config = {
        "gcs_source": {"input_uris": [gcs_uri]}
    }
    response = client.import_data(name=dataset_name, input_config=input_config)
    print("Data import started...")
    response.result()
    print("Data imported successfully")

# Step 3: Train the model
def train_model(dataset_name):
    model_display_name = MODEL_NAME
    model = {
        "display_name": model_display_name,
        "dataset_id": dataset_name.split("/")[-1],
        "tables_model_metadata": {
            "target_column_spec_name": "",
            "train_budget_milli_node_hours": 1000
        }
    }
    response = client.create_model(parent=f"projects/{PROJECT_ID}/locations/{REGION}", model=model)
    print("Training started...")
    response.result()
    print(f"Model trained: {response.name}")
    return response.name

# Step 4: Evaluate the model
def evaluate_model(model_name):
    model = client.get_model(name=model_name)
    print(f"Model: {model.display_name}")
    print(f"Training time: {model.create_time}")
    evaluations = client.list_model_evaluations(parent=model_name)
    for model_evaluation in evaluations:
        print(f"Evaluation metrics: {model_evaluation.metrics}")

# Main pipeline
def main():
    dataset_name = create_dataset()
    gcs_uri = f"gs://{BUCKET_NAME}/churn_data.csv"
    import_data(dataset_name, gcs_uri)
    model_name = train_model(dataset_name)
    evaluate_model(model_name)

if __name__ == "__main__":
    main()

Step 4: Run the Pipeline

  1. Update the Script: Replace your-project-id and your-bucket-name.
  2. Execute:
    python automl_pipeline.py
  3. Monitor: Training may take 1-2 hours based on dataset size and budget.

Step 5: Deploy and Predict

  1. Deploy the Model:
    • In the Google Cloud Console, go to AutoML Tables > Models.
    • Select churn_model and click Deploy (deployment typically takes 10-20 minutes). A programmatic alternative is sketched after this list.
  2. Make Predictions:
    from google.cloud import automl_v1beta1 as automl
    
    def predict(model_name, input_data):
        prediction_client = automl.PredictionServiceClient()
        payload = {
            "row": {
                "values": [
                    {"number_value": input_data["age"]},
                    {"number_value": input_data["tenure"]},
                    {"number_value": input_data["monthly_charges"]}
                ]
            }
        }
        response = prediction_client.predict(name=model_name, payload=payload)
        for result in response.payload:
            print(f"Predicted class: {result.tables.value.string_value}")
            print(f"Confidence: {result.tables.score}")
    
    # Example usage
    model_name = f"projects/your-project-id/locations/us-central1/models/your-model-id"
    input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
    predict(model_name, input_data)
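
If you prefer to skip the console, deployment can also be triggered from Python with the same client library. A minimal sketch; the model resource name below is a placeholder:

from google.cloud import automl_v1beta1 as automl

client = automl.AutoMlClient()

# Placeholder; use the model resource name returned by train_model() earlier.
model_name = "projects/your-project-id/locations/us-central1/models/your-model-id"

# deploy_model returns a long-running operation; result() blocks until deployment finishes.
operation = client.deploy_model(name=model_name)
operation.result()
print("Model deployed and ready to serve online predictions.")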
    

Step 6: Best Practices for Google Cloud AutoML

  • Data Quality: Handle missing values and outliers before importing your data.
  • Feature Engineering: AutoML handles basics, but manual features can boost performance.
  • Hyperparameter Tuning: Adjust train_budget_milli_node_hours.
  • Monitoring: Use Google Cloud Monitoring for performance and costs.
  • Version Control: Store scripts and datasets in a repository (e.g., GitHub).

Option 2: Local AutoML with AutoKeras

For local execution, AutoKeras automates model architecture search and hyperparameter tuning on your machine.

Step 1: Set Up Your Environment

  1. Install Dependencies:
    pip install autokeras pandas tensorflow scikit-learn
  2. Hardware: A reasonably fast CPU or a GPU and at least 8 GB of RAM; the snippet below checks whether TensorFlow can see a GPU.
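
A quick way to confirm the environment before training; if no GPU is listed, AutoKeras simply trains on the CPU:

import tensorflow as tf

# Report the TensorFlow version and any visible GPUs.
print("TensorFlow version:", tf.__version__)
gpus = tf.config.list_physical_devices("GPU")
print("GPUs detected:", gpus if gpus else "none (training will run on the CPU)")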

Step 2: Prepare Your Dataset

  1. Create or Download Dataset: Use churn_data.csv with columns age, tenure, monthly_charges, and churn, or generate a synthetic file with the sketch after this list.
  2. Example CSV:
    age,tenure,monthly_charges,churn
    30,12,50.0,No
    45,24,80.0,Yes
    ...
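
If you don't have a churn dataset at hand, the sketch below writes a small synthetic churn_data.csv just for exercising the pipeline; the values are randomly generated and purely illustrative.

import numpy as np
import pandas as pd

# Generate a small synthetic churn dataset purely for testing the pipeline.
rng = np.random.default_rng(42)
n = 1000

df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "tenure": rng.integers(1, 72, size=n),
    "monthly_charges": rng.uniform(20.0, 120.0, size=n).round(2),
})
# A simple rule plus noise so the classifier has a signal to learn.
churn_prob = 0.7 * (df["monthly_charges"] > 80) + 0.2 * (df["tenure"] < 12)
df["churn"] = np.where(rng.uniform(size=n) < churn_prob, "Yes", "No")

df.to_csv("churn_data.csv", index=False)
print(df.head())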

Step 3: Build the AutoML Pipeline

import pandas as pd
import autokeras as ak
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def load_data(file_path):
    df = pd.read_csv(file_path)
    df['churn'] = df['churn'].map({'Yes': 1, 'No': 0})
    X = df[['age', 'tenure', 'monthly_charges']]
    y = df['churn']
    return X, y

def split_data(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    return X_train, X_test, y_train, y_test

def train_automl_model(X_train, y_train, max_trials=10, epochs=20):
    clf = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2)
    return clf

def evaluate_model(clf, X_test, y_test):
    y_pred = clf.predict(X_test).flatten().astype(int)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    return y_pred

def save_model(clf, model_path="automl_model"):
    model = clf.export_model()
    model.save(model_path)
    print(f"Model saved to {model_path}")

def main():
    file_path = "churn_data.csv"
    X, y = load_data(file_path)
    X_train, X_test, y_train, y_test = split_data(X, y)
    clf = train_automl_model(X_train, y_train, max_trials=10, epochs=20)
    evaluate_model(clf, X_test, y_test)
    save_model(clf, "churn_automl_model")

if __name__ == "__main__":
    main()

Step 4: Run the Pipeline

  1. Ensure Dataset: Place churn_data.csv in the script’s directory.
  2. Execute:
    python autokeras_pipeline.py
  3. Output: Shows accuracy and classification report; saves the model.

Step 5: Make Predictions Locally

import autokeras as ak
import pandas as pd
import tensorflow as tf

def load_model(model_path="churn_automl_model"):
    # AutoKeras models include custom layers, so pass ak.CUSTOM_OBJECTS when loading.
    model = tf.keras.models.load_model(model_path, custom_objects=ak.CUSTOM_OBJECTS)
    return model

def predict(model, input_data):
    input_df = pd.DataFrame([input_data], columns=['age', 'tenure', 'monthly_charges'])
    # Exported AutoKeras structured-data models typically expect string inputs;
    # if your version exports numeric inputs, drop the astype(str) cast.
    prediction = model.predict(input_df.to_numpy().astype(str))
    return "Yes" if prediction[0][0] > 0.5 else "No"

model = load_model("churn_automl_model")
input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
prediction = predict(model, input_data)
print(f"Predicted churn: {prediction}")

Run:

python predict_autokeras.py

Step 6: Best Practices for AutoKeras

  • Data Preprocessing: Handle missing values and categorical encoding.
  • Hyperparameter Tuning: Adjust max_trials and epochs; Keras callbacks such as early stopping can also be passed to fit() (sketched after this list).
  • Hardware: Use a GPU for faster training.
  • Model Export: Save in TensorFlow SavedModel format.
  • Monitoring: Track resource usage to avoid crashes.
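
For the hyperparameter tuning point above, one practical refinement is to stop unpromising trials early. AutoKeras accepts standard Keras callbacks in fit(); below is a sketch of a drop-in replacement for train_automl_model from the pipeline above, with a larger search budget:

import autokeras as ak
import tensorflow as tf

def train_automl_model(X_train, y_train, max_trials=20, epochs=50):
    # EarlyStopping halts a trial once validation loss stops improving;
    # AutoKeras forwards the callbacks argument to the underlying Keras training loop.
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)
    clf = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2, callbacks=[early_stop])
    return clf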

Comparison: Google Cloud AutoML vs. AutoKeras

Feature        | Google Cloud AutoML            | AutoKeras (Local)
Execution      | Cloud-based                    | Local machine
Cost           | Pay-per-use (GCP charges)      | Free (open-source)
Scalability    | High (cloud infrastructure)    | Limited by local hardware
Ease of Use    | Beginner-friendly, managed     | Requires some ML knowledge
Customization  | Limited to Google’s framework  | Highly customizable
Use Case       | Large datasets, production     | Prototyping, small datasets

Conclusion

Google Cloud AutoML offers a robust cloud-based solution for large-scale projects, while AutoKeras enables local AutoML pipelines for cost-free prototyping. This tutorial provides complete pipelines for both, ensuring you can choose the right tool for your needs. Explore other local tools like H2O AutoML or TPOT for more options, and share your AutoML journey in the comments!
