Building AutoML Pipelines: Cloud and Local Solutions
Automated Machine Learning (AutoML) streamlines model development by automating tasks like data preprocessing, feature engineering, model selection, and hyperparameter tuning. This tutorial covers two approaches: a cloud-based pipeline using Google Cloud AutoML and a local pipeline using AutoKeras. We’ll also clarify why Google Cloud AutoML cannot be run entirely locally and how AutoKeras fills that gap.
Why Google Cloud AutoML Requires the Cloud
Google Cloud AutoML is a powerful managed service but relies on cloud infrastructure:
- Architecture: It uses Google Cloud Storage for data and cloud servers for training and model management.
- APIs: The AutoML client communicates with Google’s servers, not local resources.
- Limitations: While you can preprocess data or run predictions locally, training and model management require a Google Cloud Platform (GCP) account and an internet connection.
For local AutoML, AutoKeras (built on TensorFlow and Keras) runs entirely on your machine, making it ideal for prototyping or cost-free experimentation.
Option 1: Cloud-Based AutoML with Google Cloud AutoML
This section builds an end-to-end pipeline with Google Cloud AutoML for a tabular dataset (e.g., customer churn prediction).
Prerequisites
- Basic knowledge of Python and machine learning concepts.
- A Google Cloud Platform (GCP) account.
- Installed tools: Python 3.7+, Google Cloud SDK, and Python libraries (`google-cloud-automl`, `pandas`).
Step 1: Set Up Your Environment
- Create a Google Cloud Project:
  - Go to the Google Cloud Console.
  - Create a project (e.g., `automl-pipeline-demo`).
  - Enable the Cloud AutoML API and Cloud Storage API.
- Install Google Cloud SDK:
  - Follow the official guide.
  - Authenticate with:
    ```bash
    gcloud auth login
    ```
- Install Python Libraries:
  ```bash
  pip install google-cloud-automl pandas
  ```
- Set Up Authentication:
  - Generate a service account key (IAM & Admin > Service Accounts).
  - Download the JSON key and set the environment variable:
    ```bash
    export GOOGLE_APPLICATION_CREDENTIALS="path/to/your-service-account-key.json"
    ```
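Before moving on, it's worth confirming that the credentials actually work. A minimal sanity check is to list the project's AutoML datasets — a sketch, assuming the placeholder project ID and region used throughout this tutorial:

```python
from google.cloud import automl_v1beta1 as automl

# Placeholders -- substitute your own project ID and region.
PROJECT_ID = "your-project-id"
REGION = "us-central1"

client = automl.AutoMlClient()
parent = f"projects/{PROJECT_ID}/locations/{REGION}"

# If authentication is set up correctly, this prints existing datasets
# (possibly none) instead of raising a credentials error.
for dataset in client.list_datasets(parent=parent):
    print(dataset.name, dataset.display_name)
```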
Step 2: Prepare Your Dataset
- Choose a Dataset:
  - Example: a customer churn dataset with columns `age`, `tenure`, `monthly_charges`, and `churn` (target).
  - Store it as `churn_data.csv`.
- Upload to Google Cloud Storage:
  - Create a bucket:
    ```bash
    gsutil mb gs://your-bucket-name
    ```
  - Upload the dataset:
    ```bash
    gsutil cp churn_data.csv gs://your-bucket-name/churn_data.csv
    ```
- Data Formatting:
  - Ensure the CSV has a header row.
  - The target column (`churn`) should contain categorical values (e.g., Yes/No).
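If you don't have a churn dataset on hand, a small synthetic one is enough to exercise the pipeline. A sketch — the schema matches the columns above, but the values and the churn rule are fabricated purely for testing:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Fabricated feature values -- for pipeline testing only.
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "tenure": rng.integers(0, 72, size=n),
    "monthly_charges": rng.uniform(20.0, 120.0, size=n).round(2),
})

# Toy rule: short-tenure, high-charge customers churn more often.
churn_prob = 0.2 + 0.4 * (df["tenure"] < 12) + 0.2 * (df["monthly_charges"] > 90)
df["churn"] = np.where(rng.random(n) < churn_prob, "Yes", "No")

df.to_csv("churn_data.csv", index=False)
```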
Step 3: Build the AutoML Pipeline
Create a Python script (e.g., `automl_pipeline.py`) to automate data ingestion, model training, and evaluation.
```python
# AutoML Tables metadata lives in the v1beta1 surface of google-cloud-automl.
from google.cloud import automl_v1beta1 as automl

# Project and dataset settings (placeholders -- replace with your own values)
PROJECT_ID = "your-project-id"
BUCKET_NAME = "your-bucket-name"
DATASET_NAME = "churn_prediction"
MODEL_NAME = "churn_model"
REGION = "us-central1"

# Initialize the AutoML client
client = automl.AutoMlClient()

# Step 1: Create a dataset
def create_dataset():
    dataset = {
        "display_name": DATASET_NAME,
        "tables_dataset_metadata": {},
    }
    project_location = f"projects/{PROJECT_ID}/locations/{REGION}"
    response = client.create_dataset(parent=project_location, dataset=dataset)
    print(f"Dataset created: {response.name}")
    return response.name

# Step 2: Import data from Cloud Storage
def import_data(dataset_name, gcs_uri):
    input_config = {"gcs_source": {"input_uris": [gcs_uri]}}
    operation = client.import_data(name=dataset_name, input_config=input_config)
    print("Data import started...")
    operation.result()  # block until the import completes
    print("Data imported successfully")

# Helper: look up the full resource name of the target column's spec.
# AutoML Tables requires this for target_column_spec_name; an empty
# string is rejected.
def get_target_column_spec_name(dataset_name, target_column="churn"):
    table_spec = next(iter(client.list_table_specs(parent=dataset_name)))
    for column_spec in client.list_column_specs(parent=table_spec.name):
        if column_spec.display_name == target_column:
            return column_spec.name
    raise ValueError(f"Column {target_column!r} not found in dataset")

# Step 3: Train the model
def train_model(dataset_name):
    model = {
        "display_name": MODEL_NAME,
        "dataset_id": dataset_name.split("/")[-1],
        "tables_model_metadata": {
            "target_column_spec_name": get_target_column_spec_name(dataset_name),
            # 1000 milli node hours = 1 node hour of training budget
            "train_budget_milli_node_hours": 1000,
        },
    }
    operation = client.create_model(
        parent=f"projects/{PROJECT_ID}/locations/{REGION}", model=model
    )
    print("Training started...")
    trained_model = operation.result()  # blocks until training finishes
    print(f"Model trained: {trained_model.name}")
    return trained_model.name

# Step 4: Evaluate the model
def evaluate_model(model_name):
    model = client.get_model(name=model_name)
    print(f"Model: {model.display_name}")
    print(f"Created: {model.create_time}")
    for evaluation in client.list_model_evaluations(parent=model_name):
        print(f"Evaluation metrics: {evaluation.metrics}")

# Main pipeline
def main():
    dataset_name = create_dataset()
    gcs_uri = f"gs://{BUCKET_NAME}/churn_data.csv"
    import_data(dataset_name, gcs_uri)
    model_name = train_model(dataset_name)
    evaluate_model(model_name)

if __name__ == "__main__":
    main()
```
Step 4: Run the Pipeline
- Update the Script: Replace `your-project-id` and `your-bucket-name`.
- Execute:
  ```bash
  python automl_pipeline.py
  ```
- Monitor: Training may take 1-2 hours based on dataset size and budget.
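If you'd rather not block the script for the whole run, the object returned by `create_model` is a long-running operation that can be polled. A sketch, reusing `client`, `PROJECT_ID`, `REGION`, and the `model` definition from the pipeline script:

```python
import time

# Start training without immediately blocking on the result.
operation = client.create_model(
    parent=f"projects/{PROJECT_ID}/locations/{REGION}", model=model
)

# Poll the long-running operation; log progress (or do other work)
# between checks instead of calling operation.result() right away.
while not operation.done():
    print("Training still running...")
    time.sleep(300)  # check every 5 minutes

trained_model = operation.result()
print(f"Model trained: {trained_model.name}")
```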
Step 5: Deploy and Predict
- Deploy the Model:
  - In the Google Cloud Console, go to AutoML Tables > Models.
  - Select `churn_model` and click Deploy (roughly 10-20 minutes). A scripted alternative is sketched below, after the prediction code.
- Make Predictions:
  ```python
  from google.cloud import automl_v1beta1 as automl

  def predict(model_name, input_data):
      prediction_client = automl.PredictionServiceClient()
      payload = {
          "row": {
              "values": [
                  {"number_value": input_data["age"]},
                  {"number_value": input_data["tenure"]},
                  {"number_value": input_data["monthly_charges"]},
              ]
          }
      }
      response = prediction_client.predict(name=model_name, payload=payload)
      for result in response.payload:
          print(f"Predicted class: {result.tables.value.string_value}")
          print(f"Confidence: {result.tables.score}")

  # Example usage
  model_name = "projects/your-project-id/locations/us-central1/models/your-model-id"
  input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
  predict(model_name, input_data)
  ```
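As noted above, deployment can also be scripted instead of clicked through in the console. A sketch using the same v1beta1 client as the pipeline script, where `model_name` is the full model resource name returned by training:

```python
# Deploy the trained model so it can serve online predictions.
# deploy_model returns a long-running operation, like training does.
operation = client.deploy_model(name=model_name)
print("Deployment started...")
operation.result()  # block until deployment completes
print("Model deployed")
```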
Step 6: Best Practices for Google Cloud AutoML
- Data Quality: Check for missing values and outliers before import (see the sketch after this list).
- Feature Engineering: AutoML handles basics, but manual features can boost performance.
- Hyperparameter Tuning: Adjust `train_budget_milli_node_hours`.
- Monitoring: Use Google Cloud Monitoring for performance and costs.
- Version Control: Store scripts and datasets in a repository (e.g., GitHub).
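For the data-quality bullet above, a quick pandas pass catches most problems before the CSV ever reaches Cloud Storage. A sketch, with column names assuming the churn schema from Step 2:

```python
import pandas as pd

df = pd.read_csv("churn_data.csv")

# Missing values per column
print(df.isna().sum())

# Crude outlier report: values more than 3 standard deviations
# from the column mean.
for col in ["age", "tenure", "monthly_charges"]:
    mean, std = df[col].mean(), df[col].std()
    outliers = df[(df[col] - mean).abs() > 3 * std]
    print(f"{col}: {len(outliers)} outlier rows")
```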
Option 2: Local AutoML with AutoKeras
For local execution, AutoKeras automates model architecture search and hyperparameter tuning on your machine.
Step 1: Set Up Your Environment
- Install Dependencies:
  ```bash
  pip install autokeras pandas tensorflow scikit-learn
  ```
- Hardware: A decent CPU/GPU and 8GB+ RAM.
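To check whether TensorFlow (and therefore AutoKeras) can see your GPU:

```python
import tensorflow as tf

# An empty list means training will fall back to the CPU.
print(tf.config.list_physical_devices("GPU"))
```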
Step 2: Prepare Your Dataset
- Create or Download Dataset: Use `churn_data.csv` with columns `age`, `tenure`, `monthly_charges`, and `churn` (the synthetic-data sketch from Option 1 works here too).
- Example CSV:
  ```csv
  age,tenure,monthly_charges,churn
  30,12,50.0,No
  45,24,80.0,Yes
  ...
  ```
Step 3: Build the AutoML Pipeline
Create the following script (e.g., `autokeras_pipeline.py`):

```python
import pandas as pd
import autokeras as ak
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

def load_data(file_path):
    df = pd.read_csv(file_path)
    # Encode the target as integers for training
    df['churn'] = df['churn'].map({'Yes': 1, 'No': 0})
    X = df[['age', 'tenure', 'monthly_charges']]
    y = df['churn']
    return X, y

def split_data(X, y, test_size=0.2, random_state=42):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    return X_train, X_test, y_train, y_test

def train_automl_model(X_train, y_train, max_trials=10, epochs=20):
    # max_trials bounds how many candidate architectures AutoKeras explores
    clf = ak.StructuredDataClassifier(max_trials=max_trials, overwrite=True)
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2)
    return clf

def evaluate_model(clf, X_test, y_test):
    y_pred = clf.predict(X_test).flatten().astype(int)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Test Accuracy: {accuracy:.4f}")
    print("Classification Report:")
    print(classification_report(y_test, y_pred))
    return y_pred

def save_model(clf, model_path="automl_model"):
    # Export the best model found as a standard Keras model
    model = clf.export_model()
    model.save(model_path)
    print(f"Model saved to {model_path}")

def main():
    file_path = "churn_data.csv"
    X, y = load_data(file_path)
    X_train, X_test, y_train, y_test = split_data(X, y)
    clf = train_automl_model(X_train, y_train, max_trials=10, epochs=20)
    evaluate_model(clf, X_test, y_test)
    save_model(clf, "churn_automl_model")

if __name__ == "__main__":
    main()
```
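`StructuredDataClassifier` searches a default space, but AutoKeras is also customizable: its AutoModel API lets you compose the search space from blocks yourself. A sketch of a drop-in replacement for `train_automl_model`, under the same imports as above:

```python
def train_custom_automl_model(X_train, y_train, max_trials=5, epochs=20):
    # Compose an explicit search space instead of the default one.
    input_node = ak.StructuredDataInput()
    hidden = ak.DenseBlock()(input_node)           # depth/width are searched
    output_node = ak.ClassificationHead()(hidden)  # head config is searched
    clf = ak.AutoModel(
        inputs=input_node,
        outputs=output_node,
        max_trials=max_trials,
        overwrite=True,
    )
    clf.fit(X_train, y_train, epochs=epochs, validation_split=0.2)
    return clf
```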
Step 4: Run the Pipeline
- Ensure Dataset: Place `churn_data.csv` in the script's directory.
- Execute:
  ```bash
  python autokeras_pipeline.py
  ```
- Output: Shows accuracy and classification report; saves the model.
Step 5: Make Predictions Locally
Save the following as `predict_autokeras.py`:

```python
import autokeras as ak
import pandas as pd
import tensorflow as tf

def load_model(model_path="churn_automl_model"):
    # Exported AutoKeras models may contain custom layers, so pass
    # ak.CUSTOM_OBJECTS when loading.
    model = tf.keras.models.load_model(model_path, custom_objects=ak.CUSTOM_OBJECTS)
    return model

def predict(model, input_data):
    input_df = pd.DataFrame([input_data], columns=['age', 'tenure', 'monthly_charges'])
    prediction = model.predict(input_df)
    # The model outputs a churn probability; threshold at 0.5
    return "Yes" if prediction[0][0] > 0.5 else "No"

model = load_model("churn_automl_model")
input_data = {"age": 30, "tenure": 12, "monthly_charges": 50.0}
prediction = predict(model, input_data)
print(f"Predicted churn: {prediction}")
```
Run:
```bash
python predict_autokeras.py
```
Step 6: Best Practices for AutoKeras
- Data Preprocessing: Handle missing values and categorical encoding (see the sketch after this list).
- Hyperparameter Tuning: Adjust `max_trials` and `epochs`.
- Hardware: Use a GPU for faster training.
- Model Export: Save in TensorFlow SavedModel format.
- Monitoring: Track resource usage to avoid crashes.
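For the preprocessing bullet above, a minimal sketch of imputation and one-hot encoding with pandas. Note that `contract_type` is a hypothetical column, included only to illustrate categorical encoding; the churn CSV in this tutorial has none:

```python
import pandas as pd

df = pd.read_csv("churn_data.csv")

# Impute missing numeric values with the column median.
for col in ["age", "tenure", "monthly_charges"]:
    df[col] = df[col].fillna(df[col].median())

# One-hot encode categorical feature columns, if any.
# `contract_type` is a hypothetical example.
if "contract_type" in df.columns:
    df = pd.get_dummies(df, columns=["contract_type"])
```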
Comparison: Google Cloud AutoML vs. AutoKeras
| Feature | Google Cloud AutoML | AutoKeras (Local) |
|---|---|---|
| Execution | Cloud-based | Local machine |
| Cost | Pay-per-use (GCP charges) | Free (open-source) |
| Scalability | High (cloud infrastructure) | Limited by local hardware |
| Ease of Use | Beginner-friendly, managed | Requires some ML knowledge |
| Customization | Limited to Google’s framework | Highly customizable |
| Use Case | Large datasets, production | Prototyping, small datasets |
Conclusion
Google Cloud AutoML offers a robust cloud-based solution for large-scale projects, while AutoKeras enables local AutoML pipelines for cost-free prototyping. This tutorial provides complete pipelines for both, so you can choose the right tool for your needs. Explore other local tools like H2O AutoML or TPOT for more options, and share your AutoML journey in the comments!