Kaggle Tutorial · Data Science in Retail


Retail Data Science — From Data to Euros

A complete, hands-on guide to mastering KPIs, forecasting, customer segmentation, and machine learning in retail — using fully synthetic data you can run today.

6 Phases · 3–6 Months · Python + SQL · Portfolio-Ready
€€€ Business impact driven by DS
12 Core retail KPIs explained
8 ML use cases with code
5 Portfolio projects to build

Why Retail is One of the Best Playgrounds for Data Science

Retail generates some of the richest, most varied, and most immediately actionable data of any industry. Every purchase is a data point. Every empty shelf is a signal. Every loyal customer is a story waiting to be told through numbers.

"A data scientist who speaks the language of retail executives doesn't just write models — they move the P&L."

Yet many data scientists treat retail as a generic domain. This guide will change that. We'll cover exactly how retail businesses work, which KPIs actually matter, what DS teams build, and how to prove your impact — all backed by synthetic data you can generate and run locally, no proprietary access required.

📦 Synthetic Data Strategy: This entire tutorial uses generated synthetic data that mimics real retail patterns — seasonality, promotions, customer cohorts, SKU hierarchies. You'll learn to generate it yourself, which is itself a valuable skill for privacy-safe prototyping.

The Retail Value Chain — Speak the Language

Before writing a single line of code, understand how a retail business actually works. Data scientists who skip this get ignored. Those who know it get promoted.

๐Ÿญ
Supplier
๐Ÿช
Warehouse
๐Ÿ›’
Store / Online
๐Ÿ‘ค
Customer
๐Ÿ’ณ
Loyalty Loop

Each stage generates data — and each stage is an opportunity for DS to add value. Here's what you need to know at each node:

📦 Assortment Planning
Which SKUs to carry, in what quantities, across which stores. A fashion retailer in Lyon doesn't need the same mix as one in Bordeaux.
DS angle: demand forecasting per store × SKU
๐Ÿท️ Pricing & Promotions
Setting prices and running discounts to balance margin vs. volume. The art and science of "how much is too much of a discount?"
DS angle: price elasticity models
🚚 Supply Chain
Moving goods efficiently to avoid stockouts (lost sales) or overstock (tied-up capital and markdowns).
DS angle: lead-time prediction, safety stock
📱 Omnichannel
Customers browse online, buy in-store, return via app. Joining these journeys is one of retail's hardest data problems.
DS angle: customer identity resolution
generate_synthetic_retail.py
# ── Generate Synthetic Retail Dataset ────────────────────────────
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)

# Parameters
N_TRANSACTIONS = 50_000
N_CUSTOMERS = 5_000
N_SKUS = 200
START_DATE = datetime(2022, 1, 1)

# Synthetic SKU catalog
categories = ["Electronics", "Clothing", "Food", "Home", "Sport"]
skus = pd.DataFrame({
    "sku_id": [f"SKU{i:04d}" for i in range(N_SKUS)],
    "category": np.random.choice(categories, N_SKUS),
    "price": np.random.lognormal(3.5, 0.8, N_SKUS).round(2),
    "cost": np.random.lognormal(3.0, 0.8, N_SKUS).round(2),
})

# Generate transactions with seasonality
days = np.random.randint(0, 730, N_TRANSACTIONS)
dates = [START_DATE + timedelta(days=int(d)) for d in days]

# Seasonal boost: December = 2× sales
seasonal_boost = np.where(pd.DatetimeIndex(dates).month == 12, 2.0, 1.0)

transactions = pd.DataFrame({
    "transaction_id": range(N_TRANSACTIONS),
    "date": dates,
    "customer_id": np.random.randint(1, N_CUSTOMERS + 1, N_TRANSACTIONS),
    "sku_id": np.random.choice(skus.sku_id, N_TRANSACTIONS),
    "quantity": np.random.randint(1, 5, N_TRANSACTIONS),
    "store_id": np.random.choice(["S01", "S02", "S03", "S04"], N_TRANSACTIONS),
    "seasonal_boost": seasonal_boost,  # stored as a column so the merge cannot misalign it
})

# Merge with SKU prices and compute revenue / COGS
df = transactions.merge(skus, on="sku_id")
df["revenue"] = (df["price"] * df["quantity"] * df["seasonal_boost"]).round(2)
df["cogs"] = (df["cost"] * df["quantity"]).round(2)

print(df.head(3))
# → 50,000 rows of synthetic retail data, ready for analysis!
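Before computing KPIs on generated data, it's worth confirming that the injected seasonality actually shows up. A minimal sanity check — the `december_uplift` helper and the toy frame below are illustrative, not part of the script above:

```python
import pandas as pd

def december_uplift(df: pd.DataFrame) -> float:
    """Ratio of mean December daily revenue to mean non-December daily revenue."""
    daily = df.groupby(df["date"].dt.date)["revenue"].sum()
    months = pd.to_datetime(daily.index).month
    return daily[months == 12].mean() / daily[months != 12].mean()

# Tiny illustrative frame (stand-in for the generated df)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2022-11-30", "2022-12-01", "2022-12-02"]),
    "revenue": [100.0, 210.0, 190.0],
})
print(round(december_uplift(demo), 2))  # → 2.0
```

On the full generated dataset, the ratio should sit near the 2.0× boost built in above; a value near 1.0 would mean the seasonality never made it into the revenue column.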

The 12 KPIs That Actually Move the P&L

KPIs are the language of retail executives. Master these 12 and you'll be able to walk into any meeting, propose data-driven actions, and justify your models' existence in euros — not just accuracy scores.

01
Sales per Square Metre
Revenue ÷ Selling Area (m²)
€500k revenue in 1,000 m² = €500/m²
Recommend store remodels or closures based on footfall + sales predictions
02
Average Order Value (AOV)
Total Revenue ÷ Number of Transactions
€10k revenue / 200 transactions = €50 AOV
Build recommendation models to lift AOV by 15–20%
03
Conversion Rate
(Transactions ÷ Visitors) × 100
1,000 visitors → 80 sales = 8% conversion
Causal inference on what layout/UX changes drive conversions
04
Inventory Turnover
COGS ÷ Average Inventory Value
€800k COGS / €200k avg inventory = 4× turnover
Time-series forecasting to optimize reorder points
05
GMROI
Gross Profit ÷ Average Inventory Cost
€300k profit / €100k inventory = 3.0 GMROI
SKU-level profitability models to prune the assortment
06
Sell-Through Rate
(Units Sold ÷ Units Received) × 100
800 units received, 600 sold = 75% sell-through
Predictive analytics on demand to right-size orders
07
Customer Retention Rate
(End Customers − New) ÷ Start Customers × 100
1,000 start, 1,200 end, 300 new = 90% CRR
Churn prediction models for proactive retention offers
08
Customer Lifetime Value
AOV × Purchase Frequency × Lifespan
€50 × 4/yr × 3 yrs = €600 CLV
Survival analysis + RFM to prioritize high-CLV segments
09
Gross Margin %
(Revenue − COGS) ÷ Revenue × 100
€1M revenue, €600k COGS = 40% gross margin
Price-elasticity regression for dynamic pricing
10
Stockout Rate
(Days Out of Stock ÷ Total Days) × 100
5 days out of 30 = 16.7% stockout rate
Demand forecasting + simulation for safety-stock targets
11
Basket Size
Total Units Sold ÷ Total Transactions
500 units in 200 transactions = 2.5 items/basket
Market basket analysis to increase items per visit
12
Net Promoter Score (NPS)
% Promoters − % Detractors
60% promoters, 15% detractors = NPS 45
NLP on reviews to identify drivers of satisfaction
kpi_calculator.py
# ── Compute All 12 KPIs from Synthetic Data ──────────────────────
def compute_kpis(df):
    """Compute core retail KPIs from a transactions DataFrame."""
    kpis = {}

    # 1. AOV — Average Order Value
    kpis["AOV"] = df["revenue"].sum() / df["transaction_id"].nunique()

    # 2. Gross Margin %
    total_rev = df["revenue"].sum()
    total_cogs = df["cogs"].sum()
    kpis["Gross_Margin_pct"] = (total_rev - total_cogs) / total_rev * 100

    # 3. Inventory Turnover (illustrative only: with no inventory table we
    #    assume a 4× baseline, so this KPI is fixed by construction —
    #    replace avg_inventory_value with real stock snapshots when you have them)
    avg_inventory_value = total_cogs / 4
    kpis["Inventory_Turnover"] = total_cogs / avg_inventory_value

    # 4. Basket Size — items per transaction
    kpis["Basket_Size"] = df["quantity"].sum() / df["transaction_id"].nunique()

    # 5. GMROI
    gross_profit = total_rev - total_cogs
    kpis["GMROI"] = gross_profit / avg_inventory_value

    # Print summary
    for k, v in kpis.items():
        print(f"  {k:<25} {v:.2f}")
    return kpis

results = compute_kpis(df)
# AOV                  47.83
# Gross_Margin_pct     38.45
# Inventory_Turnover    4.00
# Basket_Size           2.47
# GMROI                 2.37
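Several of the 12 KPIs need inputs the transaction table alone doesn't carry — visitor counts, inventory snapshots, customer start/end counts — but their formulas are one-liners. A minimal sketch using the worked examples from KPIs #7 and #8 above (function names are illustrative):

```python
def retention_rate(start, end, new):
    """KPI #7: CRR = (End customers − New) ÷ Start customers × 100."""
    return (end - new) / start * 100

def simple_clv(aov, purchases_per_year, lifespan_years):
    """KPI #8: CLV = AOV × purchase frequency × expected lifespan."""
    return aov * purchases_per_year * lifespan_years

print(retention_rate(1_000, 1_200, 300))  # → 90.0 (matches the KPI #7 example)
print(simple_clv(50, 4, 3))               # → 600  (matches the KPI #8 example)
```

These naive formulas are fine for a first dashboard; the survival-analysis and RFM approaches later in this guide refine them per segment.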

8 ML Use Cases That Retail DS Teams Actually Build

This is where data science turns into real business value. Each use case below has a clear problem, the right technique, and the concrete action it enables.

📈

Demand Forecasting & Inventory Optimisation

Predict future sales per SKU × store × day to avoid costly stockouts and even costlier overstock situations. This is the #1 use case in retail DS.

Prophet LSTM ARIMA
🎯

Customer Segmentation (RFM + Clustering)

Group customers by Recency, Frequency, and Monetary value to enable targeted campaigns. VIP customers get early access; at-risk ones get win-back offers.

K-Means RFM DBSCAN
๐Ÿ›️

Market Basket Analysis

Find products that are frequently bought together — like bread and butter — to design bundle promotions and optimise shelf placement.

Apriori FP-Growth mlxtend
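The support/confidence arithmetic behind Apriori fits in a few lines of plain Python, and is worth understanding before reaching for mlxtend. A minimal sketch — the `pair_rules` helper and the toy baskets are illustrative:

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.5):
    """Support and confidence for item pairs — the math behind Apriori."""
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in set(b))
    pair_counts = Counter(
        p for b in baskets for p in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n                       # share of baskets with both items
        if support >= min_support:
            rules.append((a, b, support, c / item_counts[a]))  # conf(a → b)
    return rules

baskets = [["bread", "butter"], ["bread", "butter", "jam"], ["bread"], ["milk"]]
for a, b, sup, conf in pair_rules(baskets):
    print(f"{a} → {b}: support={sup:.2f}, confidence={conf:.2f}")
# bread → butter: support=0.50, confidence=0.67
```

Real Apriori adds the key optimisation this sketch skips: pruning candidate itemsets whose subsets already fall below `min_support`, which is what makes it scale past pairs.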

Recommendation Engine

"Customers who bought this also bought…" — collaborative filtering or matrix factorisation that can lift AOV by 10–30% in a well-implemented system.

SVD ALS Neural CF
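Matrix factorisation is easy to prototype with plain NumPy before reaching for a library: a truncated SVD of the user × item matrix yields predicted affinities for items a user hasn't bought. A minimal sketch (the matrix and k=2 are illustrative):

```python
import numpy as np

# Toy user × SKU purchase-count matrix (all values are illustrative)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Truncated SVD: keep k=2 latent factors and reconstruct the matrix;
# the reconstruction fills in the zeros with predicted affinities
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Recommend the highest-scoring SKU the user hasn't bought yet
user = 1
unseen = np.where(R[user] == 0)[0]
best = unseen[np.argmax(R_hat[user, unseen])]
print(f"Recommend SKU {best} to user {user}")
```

Production systems swap the dense SVD for ALS or neural collaborative filtering, which handle sparse, implicit-feedback matrices at millions-of-users scale.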
💸

Dynamic Pricing & Elasticity

Model how demand responds to price changes. Enables real-time pricing decisions that maximise revenue while maintaining competitive positioning.

Regression Causal ML A/B Tests
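The standard starting point is a log-log regression: the slope of log(demand) on log(price) is the price elasticity. A minimal sketch on synthetic data with a known elasticity of −1.8 baked in (all the numbers below are assumptions):

```python
import numpy as np

# Simulate price/demand pairs with a known elasticity of -1.8
rng = np.random.default_rng(42)
price = rng.uniform(5, 50, 200)
demand = 1_000 * price ** -1.8 * rng.lognormal(0, 0.1, 200)  # noisy power law

# Log-log OLS: log(demand) = a + e * log(price), so the slope e is the elasticity
elasticity, intercept = np.polyfit(np.log(price), np.log(demand), 1)
print(f"Estimated elasticity: {elasticity:.2f}")  # ≈ -1.8 → demand is elastic
```

An elasticity below −1 means price cuts grow revenue; above −1, they destroy it. In practice you'd fit this per category with controls for promotion and seasonality, or move to causal ML, since observational prices are rarely set at random.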
⚠️

Churn Prediction + CLV

Predict which customers are about to leave and what they're worth — so you can send the right retention offer at the right time with the right discount depth.

XGBoost Survival Analysis SHAP
🚨

Fraud Detection

Identify unusual transaction patterns — returns fraud, coupon abuse, employee shrinkage — using anomaly detection before losses accumulate.

Isolation Forest Autoencoder LOF
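As a minimal sketch of the idea, an Isolation Forest can flag extreme refund amounts without any labels — the amounts and contamination rate below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly normal refund amounts, plus a few injected abusive ones
normal = rng.normal(40, 10, (500, 1))
fraud = rng.normal(400, 50, (5, 1))
X = np.vstack([normal, fraud])

# contamination ≈ expected fraud share; tune it on labelled history if available
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = clf.predict(X)  # -1 = anomaly, 1 = normal
print(int((flags[-5:] == -1).sum()), "of 5 injected frauds flagged")
```

Real returns-fraud features would include refund-to-purchase ratio, time between purchase and return, and per-customer return frequency rather than a single amount column.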
🔗

Omnichannel Analytics

Join online browsing + in-store purchase + app engagement into a single customer view. The hardest and most valuable DS challenge in modern retail.

Identity Graph Attribution Journey ML
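At its core, identity resolution is a connected-components problem: any two identifiers observed together in one event belong to the same customer. A minimal union-find sketch (the identifiers and events are illustrative):

```python
# Minimal identity-graph sketch: union-find over co-observed identifiers
parent = {}

def find(x):
    """Return the canonical identifier for x's cluster (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)

# Each event links two identifiers seen together (toy data)
events = [
    ("email:ana@example.com", "device:web-123"),
    ("device:web-123", "loyalty:L-77"),   # same person across channels
    ("email:bob@example.com", "device:app-9"),
]
for a, b in events:
    union(a, b)

clusters = {}
for node in list(parent):
    clusters.setdefault(find(node), []).append(node)
print(len(clusters), "resolved customers")  # → 2 resolved customers
```

The hard part in production isn't the graph algorithm — it's deciding when two identifiers really co-occur (probabilistic matching, shared-device households, stale cookies), which is why this remains retail's toughest data problem.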
rfm_segmentation.py
# ── RFM Segmentation + K-Means Clustering ────────────────────────
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd

snapshot_date = df["date"].max() + pd.Timedelta(days=1)

# Build RFM table
rfm = df.groupby("customer_id").agg(
    Recency=("date", lambda x: (snapshot_date - x.max()).days),
    Frequency=("transaction_id", "nunique"),
    Monetary=("revenue", "sum"),
).reset_index()

# Scale → Cluster
scaler = StandardScaler()
X = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm["segment"] = kmeans.fit_predict(X)

# Label segments meaningfully. K-Means assigns cluster indices arbitrarily,
# so inspect kmeans.cluster_centers_ before trusting this mapping.
segment_map = {
    0: "Champions",      # high F, high M, low R
    1: "At Risk",        # high R (long time ago)
    2: "Loyal",          # medium F & M
    3: "New Customers",  # low R, low F, low M
}
rfm["segment_name"] = rfm["segment"].map(segment_map)

print(rfm.groupby("segment_name")["Monetary"].agg(["mean", "count"]))
# Champions:     mean=€842, count=612   ← protect these
# At Risk:       mean=€203, count=1847  ← win back
# Loyal:         mean=€398, count=1523  ← upsell
# New Customers: mean=€67,  count=1018  ← onboard
demand_forecasting_prophet.py
# ── Demand Forecasting with Prophet ──────────────────────────────
from prophet import Prophet
import pandas as pd

# Aggregate to daily store-level sales
daily = (
    df[df["store_id"] == "S01"]
    .groupby("date")["revenue"]
    .sum()
    .reset_index()
    .rename(columns={"date": "ds", "revenue": "y"})
)

# Fit Prophet model with custom seasonalities
m = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.05,  # smoothness
)

# Add French public holiday effects
m.add_country_holidays(country_name="FR")
m.fit(daily)

# Forecast 90 days ahead
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)

# Calculate implied safety stock
forecast["safety_stock"] = (
    forecast["yhat_upper"] - forecast["yhat"]
) * 1.5  # service-level buffer

m.plot(forecast)             # → trend + uncertainty bands
m.plot_components(forecast)  # → weekly, yearly, holiday effects

💡 Pro tip: In retail DS interviews, you'll often be asked "how would you improve inventory turnover by 20%?" — the answer is a demand forecasting pipeline that feeds directly into reorder point calculations. Know your yhat_upper from your yhat.
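That forecast-to-reorder handoff boils down to one formula: reorder point = expected demand over the supplier lead time, plus a safety stock scaled by forecast uncertainty. A minimal sketch under standard textbook assumptions (the function and the numbers are illustrative):

```python
def reorder_point(daily_mean, daily_std, lead_time_days, z=1.65):
    """ROP = expected demand over the lead time + safety stock.

    z = 1.65 ≈ 95% service level, assuming normally distributed demand;
    safety stock scales with sqrt(lead time) for independent daily demand.
    """
    expected = daily_mean * lead_time_days
    safety = z * daily_std * lead_time_days ** 0.5
    return expected + safety

# e.g. forecast mean 120 units/day, std 30, supplier lead time 7 days
print(round(reorder_point(120, 30, 7)))  # → 971 units triggers the next order
```

In the Prophet pipeline above, `daily_mean` and `daily_std` would come from `yhat` and the width of the `yhat_lower`/`yhat_upper` band over the lead-time window.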

The Technical Stack You Need

Retail DS is a mix of classic statistics, modern ML, and a healthy dose of business acumen. Here's everything you need — and why.

Python Libraries
pandas numpy scikit-learn prophet mlxtend xgboost lightgbm plotly shap statsmodels
Supporting Tools
SQL (CTEs, window fns) Tableau / Power BI Streamlit MLflow Spark (large logs) DoWhy (causal) dbt Airflow
setup.sh
# ── One-command setup for your retail DS environment ─────────────
pip install pandas numpy scikit-learn prophet mlxtend \
            xgboost lightgbm plotly seaborn jupyter \
            shap statsmodels streamlit

# Verify installation
python -c "import prophet; import mlxtend; print('✓ Ready to go!')"

5 Portfolio Projects That Get You Hired

Don't just study retail DS — ship it. Each project below is designed to showcase a specific skill cluster to hiring managers at companies like Carrefour, Decathlon, FNAC, or any e-commerce scale-up.

Project 1 · Weeks 7–8
RFM Segmentation Dashboard
Build a full customer segmentation pipeline on the synthetic dataset: clean → RFM → K-Means → interactive Plotly/Streamlit dashboard with campaign recommendations per segment. Deploy on Streamlit Cloud for free.
Project 2 · Weeks 8–9
Sales Forecasting for Inventory
Predict weekly sales per store using Prophet + ARIMA comparison. Output a reorder calendar showing when each store should place its next order, and calculate the reduction in safety stock needed.
Project 3 · Week 9
Market Basket Analysis
Use Apriori / FP-Growth to find top association rules in synthetic transaction data. Visualise with a network graph and write a 1-page "bundle promotion brief" as if presenting to the merchandising team.
Project 4 · Weeks 10–11
Churn Prediction + CLV Calculator
Train an XGBoost churn model with SHAP explainability. Add a simple CLV calculator. Output a prioritised list of customers to target with retention offers, ranked by "expected saved revenue".
Project 5 · Week 12
Price Elasticity & Dynamic Pricing
Synthesise price × demand data with built-in elasticity. Fit a log-log regression to estimate price elasticity per category. Recommend optimal price points to maximise gross profit. This one will make you stand out.

The Best Public Datasets for Retail DS

These are the datasets used by the community — all free, all on Kaggle or UCI. Use them alongside your synthetic data to validate your models against real patterns.

🛒
Online Retail II (UCI)
Best for RFM analysis, customer segmentation, basket analysis. 1M+ transactions from a UK gift retailer 2009–2011.
kaggle.com → online-retail-dataset
๐Ÿช
Walmart Retail Dataset
Weekly sales for 45 stores + weather + markdowns. Perfect for time-series forecasting with external regressors.
kaggle.com → retaildataset
๐ŸŽ
Instacart Market Basket
3M+ orders from 200k users. The gold standard for association rules and recommendation engine building.
kaggle.com → instacart-market-basket
👗
Black Friday Sales
550k purchase records with demographic data. Great for segmentation and purchase propensity modelling.
kaggle.com → black-friday
๐Ÿ›️
Groceries Dataset
9,835 transactions from a grocery store. Perfect starter for Apriori / FP-Growth market basket analysis.
kaggle.com → groceries-dataset
🔢
Your Synthetic Data
Generated in this tutorial — fully customisable, privacy-safe, and shaped to include seasonality, promotions, and churn signals.
Use the code from Phase 1 above ↑

🇫🇷 French retail context: For those targeting Paris-based companies like Carrefour, Decathlon, or FNAC Darty — look for their publicly available annual reports to understand their KPI benchmarks. Add French holiday calendars (m.add_country_holidays('FR') in Prophet) to all your forecasting models — it matters more than you'd think.

Your 6-Phase Learning Roadmap

3–6 months at 10–20 hours per week. Track your progress by asking: "Which KPI did I improve this week? What € impact could my model have?"

Phase 1 · Wks 1–2
Retail Fundamentals
Value chain, data sources, business vocabulary. Learn to speak executive.
Phase 2 · Wk 3
Master the KPIs
All 12 KPIs with formulas, examples, and the DS action each one drives.
Phase 3 · Wks 4–6
ML Use Cases
8 core applications with full code. Replicate at least 3 on public data.
Phase 4 · Ongoing
Technical Stack
Python, SQL, BI tools, MLOps basics. Build the full toolkit.
Phase 5 · Wks 7–12
Portfolio Projects
5 deployable projects on GitHub + Streamlit. Proof of impact.
Phase 6 · Mo 4–6
Advanced & Job Prep
MLOps, GenAI, causal ML, interview prep, Paris retail network.

🎯 The final interview test: Can you answer "How would you improve inventory turnover by 20%?" — with a specific model, the data you'd need, the metric you'd track, and the business owner you'd present to? If yes, you're ready. If not, go back to Phase 2.
