Kaggle Tutorial · Data Science in Retail


Retail Data Science — From Data to Euros

A complete, hands-on guide to mastering KPIs, forecasting, customer segmentation, and machine learning in retail — using fully synthetic data you can run today.

6 Phases · 3–6 Months · Python + SQL · Portfolio-Ready
€€€ Business impact driven by DS
12 Core retail KPIs explained
8 ML use cases with code
5 Portfolio projects to build

Why Retail is One of the Best Playgrounds for Data Science

Retail generates some of the richest, most varied, and most immediately actionable data of any industry. Every purchase is a data point. Every empty shelf is a signal. Every loyal customer is a story waiting to be told through numbers.

"A data scientist who speaks the language of retail executives doesn't just write models — they move the P&L."

Yet many data scientists treat retail as a generic domain. This guide will change that. We'll cover exactly how retail businesses work, which KPIs actually matter, what DS teams build, and how to prove your impact — all backed by synthetic data you can generate and run locally, no proprietary access required.

📦 Synthetic Data Strategy: This entire tutorial uses generated synthetic data that mimics real retail patterns — seasonality, promotions, customer cohorts, SKU hierarchies. You'll learn to generate it yourself, which is itself a valuable skill for privacy-safe prototyping.

The Retail Value Chain — Speak the Language

Before writing a single line of code, understand how a retail business actually works. Data scientists who skip this get ignored. Those who know it get promoted.

๐Ÿญ
Supplier
๐Ÿช
Warehouse
๐Ÿ›’
Store / Online
๐Ÿ‘ค
Customer
๐Ÿ’ณ
Loyalty Loop

Each stage generates data — and each stage is an opportunity for DS to add value. Here's what you need to know at each node:

📦 Assortment Planning
Which SKUs to carry, in what quantities, across which stores. A fashion retailer in Lyon doesn't need the same mix as one in Bordeaux.
DS angle: demand forecasting per store × SKU
๐Ÿท️ Pricing & Promotions
Setting prices and running discounts to balance margin vs. volume. The art and science of "how much is too much of a discount?"
DS angle: price elasticity models
🚚 Supply Chain
Moving goods efficiently to avoid stockouts (lost sales) or overstock (tied-up capital and markdowns).
DS angle: lead-time prediction, safety stock
📱 Omnichannel
Customers browse online, buy in-store, return via app. Joining these journeys is one of retail's hardest data problems.
DS angle: customer identity resolution
generate_synthetic_retail.py
# ── Generate Synthetic Retail Dataset ────────────────────────────
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

np.random.seed(42)

# Parameters
N_TRANSACTIONS = 50_000
N_CUSTOMERS = 5_000
N_SKUS = 200
START_DATE = datetime(2022, 1, 1)

# Synthetic SKU catalog
categories = ["Electronics", "Clothing", "Food", "Home", "Sport"]
skus = pd.DataFrame({
    "sku_id": [f"SKU{i:04d}" for i in range(N_SKUS)],
    "category": np.random.choice(categories, N_SKUS),
    "price": np.random.lognormal(3.5, 0.8, N_SKUS).round(2),
    "cost": np.random.lognormal(3.0, 0.8, N_SKUS).round(2),
})

# Generate transactions with seasonality
days = np.random.randint(0, 730, N_TRANSACTIONS)
dates = [START_DATE + timedelta(days=int(d)) for d in days]

# Seasonal boost: December = 2× sales
seasonal_boost = np.where(pd.DatetimeIndex(dates).month == 12, 2.0, 1.0)

transactions = pd.DataFrame({
    "transaction_id": range(N_TRANSACTIONS),
    "date": dates,
    "customer_id": np.random.randint(1, N_CUSTOMERS + 1, N_TRANSACTIONS),
    "sku_id": np.random.choice(skus.sku_id, N_TRANSACTIONS),
    "quantity": np.random.randint(1, 5, N_TRANSACTIONS),
    "store_id": np.random.choice(["S01", "S02", "S03", "S04"], N_TRANSACTIONS),
    "seasonal_boost": seasonal_boost,  # stored as a column so the merge cannot misalign it
})

# Merge with SKU prices and compute revenue / COGS
df = transactions.merge(skus, on="sku_id")
df["revenue"] = (df["price"] * df["quantity"] * df["seasonal_boost"]).round(2)
df["cogs"] = (df["cost"] * df["quantity"]).round(2)

print(df.head(3))
# → 50,000 rows of synthetic retail data, ready for analysis!
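Before computing KPIs on generated data, it's worth confirming that the injected seasonality actually shows up. A minimal sanity check — the `december_uplift` helper and the toy frame below are illustrative, not part of the script above:

```python
import pandas as pd

def december_uplift(df: pd.DataFrame) -> float:
    """Ratio of mean December daily revenue to mean non-December daily revenue."""
    daily = df.groupby(df["date"].dt.date)["revenue"].sum()
    months = pd.to_datetime(daily.index).month
    return daily[months == 12].mean() / daily[months != 12].mean()

# Tiny illustrative frame (stand-in for the generated df)
demo = pd.DataFrame({
    "date": pd.to_datetime(["2022-11-30", "2022-12-01", "2022-12-02"]),
    "revenue": [100.0, 210.0, 190.0],
})
print(round(december_uplift(demo), 2))  # → 2.0
```

On the full generated dataset, the ratio should sit near the 2.0× boost built in above; a value near 1.0 would mean the seasonality never made it into the revenue column.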

The 12 KPIs That Actually Move the P&L

KPIs are the language of retail executives. Master these 12 and you'll be able to walk into any meeting, propose data-driven actions, and justify your models' existence in euros — not just accuracy scores.

01
Sales per Square Metre
Revenue ÷ Selling Area (m²)
€500k revenue in 1,000 m² = €500/m²
Recommend store remodels or closures based on footfall + sales predictions
02
Average Order Value (AOV)
Total Revenue ÷ Number of Transactions
€10k revenue / 200 transactions = €50 AOV
Build recommendation models to lift AOV by 15–20%
03
Conversion Rate
(Transactions ÷ Visitors) × 100
1,000 visitors → 80 sales = 8% conversion
Causal inference on what layout/UX changes drive conversions
04
Inventory Turnover
COGS ÷ Average Inventory Value
€800k COGS / €200k avg inventory = 4× turnover
Time-series forecasting to optimize reorder points
05
GMROI
Gross Profit ÷ Average Inventory Cost
€300k profit / €100k inventory = 3.0 GMROI
SKU-level profitability models to prune the assortment
06
Sell-Through Rate
(Units Sold ÷ Units Received) × 100
800 units received, 600 sold = 75% sell-through
Predictive analytics on demand to right-size orders
07
Customer Retention Rate
(End Customers − New) ÷ Start Customers × 100
1,000 start, 1,200 end, 300 new = 90% CRR
Churn prediction models for proactive retention offers
08
Customer Lifetime Value
AOV × Purchase Frequency × Lifespan
€50 × 4/yr × 3 yrs = €600 CLV
Survival analysis + RFM to prioritize high-CLV segments
09
Gross Margin %
(Revenue − COGS) ÷ Revenue × 100
€1M revenue, €600k COGS = 40% gross margin
Price-elasticity regression for dynamic pricing
10
Stockout Rate
(Days Out of Stock ÷ Total Days) × 100
5 days out of 30 = 16.7% stockout rate
Demand forecasting + simulation for safety-stock targets
11
Basket Size
Total Units Sold ÷ Total Transactions
500 units in 200 transactions = 2.5 items/basket
Market basket analysis to increase items per visit
12
Net Promoter Score (NPS)
% Promoters − % Detractors
60% promoters, 15% detractors = NPS 45
NLP on reviews to identify drivers of satisfaction
kpi_calculator.py
# ── Compute All 12 KPIs from Synthetic Data ──────────────────────
def compute_kpis(df):
    """Compute core retail KPIs from a transactions DataFrame."""
    kpis = {}

    # 1. AOV — Average Order Value
    kpis["AOV"] = df["revenue"].sum() / df["transaction_id"].nunique()

    # 2. Gross Margin %
    total_rev = df["revenue"].sum()
    total_cogs = df["cogs"].sum()
    kpis["Gross_Margin_pct"] = (total_rev - total_cogs) / total_rev * 100

    # 3. Inventory Turnover (illustrative only: with no inventory table we
    #    assume a 4× baseline, so this KPI is fixed by construction —
    #    replace avg_inventory_value with real stock snapshots when you have them)
    avg_inventory_value = total_cogs / 4
    kpis["Inventory_Turnover"] = total_cogs / avg_inventory_value

    # 4. Basket Size — items per transaction
    kpis["Basket_Size"] = df["quantity"].sum() / df["transaction_id"].nunique()

    # 5. GMROI
    gross_profit = total_rev - total_cogs
    kpis["GMROI"] = gross_profit / avg_inventory_value

    # Print summary
    for k, v in kpis.items():
        print(f"  {k:<25} {v:.2f}")
    return kpis

results = compute_kpis(df)
# AOV                  47.83
# Gross_Margin_pct     38.45
# Inventory_Turnover    4.00
# Basket_Size           2.47
# GMROI                 2.37
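Several of the 12 KPIs need inputs the transaction table alone doesn't carry — visitor counts, inventory snapshots, customer start/end counts — but their formulas are one-liners. A minimal sketch using the worked examples from KPIs #7 and #8 above (function names are illustrative):

```python
def retention_rate(start, end, new):
    """KPI #7: CRR = (End customers − New) ÷ Start customers × 100."""
    return (end - new) / start * 100

def simple_clv(aov, purchases_per_year, lifespan_years):
    """KPI #8: CLV = AOV × purchase frequency × expected lifespan."""
    return aov * purchases_per_year * lifespan_years

print(retention_rate(1_000, 1_200, 300))  # → 90.0 (matches the KPI #7 example)
print(simple_clv(50, 4, 3))               # → 600  (matches the KPI #8 example)
```

These naive formulas are fine for a first dashboard; the survival-analysis and RFM approaches later in this guide refine them per segment.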

8 ML Use Cases That Retail DS Teams Actually Build

This is where data science turns into real business value. Each use case below has a clear problem, the right technique, and the concrete action it enables.

📈

Demand Forecasting & Inventory Optimisation

Predict future sales per SKU × store × day to avoid costly stockouts and even costlier overstock situations. This is the #1 use case in retail DS.

Prophet LSTM ARIMA
🎯

Customer Segmentation (RFM + Clustering)

Group customers by Recency, Frequency, and Monetary value to enable targeted campaigns. VIP customers get early access; at-risk ones get win-back offers.

K-Means RFM DBSCAN
๐Ÿ›️

Market Basket Analysis

Find products that are frequently bought together — like bread and butter — to design bundle promotions and optimise shelf placement.

Apriori FP-Growth mlxtend
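The support/confidence arithmetic behind Apriori fits in a few lines of plain Python, and is worth understanding before reaching for mlxtend. A minimal sketch — the `pair_rules` helper and the toy baskets are illustrative:

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.5):
    """Support and confidence for item pairs — the math behind Apriori."""
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in set(b))
    pair_counts = Counter(
        p for b in baskets for p in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (a, b), c in pair_counts.items():
        support = c / n                       # share of baskets with both items
        if support >= min_support:
            rules.append((a, b, support, c / item_counts[a]))  # conf(a → b)
    return rules

baskets = [["bread", "butter"], ["bread", "butter", "jam"], ["bread"], ["milk"]]
for a, b, sup, conf in pair_rules(baskets):
    print(f"{a} → {b}: support={sup:.2f}, confidence={conf:.2f}")
# bread → butter: support=0.50, confidence=0.67
```

Real Apriori adds the key optimisation this sketch skips: pruning candidate itemsets whose subsets already fall below `min_support`, which is what makes it scale past pairs.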

Recommendation Engine

"Customers who bought this also bought…" — collaborative filtering or matrix factorisation that can lift AOV by 10–30% in a well-implemented system.

SVD ALS Neural CF
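Matrix factorisation is easy to prototype with plain NumPy before reaching for a library: a truncated SVD of the user × item matrix yields predicted affinities for items a user hasn't bought. A minimal sketch (the matrix and k=2 are illustrative):

```python
import numpy as np

# Toy user × SKU purchase-count matrix (all values are illustrative)
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)

# Truncated SVD: keep k=2 latent factors and reconstruct the matrix;
# the reconstruction fills in the zeros with predicted affinities
U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Recommend the highest-scoring SKU the user hasn't bought yet
user = 1
unseen = np.where(R[user] == 0)[0]
best = unseen[np.argmax(R_hat[user, unseen])]
print(f"Recommend SKU {best} to user {user}")
```

Production systems swap the dense SVD for ALS or neural collaborative filtering, which handle sparse, implicit-feedback matrices at millions-of-users scale.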
💸

Dynamic Pricing & Elasticity

Model how demand responds to price changes. Enables real-time pricing decisions that maximise revenue while maintaining competitive positioning.

Regression Causal ML A/B Tests
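The standard starting point is a log-log regression: the slope of log(demand) on log(price) is the price elasticity. A minimal sketch on synthetic data with a known elasticity of −1.8 baked in (all the numbers below are assumptions):

```python
import numpy as np

# Simulate price/demand pairs with a known elasticity of -1.8
rng = np.random.default_rng(42)
price = rng.uniform(5, 50, 200)
demand = 1_000 * price ** -1.8 * rng.lognormal(0, 0.1, 200)  # noisy power law

# Log-log OLS: log(demand) = a + e * log(price), so the slope e is the elasticity
elasticity, intercept = np.polyfit(np.log(price), np.log(demand), 1)
print(f"Estimated elasticity: {elasticity:.2f}")  # ≈ -1.8 → demand is elastic
```

An elasticity below −1 means price cuts grow revenue; above −1, they destroy it. In practice you'd fit this per category with controls for promotion and seasonality, or move to causal ML, since observational prices are rarely set at random.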
⚠️

Churn Prediction + CLV

Predict which customers are about to leave and what they're worth — so you can send the right retention offer at the right time with the right discount depth.

XGBoost Survival Analysis SHAP
🚨

Fraud Detection

Identify unusual transaction patterns — returns fraud, coupon abuse, employee shrinkage — using anomaly detection before losses accumulate.

Isolation Forest Autoencoder LOF
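As a minimal sketch of the idea, an Isolation Forest can flag extreme refund amounts without any labels — the amounts and contamination rate below are assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Mostly normal refund amounts, plus a few injected abusive ones
normal = rng.normal(40, 10, (500, 1))
fraud = rng.normal(400, 50, (5, 1))
X = np.vstack([normal, fraud])

# contamination ≈ expected fraud share; tune it on labelled history if available
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = clf.predict(X)  # -1 = anomaly, 1 = normal
print(int((flags[-5:] == -1).sum()), "of 5 injected frauds flagged")
```

Real returns-fraud features would include refund-to-purchase ratio, time between purchase and return, and per-customer return frequency rather than a single amount column.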
🔗

Omnichannel Analytics

Join online browsing + in-store purchase + app engagement into a single customer view. The hardest and most valuable DS challenge in modern retail.

Identity Graph Attribution Journey ML
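At its core, identity resolution is a connected-components problem: any two identifiers observed together in one event belong to the same customer. A minimal union-find sketch (the identifiers and events are illustrative):

```python
# Minimal identity-graph sketch: union-find over co-observed identifiers
parent = {}

def find(x):
    """Return the canonical identifier for x's cluster (with path halving)."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(a, b):
    """Merge the clusters containing a and b."""
    parent[find(a)] = find(b)

# Each event links two identifiers seen together (toy data)
events = [
    ("email:ana@example.com", "device:web-123"),
    ("device:web-123", "loyalty:L-77"),   # same person across channels
    ("email:bob@example.com", "device:app-9"),
]
for a, b in events:
    union(a, b)

clusters = {}
for node in list(parent):
    clusters.setdefault(find(node), []).append(node)
print(len(clusters), "resolved customers")  # → 2 resolved customers
```

The hard part in production isn't the graph algorithm — it's deciding when two identifiers really co-occur (probabilistic matching, shared-device households, stale cookies), which is why this remains retail's toughest data problem.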
rfm_segmentation.py
# ── RFM Segmentation + K-Means Clustering ────────────────────────
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd

snapshot_date = df["date"].max() + pd.Timedelta(days=1)

# Build RFM table
rfm = df.groupby("customer_id").agg(
    Recency=("date", lambda x: (snapshot_date - x.max()).days),
    Frequency=("transaction_id", "nunique"),
    Monetary=("revenue", "sum"),
).reset_index()

# Scale → Cluster
scaler = StandardScaler()
X = scaler.fit_transform(rfm[["Recency", "Frequency", "Monetary"]])
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
rfm["segment"] = kmeans.fit_predict(X)

# Label segments meaningfully. K-Means assigns cluster indices arbitrarily,
# so inspect kmeans.cluster_centers_ before trusting this mapping.
segment_map = {
    0: "Champions",      # high F, high M, low R
    1: "At Risk",        # high R (long time ago)
    2: "Loyal",          # medium F & M
    3: "New Customers",  # low R, low F, low M
}
rfm["segment_name"] = rfm["segment"].map(segment_map)

print(rfm.groupby("segment_name")["Monetary"].agg(["mean", "count"]))
# Champions:     mean=€842, count=612   ← protect these
# At Risk:       mean=€203, count=1847  ← win back
# Loyal:         mean=€398, count=1523  ← upsell
# New Customers: mean=€67,  count=1018  ← onboard
demand_forecasting_prophet.py
# ── Demand Forecasting with Prophet ──────────────────────────────
from prophet import Prophet
import pandas as pd

# Aggregate to daily store-level sales
daily = (
    df[df["store_id"] == "S01"]
    .groupby("date")["revenue"]
    .sum()
    .reset_index()
    .rename(columns={"date": "ds", "revenue": "y"})
)

# Fit Prophet model with custom seasonalities
m = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=True,
    changepoint_prior_scale=0.05,  # smoothness
)

# Add French public holiday effects
m.add_country_holidays(country_name="FR")
m.fit(daily)

# Forecast 90 days ahead
future = m.make_future_dataframe(periods=90)
forecast = m.predict(future)

# Calculate implied safety stock
forecast["safety_stock"] = (
    forecast["yhat_upper"] - forecast["yhat"]
) * 1.5  # service-level buffer

m.plot(forecast)             # → trend + uncertainty bands
m.plot_components(forecast)  # → weekly, yearly, holiday effects

💡 Pro tip: In retail DS interviews, you'll often be asked "how would you improve inventory turnover by 20%?" — the answer is a demand forecasting pipeline that feeds directly into reorder point calculations. Know your yhat_upper from your yhat.
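That forecast-to-reorder handoff boils down to one formula: reorder point = expected demand over the supplier lead time, plus a safety stock scaled by forecast uncertainty. A minimal sketch under standard textbook assumptions (the function and the numbers are illustrative):

```python
def reorder_point(daily_mean, daily_std, lead_time_days, z=1.65):
    """ROP = expected demand over the lead time + safety stock.

    z = 1.65 ≈ 95% service level, assuming normally distributed demand;
    safety stock scales with sqrt(lead time) for independent daily demand.
    """
    expected = daily_mean * lead_time_days
    safety = z * daily_std * lead_time_days ** 0.5
    return expected + safety

# e.g. forecast mean 120 units/day, std 30, supplier lead time 7 days
print(round(reorder_point(120, 30, 7)))  # → 971 units triggers the next order
```

In the Prophet pipeline above, `daily_mean` and `daily_std` would come from `yhat` and the width of the `yhat_lower`/`yhat_upper` band over the lead-time window.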

The Technical Stack You Need

Retail DS is a mix of classic statistics, modern ML, and a healthy dose of business acumen. Here's everything you need — and why.

Python Libraries
pandas numpy scikit-learn prophet mlxtend xgboost lightgbm plotly shap statsmodels
Supporting Tools
SQL (CTEs, window fns) Tableau / Power BI Streamlit MLflow Spark (large logs) DoWhy (causal) dbt Airflow
setup.sh
# ── One-command setup for your retail DS environment ─────────────
pip install pandas numpy scikit-learn prophet mlxtend \
            xgboost lightgbm plotly seaborn jupyter \
            shap statsmodels streamlit

# Verify installation
python -c "import prophet; import mlxtend; print('✓ Ready to go!')"

5 Portfolio Projects That Get You Hired

Don't just study retail DS — ship it. Each project below is designed to showcase a specific skill cluster to hiring managers at companies like Carrefour, Decathlon, FNAC, or any e-commerce scale-up.

Project 1 · Weeks 7–8
RFM Segmentation Dashboard
Build a full customer segmentation pipeline on the synthetic dataset: clean → RFM → K-Means → interactive Plotly/Streamlit dashboard with campaign recommendations per segment. Deploy on Streamlit Cloud for free.
Project 2 · Weeks 8–9
Sales Forecasting for Inventory
Predict weekly sales per store using Prophet + ARIMA comparison. Output a reorder calendar showing when each store should place its next order, and calculate the reduction in safety stock needed.
Project 3 · Week 9
Market Basket Analysis
Use Apriori / FP-Growth to find top association rules in synthetic transaction data. Visualise with a network graph and write a 1-page "bundle promotion brief" as if presenting to the merchandising team.
Project 4 · Weeks 10–11
Churn Prediction + CLV Calculator
Train an XGBoost churn model with SHAP explainability. Add a simple CLV calculator. Output a prioritised list of customers to target with retention offers, ranked by "expected saved revenue".
Project 5 · Week 12
Price Elasticity & Dynamic Pricing
Synthesise price × demand data with built-in elasticity. Fit a log-log regression to estimate price elasticity per category. Recommend optimal price points to maximise gross profit. This one will make you stand out.

The Best Public Datasets for Retail DS

These are the datasets used by the community — all free, all on Kaggle or UCI. Use them alongside your synthetic data to validate your models against real patterns.

🛒
Online Retail II (UCI)
Best for RFM analysis, customer segmentation, basket analysis. 1M+ transactions from a UK gift retailer 2009–2011.
kaggle.com → online-retail-dataset
๐Ÿช
Walmart Retail Dataset
Weekly sales for 45 stores + weather + markdowns. Perfect for time-series forecasting with external regressors.
kaggle.com → retaildataset
๐ŸŽ
Instacart Market Basket
3M+ orders from 200k users. The gold standard for association rules and recommendation engine building.
kaggle.com → instacart-market-basket
👗
Black Friday Sales
550k purchase records with demographic data. Great for segmentation and purchase propensity modelling.
kaggle.com → black-friday
๐Ÿ›️
Groceries Dataset
9,835 transactions from a grocery store. Perfect starter for Apriori / FP-Growth market basket analysis.
kaggle.com → groceries-dataset
🔢
Your Synthetic Data
Generated in this tutorial — fully customisable, privacy-safe, and shaped to include seasonality, promotions, and churn signals.
Use the code from Phase 1 above ↑

🇫🇷 French retail context: For those targeting Paris-based companies like Carrefour, Decathlon, or FNAC Darty — look for their publicly available annual reports to understand their KPI benchmarks. Add French holiday calendars (m.add_country_holidays('FR') in Prophet) to all your forecasting models — it matters more than you'd think.

Your 6-Phase Learning Roadmap

3–6 months at 10–20 hours per week. Track your progress by asking: "Which KPI did I improve this week? What € impact could my model have?"

Phase 1 · Wks 1–2
Retail Fundamentals
Value chain, data sources, business vocabulary. Learn to speak executive.
Phase 2 · Wk 3
Master the KPIs
All 12 KPIs with formulas, examples, and the DS action each one drives.
Phase 3 · Wks 4–6
ML Use Cases
8 core applications with full code. Replicate at least 3 on public data.
Phase 4 · Ongoing
Technical Stack
Python, SQL, BI tools, MLOps basics. Build the full toolkit.
Phase 5 · Wks 7–12
Portfolio Projects
5 deployable projects on GitHub + Streamlit. Proof of impact.
Phase 6 · Mo 4–6
Advanced & Job Prep
MLOps, GenAI, causal ML, interview prep, Paris retail network.

🎯 The final interview test: Can you answer "How would you improve inventory turnover by 20%?" — with a specific model, the data you'd need, the metric you'd track, and the business owner you'd present to? If yes, you're ready. If not, go back to Phase 2.
