Retail Customer & Basket Analysis

End-to-end customer and product analytics on a synthetic retail transaction dataset, mixing classic and modern techniques across R and Python — from market basket analysis and BCG/RFM segmentation through probabilistic CLV to survival, forecasting, embeddings, and causal uplift.

Chapters

  1. Data Audit — catalog hygiene, bundle tagging, and a triage framework that runs before any analysis.
  2. Association Rules — market basket analysis with arules (R) and mlxtend (Python), side by side.
  3. BCG-style Clustering — products positioned by market share × growth (Python / scikit-learn).
  4. RFM Clustering — articles segmented by Recency, Frequency, Monetary value (R / kmeans + interactive 3D with plotly).
  5. Customer Lifetime Value — probabilistic CLV with BG/NBD + Gamma-Gamma (Python / lifetimes).
  6. Survival Analysis — Kaplan-Meier and Cox PH on time-to-first-repeat (Python / lifelines).
  7. Demand Forecasting — monthly revenue per product group with naive, ETS and SARIMA baselines (Python / statsmodels).
  8. Product Embeddings — co-occurrence × PPMI × SVD for substitution lookup and semantic clustering (Python / scikit-learn).
  9. Causal Uplift — meta-learner CATE estimation: did the discount cause the repeat purchase? (Python / scikit-learn).
  10. Insights — synthesis of the upstream chapters with cross-cuts and headline recommendations.
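
As an illustration of the middle step in chapter 8 (co-occurrence, then PPMI, then SVD), here is a minimal pure-Python sketch of PPMI computed from basket co-occurrence. It is not the chapter's implementation, which uses scikit-learn. The function name, the per-basket probability estimate, and the `desk_lamp` item are assumptions made for the example, and the SVD step is left out.

```python
import math
from collections import Counter
from itertools import combinations


def ppmi_from_baskets(baskets):
    """Return {(a, b): ppmi} for every unordered item pair seen in a basket.

    Probabilities are estimated as the share of baskets containing the item
    (or the pair); this estimator is an assumption of the sketch.
    """
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # dedupe, fix pair ordering
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    n = len(baskets)
    scores = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) )
        pmi = math.log((n_ab / n) / ((item_counts[a] / n) * (item_counts[b] / n)))
        # "Positive" PMI clips negative associations to zero
        scores[(a, b)] = max(pmi, 0.0)
    return scores


# Toy baskets (illustrative only, not drawn from the dataset)
baskets = [
    ["dining_table", "dining_chair"],
    ["dining_table", "dining_chair"],
    ["desk_lamp", "dining_chair"],
    ["desk_lamp"],
]
scores = ppmi_from_baskets(baskets)
```

Pairs that co-occur more often than independence would predict get a positive score; pairs at or below that level are clipped to zero, which keeps the matrix sparse before factorization.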

Plus a dashboard view of the same findings, and a standalone Python notebook that replays the RFM analysis in Colab.
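
For readers new to RFM, the sketch below shows one way the three per-article values could be derived from line items before any clustering. It is a pure-Python illustration under stated assumptions, not the chapter's R code: the tuple layout, the `rfm_table` name, the amounts, and the choice of the last day of the date range as the reference date are all hypothetical.

```python
from datetime import date


def rfm_table(line_items, as_of):
    """Compute Recency, Frequency, Monetary per article.

    line_items: iterable of (article, transaction_id, purchase_date, amount).
    This layout is assumed for the sketch; see docs/DATA_SPEC.md for the
    actual schema.
    """
    acc = {}
    for article, txn_id, day, amount in line_items:
        row = acc.setdefault(article, {"last": day, "txns": set(), "monetary": 0.0})
        row["last"] = max(row["last"], day)   # most recent sale
        row["txns"].add(txn_id)               # distinct baskets
        row["monetary"] += amount             # total revenue
    return {
        article: {
            "recency_days": (as_of - row["last"]).days,
            "frequency": len(row["txns"]),
            "monetary": row["monetary"],
        }
        for article, row in acc.items()
    }


# Toy line items (illustrative values only)
line_items = [
    ("dining_table", "T1", date(2017, 6, 1), 400.0),
    ("dining_chair", "T1", date(2017, 6, 1), 60.0),
    ("dining_chair", "T2", date(2017, 6, 20), 120.0),
]
rfm = rfm_table(line_items, as_of=date(2017, 6, 30))
```

The three columns would typically be scaled before k-means so that monetary value does not dominate the distance metric.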

Data overview

The dataset is synthesized to mirror a two-year transaction log from a furniture / home-goods retailer.

Rows: ~6,400 line items
Transactions: ~3,500 baskets
Customers: 2,400 (heavy-tailed transaction count, BG/NBD-shaped)
Articles: 66 SKUs across 40 distinct product names
Date range: 2015-07-01 to 2017-06-30
Categories: 10 (dining, living, bedroom, office, lighting, electronics, decor, storage, kitchen, outdoor)

The synthesis encodes co-purchase patterns (e.g. dining_table → dining_chair at 85%), temporal trends (some articles grow over the window, others decline), and customer-level structure (lifetime + transaction rate per customer, BG/NBD-style). Full schema and design rationale: docs/DATA_SPEC.md.
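
To make the co-purchase figure concrete, the sketch below computes support, confidence, and lift for a single rule over toy baskets. It reads the 85% as rule confidence, that is, the share of baskets containing dining_table that also contain dining_chair; that reading is an assumption, as are the `rule_metrics` helper, the basket counts, and the `desk_lamp` item.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent in b)
    n_c = sum(1 for b in baskets if consequent in b)
    n_ac = sum(1 for b in baskets if antecedent in b and consequent in b)

    support = n_ac / n                            # P(A and C)
    confidence = n_ac / n_a if n_a else 0.0       # P(C | A)
    lift = confidence / (n_c / n) if n_c else 0.0 # P(C | A) / P(C)
    return support, confidence, lift


# Toy baskets built so that 17 of 20 table baskets also hold a chair
baskets = (
    [{"dining_table", "dining_chair"}] * 17
    + [{"dining_table"}] * 3
    + [{"dining_chair"}] * 13
    + [{"desk_lamp"}] * 7
)
support, confidence, lift = rule_metrics(baskets, "dining_table", "dining_chair")
```

Here confidence comes out at 0.85 while lift is only about 1.13, because chairs are common in the toy data overall. High confidence on its own does not imply a strong association.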

For the methodological roadmap and notes on lower-priority deferred items, see docs/ROADMAP.md.

Reproducing locally

# Generate the deterministic synthetic dataset (seed=42)
python scripts/generate_synthetic_data.py

# Render the site
quarto render
quarto preview          # auto-reload during editing