Retail Customer & Basket Analysis
End-to-end customer and product analytics on a synthetic retail transaction dataset, mixing classic and modern techniques across R and Python — from market basket analysis and BCG/RFM segmentation through probabilistic CLV to survival, forecasting, embeddings, and causal uplift.
Chapters
- Data Audit — catalog hygiene, bundle tagging, and triage framework that runs before any analysis.
- Association Rules — market basket analysis with arules (R) and mlxtend (Python), side by side.
- BCG-style Clustering — products positioned by market share × growth (Python / scikit-learn).
- RFM Clustering — articles segmented by Recency, Frequency, Monetary value (R / kmeans + interactive 3D with plotly).
- Customer Lifetime Value — probabilistic CLV with BG/NBD + Gamma-Gamma (Python / lifetimes).
- Survival Analysis — Kaplan-Meier and Cox PH on time-to-first-repeat (Python / lifelines).
- Demand Forecasting — monthly revenue per product group with naive, ETS and SARIMA baselines (Python / statsmodels).
- Product Embeddings — co-occurrence × PPMI × SVD for substitution lookup and semantic clustering (Python / scikit-learn).
- Causal Uplift — meta-learner CATE estimation: did the discount cause the repeat purchase? (Python / scikit-learn).
- Insights — synthesis of the upstream chapters with cross-cuts and headline recommendations.
Plus a dashboard view of the same findings, and a standalone Python notebook that replays the RFM analysis in Colab.
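To make the "co-occurrence × PPMI × SVD" pipeline from the Product Embeddings chapter concrete, here is a minimal sketch. It is not the chapter's code: it uses NumPy's dense SVD rather than scikit-learn, and the function name `ppmi_svd_embeddings`, the toy baskets, and every product name other than `dining_table` and `dining_chair` are assumptions made for illustration.

```python
import numpy as np


def ppmi_svd_embeddings(baskets, dim=2):
    """Return (items, vectors) with one embedding row per item."""
    items = sorted({item for basket in baskets for item in basket})
    index = {item: i for i, item in enumerate(items)}

    # Symmetric co-occurrence counts: how often two items share a basket.
    cooc = np.zeros((len(items), len(items)))
    for basket in baskets:
        unique = sorted(set(basket))
        for a in unique:
            for b in unique:
                if a != b:
                    cooc[index[a], index[b]] += 1

    total = cooc.sum()
    row = cooc.sum(axis=1, keepdims=True)
    col = cooc.sum(axis=0, keepdims=True)

    # PMI = log(p(a, b) / (p(a) * p(b))). Negative or undefined cells
    # are clipped to 0, which gives the "positive" PMI matrix.
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(cooc * total / (row * col))
        ppmi = np.where(np.isfinite(pmi) & (pmi > 0), pmi, 0.0)

    # Truncated SVD: keep the top `dim` singular directions.
    u, s, _ = np.linalg.svd(ppmi)
    vectors = u[:, :dim] * np.sqrt(s[:dim])
    return items, vectors


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Toy baskets (hypothetical, not drawn from the synthetic dataset).
toy_baskets = [
    ["dining_table", "dining_chair"],
    ["dining_table", "dining_chair", "table_lamp"],
    ["desk", "office_chair"],
    ["desk", "office_chair", "table_lamp"],
]

items, vectors = ppmi_svd_embeddings(toy_baskets, dim=2)
lookup = dict(zip(items, vectors))
print(cosine(lookup["dining_table"], lookup["dining_chair"]))
```

Nearest neighbours by cosine similarity are one plausible way to do the substitution lookup. On a catalog of 66 SKUs a dense SVD is cheap; a sparse truncated SVD would be the more natural choice for larger catalogs.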
Data overview
The dataset is synthesized to mirror a two-year transaction log from a furniture / home-goods retailer.
| Property | Value |
|---|---|
| Rows | ~6,400 line items |
| Transactions | ~3,500 baskets |
| Customers | 2,400 (heavy-tailed transaction count, BG/NBD-shaped) |
| Articles | 66 SKUs across 40 distinct product names |
| Date range | 2015-07-01 to 2017-06-30 |
| Categories | 10 (dining, living, bedroom, office, lighting, electronics, decor, storage, kitchen, outdoor) |
The synthesis encodes co-purchase patterns (e.g. dining_table → dining_chair at 85%), temporal trends (some articles grow over the window, others decline), and customer-level structure (lifetime + transaction rate per customer, BG/NBD-style). Full schema and design rationale: docs/DATA_SPEC.md.
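As a rough illustration of what a co-purchase figure like "dining_table → dining_chair at 85%" means in association-rule terms, the sketch below computes support, confidence, and lift for a single rule. It assumes the 85% is a rule confidence, i.e. the share of baskets containing the antecedent that also contain the consequent; docs/DATA_SPEC.md is the authority on the actual definition. The helper `rule_metrics` and the basket counts are hypothetical, chosen only so the toy confidence comes out at 0.85.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    sets = [set(b) for b in baskets]
    n = len(sets)
    n_a = sum(antecedent in s for s in sets)
    n_c = sum(consequent in s for s in sets)
    n_both = sum(antecedent in s and consequent in s for s in sets)

    support = n_both / n                                  # P(A and C)
    confidence = n_both / n_a if n_a else float("nan")    # P(C | A)
    lift = confidence / (n_c / n) if n_c else float("nan")  # P(C | A) / P(C)
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy baskets (hypothetical counts, not taken from the dataset):
# 20 baskets contain a dining_table, 17 of which also contain a dining_chair.
toy_baskets = (
    [["dining_table", "dining_chair"]] * 17
    + [["dining_table"]] * 3
    + [["dining_chair"]] * 8
    + [["table_lamp"]] * 72
)

metrics = rule_metrics(toy_baskets, "dining_table", "dining_chair")
print(", ".join(f"{k}={v:.2f}" for k, v in metrics.items()))
# prints: support=0.17, confidence=0.85, lift=3.40
```

A lift well above 1 indicates the pair co-occurs more often than independent purchases would predict, which is the kind of signal the Association Rules chapter looks for.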
For the methodological roadmap and notes on lower-priority deferred items, see docs/ROADMAP.md.
Reproducing locally
# Generate the deterministic synthetic dataset (seed=42)
python scripts/generate_synthetic_data.py
# Render the site
quarto render
quarto preview # auto-reload during editing