Segment a retail catalog by Recency, Frequency, and Monetary value, then visualize the result as an interactive 3D scatter.
This notebook is fully self-contained — it loads the synthetic dataset directly from the project’s GitHub repository and runs end-to-end with no local setup. Click Open in Colab and go.
Standardize the three RFM dimensions to zero mean and unit variance (otherwise value, measured in thousands of euros, would swamp the others), then run k-means for a sweep of k values and look for the elbow:
Code
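from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import plotly.express as px

# Put all three dimensions on a comparable scale
features = StandardScaler().fit_transform(rfm[["recency", "frequency", "value"]])

# Fit k-means for k = 1..10 and record the within-cluster sum of squares
ks = range(1, 11)
inertias = [
    KMeans(n_clusters=k, random_state=42, n_init=10).fit(features).inertia_
    for k in ks
]

elbow = px.line(
    x=list(ks),
    y=inertias,
    markers=True,
    labels={"x": "Number of clusters", "y": "Within-cluster SS"},
    title="Elbow plot: pick where the curve bends",
)
elbow.update_layout(width=720, height=360)
elbow.show()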
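Reading the bend by eye is subjective. As a rough cross-check (a sketch, not part of the original notebook), one common heuristic picks the k where the discrete second difference of the inertia curve is largest. The helper name `elbow_by_second_difference` and the sample inertia values below are made up for illustration.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply.

    The discrete second difference
        inertias[i-1] - 2 * inertias[i] + inertias[i+1]
    is large where a steep drop is followed by a flat stretch.
    """
    if len(ks) != len(inertias) or len(ks) < 3:
        raise ValueError("need at least three (k, inertia) pairs of equal length")
    best_k, best_bend = None, float("-inf")
    # The first and last k have no neighbour on one side, so skip them.
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Invented inertia values for illustration only, not output from the notebook.
example_ks = list(range(1, 9))
example_inertias = [900.0, 600.0, 380.0, 150.0, 130.0, 115.0, 105.0, 98.0]
print(elbow_by_second_difference(example_ks, example_inertias))
```

With the real sweep, the call would be `elbow_by_second_difference(list(ks), inertias)`. The heuristic can be fooled by a very large first drop, so treat it as a second opinion on the plot rather than a replacement for it.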
The bend lands around k = 4. Fit that, then rank clusters from best (high frequency + value, low recency) to worst:
Code
import pandas as pd

K = 4
km = KMeans(n_clusters=K, random_state=42, n_init=25).fit(features)
rfm["cluster"] = km.labels_

# Centers in standardized space, combined into a single equal-weight score
centers = pd.DataFrame(
    km.cluster_centers_, columns=["recency_z", "frequency_z", "value_z"]
)
centers["score"] = (
    centers["frequency_z"] + centers["value_z"] - centers["recency_z"]
) / 3

# Rank 1 is the highest score
centers["rank"] = centers["score"].rank(ascending=False).astype(int)

# centers is indexed by cluster label, so map() translates label -> rank
rfm["rank"] = rfm["cluster"].map(centers["rank"])
centers.round(3)
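The centers above are expressed in standardized units, which are awkward to read. A quick way to sanity-check the ranking is to average the original columns per rank. The sketch below uses a small invented frame named `toy` as a stand-in for `rfm`; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `rfm` frame; the values are invented.
toy = pd.DataFrame({
    "recency": [1, 2, 10, 12, 5, 6],
    "frequency": [30, 28, 3, 2, 12, 14],
    "value": [9000, 8500, 400, 300, 2500, 2700],
    "rank": [1, 1, 3, 3, 2, 2],
})

# Mean of each RFM dimension per rank, in original units, plus group size.
profile = (
    toy.groupby("rank")[["recency", "frequency", "value"]]
    .mean()
    .assign(n_items=toy.groupby("rank").size())
    .sort_index()
)
print(profile)
```

If the ranking is sensible, frequency and value should fall and recency should rise as the rank number increases. Swapping `toy` for `rfm` gives the same table for the real clusters.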
The 3D view
Drag to rotate, scroll to zoom. Hover for the article name. Cluster numbers are ranked: rank 1 is the healthy core, rank 4 is the dying tail.
Code
plot_df = rfm.reset_index()
# String labels make plotly treat rank as a discrete category, not a color scale
plot_df["rank"] = plot_df["rank"].astype(str)

fig = px.scatter_3d(
    plot_df,
    x="recency",
    y="frequency",
    z="value",
    color="rank",
    hover_name="article_name",
    color_discrete_sequence=["#27ae60", "#3498db", "#f39c12", "#c0392b"],
    category_orders={"rank": ["1", "2", "3", "4"]},
    title="RFM space: each point is one product",
)
fig.update_traces(marker=dict(size=6, opacity=0.85))
fig.update_layout(
    scene=dict(
        xaxis_title="Recency (months since last sale)",
        yaxis_title="Frequency",
        zaxis_title="Monetary value (EUR)",
    ),
    width=820,
    height=600,
)
fig.show()
A shallow decision tree approximates the cluster boundaries with plain if/else thresholds, which is useful when the production system shouldn’t carry a clustering library:
Code
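from sklearn.tree import DecisionTreeClassifier, export_text

# Fit on the original (unscaled) columns so the thresholds read in real units
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(
    rfm[["recency", "frequency", "value"]], rfm["rank"]
)
print(export_text(
    tree,
    feature_names=["recency", "frequency", "value"],
    class_names=[str(i) for i in sorted(rfm["rank"].unique())],
))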
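Once the splits are printed, they can be transcribed by hand into a dependency-free function. The sketch below shows the shape such a function might take. The function name `assign_rank`, the order of the tests, and every threshold are placeholders, not values from the fitted tree.

```python
def assign_rank(recency: float, frequency: float, value: float) -> int:
    """Assign an RFM rank from hard-coded thresholds.

    All cut-offs here are hypothetical. Replace them with the splits
    printed by export_text for the fitted tree.
    """
    if recency > 9.5:       # not sold for a long time
        return 4
    if frequency <= 5.5:    # sold recently, but rarely
        return 3
    if value <= 4000.0:     # sells regularly at modest value
        return 2
    return 1                # recent, frequent, high value


print(assign_rank(recency=1, frequency=30, value=9000))
print(assign_rank(recency=12, frequency=20, value=9000))
```

A function like this needs no scikit-learn at runtime, at the cost of re-transcribing the thresholds whenever the clustering is refit.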
Takeaway
Four clusters, derived purely from the data, separate the catalog into a healthy core, a long middle, and a dying tail. The synthetic dataset was generated with declining trends for filing_cabinet and dvd_player, and both reliably end up in rank 4. The framework transfers to any entity you can summarize as (recency, frequency, monetary): customers, products, suppliers, content items.
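To illustrate the transfer to customers, here is a minimal sketch that builds the same three columns from a transaction log. The frame `tx`, its column names, and the snapshot date are all hypothetical, and recency is measured in days here rather than the months used for products above.

```python
import pandas as pd

# Hypothetical transaction log; column names and values are assumptions.
tx = pd.DataFrame({
    "customer_id": ["a", "a", "b", "c", "c", "c"],
    "date": pd.to_datetime([
        "2024-01-05", "2024-03-20", "2023-11-02",
        "2024-02-14", "2024-03-01", "2024-03-28",
    ]),
    "amount": [120.0, 80.0, 40.0, 15.0, 25.0, 30.0],
})
snapshot = pd.Timestamp("2024-04-01")  # the "as of" date for recency

# One row per customer: last purchase date, purchase count, total spend.
customer_rfm = tx.groupby("customer_id").agg(
    last_purchase=("date", "max"),
    frequency=("date", "count"),
    value=("amount", "sum"),
)
# Recency = days between the snapshot and the most recent purchase.
customer_rfm["recency"] = (snapshot - customer_rfm["last_purchase"]).dt.days
customer_rfm = customer_rfm[["recency", "frequency", "value"]]
print(customer_rfm)
```

The resulting frame has the same three columns as `rfm`, so the scaling, clustering, and tree steps should apply without changes.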
For the R version of this same analysis (with kmeans and rpart instead of sklearn), see the Quarto site chapter.