BCG-style Clustering — Market Share × Growth

The Boston Consulting Group product matrix is a 50-year-old framework for thinking about a product portfolio. Each product is plotted on two axes:

Market share — what fraction of total revenue does this product account for?
Growth rate — is the product gaining or losing ground over time?

The four quadrants give the framework its names: Stars (high share, high growth), Cash Cows (high share, low growth), Question Marks (low share, high growth), Dogs (low share, low growth).
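As a rough sketch of how the four names follow from the two axes, the quadrant lookup can be written as a small function. The name `bcg_quadrant` and the cut-off parameters `share_cut` and `growth_cut` are illustrative assumptions, not part of this chapter's code; where to draw the cut-offs is a modelling choice.

```python
def bcg_quadrant(share, growth, share_cut, growth_cut):
    """Return the BCG quadrant name for one product.

    share_cut and growth_cut are hypothetical thresholds that separate
    "high" from "low" on each axis; they are assumptions for illustration.
    """
    high_share = share > share_cut
    high_growth = growth > growth_cut
    if high_share and high_growth:
        return "Star"
    if high_share:
        return "Cash Cow"
    if high_growth:
        return "Question Mark"
    return "Dog"


# Made-up example values, with cut-offs at 5% share and 10% growth
print(bcg_quadrant(0.12, 0.25, share_cut=0.05, growth_cut=0.10))   # Star
print(bcg_quadrant(0.12, -0.02, share_cut=0.05, growth_cut=0.10))  # Cash Cow
print(bcg_quadrant(0.01, 0.90, share_cut=0.05, growth_cut=0.10))   # Question Mark
print(bcg_quadrant(0.01, -0.05, share_cut=0.05, growth_cut=0.10))  # Dog
```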

In this chapter we recreate the BCG positioning from data: we aggregate per-product revenue over the two halves of the date range, compute share and growth for each product, then cluster the resulting 2D positions with k-means.

Setup

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, plot_tree

sns.set_theme(style="whitegrid")
RANDOM_STATE = 42
RANK_PALETTE = ["#27ae60", "#3498db", "#f39c12", "#c0392b"]  # rank 1 → 4, used in chapters 03 + dashboard too

Compute share and growth

We split the data window at its midpoint and aggregate revenue per product in each half:

Code

from pathlib import Path
_data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv"
df = pd.read_csv(_data_path, sep=";", parse_dates=["date"])
SPLIT = df["date"].min() + (df["date"].max() - df["date"].min()) / 2
print(f"Window: {df['date'].min().date()} → {df['date'].max().date()}, split at {SPLIT.date()}")
df["period"] = np.where(df["date"] <= SPLIT, "p1", "p2")

period_rev = (
    df.groupby(["article_name", "period"])["gross_price"].sum()
      .unstack(fill_value=0.0)
      .rename(columns={"p1": "rev_p1", "p2": "rev_p2"})
)
period_rev["revenue"] = period_rev["rev_p1"] + period_rev["rev_p2"]
period_rev["share"]   = period_rev["revenue"] / period_rev["revenue"].sum()
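# Growth = relative change from the first half to the second half. Where
# rev_p1 is not positive the ratio is undefined, so fall back to +1.0
# (revenue only in the second half) or -1.0 (no revenue in the second half).
period_rev["growth"]  = np.where(
    period_rev["rev_p1"] > 0,
    (period_rev["rev_p2"] - period_rev["rev_p1"]) / period_rev["rev_p1"],
    np.where(period_rev["rev_p2"] > 0, 1.0, -1.0),
)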
period_rev = period_rev[period_rev["share"] > 0].copy()
period_rev[["rev_p1", "rev_p2", "revenue", "share", "growth"]].head()

Window: 2015-07-01 → 2017-06-30, split at 2016-06-30

period	rev_p1	rev_p2	revenue	share	growth
article_name
armchair	50344.23	67309.20	117653.43	0.046346	0.336979
bar_stool	5419.74	6112.46	11532.20	0.004543	0.127814
bed	139776.21	172176.40	311952.61	0.122885	0.231800
bookshelf	13434.71	12485.36	25920.07	0.010210	-0.070664
cabinet	17042.80	18182.22	35225.02	0.013876	0.066856
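The split date and the growth figures above can be checked by hand. The following is a minimal plain-Python sketch, assuming the same midpoint rule and the same fallback convention as the pandas code; the helper names `midpoint` and `period_growth` are hypothetical and only used for this check.

```python
from datetime import date


def midpoint(start, end):
    """Midpoint of a date window, using the min + (max - min) / 2 rule."""
    return start + (end - start) / 2


def period_growth(rev_p1, rev_p2):
    """Relative change from the first half to the second half.

    Mirrors the fallback convention of the np.where expression:
    +1.0 when only the second half has revenue, -1.0 otherwise.
    """
    if rev_p1 > 0:
        return (rev_p2 - rev_p1) / rev_p1
    return 1.0 if rev_p2 > 0 else -1.0


# Window printed above: 2015-07-01 to 2017-06-30 (730 days, so 365 per half)
split = midpoint(date(2015, 7, 1), date(2017, 6, 30))
print(split)  # 2016-06-30

# armchair row from the table above
growth = period_growth(50344.23, 67309.20)
print(round(growth, 6))  # 0.336979
```

Note that the +1.0 fallback caps a brand-new product at +100% growth, which can be lower than the growth of an existing product that more than doubled; that is a convention of this chapter, not a property of the data.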

A quick look at the distributions: both are heavily skewed, which we handle later with a signed cube-root transform when plotting:

Code

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(period_rev["share"],  bins=30, ax=axes[0], color="steelblue")
axes[0].set_title("Share (linear)"); axes[0].set_xlabel("share")
sns.histplot(period_rev["growth"], bins=30, ax=axes[1], color="indianred")
axes[1].set_title("Growth"); axes[1].set_xlabel("growth")
plt.tight_layout()
plt.show()

Figure 1: Raw share and growth distributions. Heavy tails on both.

Choosing k — elbow method

We standardize both features and run k-means for several values of k, plotting the within-cluster sum of squares:

Code

features = period_rev[["growth", "share"]].values
X = StandardScaler().fit_transform(features)

ks = range(1, 11)
inertias = [
    KMeans(n_clusters=k, random_state=RANDOM_STATE, n_init=10).fit(X).inertia_
    for k in ks
]
fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(list(ks), inertias, marker="o")
ax.set_xlabel("k"); ax.set_ylabel("Within-cluster SS")
plt.show()

Figure 2: Elbow plot. The bend around k=4 suggests four meaningful clusters.

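The elbow lands around k = 4, which lines up neatly with the four BCG quadrants. Keep in mind that k-means chooses clusters to minimize within-cluster variance, so the four clusters need not map one-to-one onto the four quadrants.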
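Reading an elbow by eye is subjective. One crude way to make it explicit is to look for the largest discrete second difference of the inertia curve, as sketched below. This heuristic is an assumption added for illustration, not something scikit-learn provides or this chapter relies on, and the inertia values are hypothetical rather than taken from the run above.

```python
def elbow_by_second_difference(ks, inertias):
    """Return the k at which the inertia curve bends most sharply.

    Uses the discrete second difference I[k-1] - 2*I[k] + I[k+1]; a large
    positive value means the drop in inertia slows down abruptly after k.
    The first and last k cannot be selected because they lack a neighbour.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


# Hypothetical inertia values for k = 1..7 (NOT from the fitted models)
ks = [1, 2, 3, 4, 5, 6, 7]
inertias = [100.0, 70.0, 45.0, 20.0, 17.0, 15.0, 13.5]
print(elbow_by_second_difference(ks, inertias))  # 4
```

On real curves the bend is often less pronounced, so this is best treated as a sanity check alongside the plot rather than a replacement for it.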

Fit the clusters

Code

K = 4
km = KMeans(n_clusters=K, random_state=RANDOM_STATE, n_init=10).fit(X)
period_rev["cluster"] = km.labels_

# Rank clusters by a simple "value" weight = ½·share + ½·growth (both standardized)
centers = pd.DataFrame(km.cluster_centers_, columns=["growth_z", "share_z"])
centers["weight"] = 0.5 * centers["growth_z"] + 0.5 * centers["share_z"]
centers["rank"]   = centers["weight"].rank(ascending=False).astype(int)

# Map cluster id → rank, keeping article_name as the index of period_rev
period_rev["rank"] = period_rev["cluster"].map(centers["rank"])
period_rev[["share", "growth", "cluster", "rank"]].sort_values("rank").head()

period	share	growth	cluster	rank
article_name
bed	0.122885	0.231800	3	1
dining_table	0.098139	-0.015335	3	1
mattress	0.114366	0.261963	3	1
sofa	0.163031	0.313443	3	1
ottoman	0.009584	2.003204	2	2
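The cluster ids returned by k-means are arbitrary labels, which is why the code above converts them into ranks. The ranking step can be traced by hand with a few lines of plain Python. The centre coordinates below are hypothetical values chosen for illustration, not the fitted `km.cluster_centers_`.

```python
# Hypothetical standardized centres: cluster id -> (growth_z, share_z)
centres = {0: (-0.4, -0.3), 1: (0.1, 0.2), 2: (2.5, -0.5), 3: (-0.1, 2.6)}

# Equal-weight "value" score, as in the chapter: 0.5 * growth_z + 0.5 * share_z
weights = {cid: 0.5 * g + 0.5 * s for cid, (g, s) in centres.items()}

# Highest weight gets rank 1
ordered = sorted(weights, key=weights.get, reverse=True)
rank_of = {cid: rank for rank, cid in enumerate(ordered, start=1)}
print(rank_of)  # {3: 1, 2: 2, 1: 3, 0: 4}
```

Because the two z-scores are simply averaged, a very high growth score can compensate for a low share score, so a cluster of small, fast-growing products may outrank a cluster of mid-sized, steady ones.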

The BCG plot

Heavy tails on share/growth would squash everything into one corner if plotted directly. We apply a signed cube-root transform for the visualization only; the clustering still runs on the standardized, untransformed features:

Code

def signed_cbrt(x):
    return np.sign(x) * np.cbrt(np.abs(x))

plot_df = period_rev.copy()
plot_df["share_t"]  = signed_cbrt(plot_df["share"])
plot_df["growth_t"] = signed_cbrt(plot_df["growth"])

# Density-aware visual properties: small dataset (synth, 40 items) keeps the
# original styling; large dataset (real, ~2,100 items) shrinks markers and
# fades them so the cluster colours stay readable through overlap.
n_items = len(plot_df)
marker_size  = 80 if n_items < 100 else 20
marker_alpha = 0.85 if n_items < 100 else 0.45
n_top_share  = 8 if n_items < 100 else 20
n_top_growth = 4 if n_items < 100 else 8

fig, ax = plt.subplots(figsize=(9, 6))
sns.scatterplot(
    data=plot_df, x="share_t", y="growth_t",
    hue="rank", palette=RANK_PALETTE, s=marker_size, alpha=marker_alpha,
    ax=ax, legend="full",
)
ax.axhline(plot_df["growth_t"].mean(), color="grey", lw=0.8, ls="--")
ax.axvline(plot_df["share_t"].mean(),  color="grey", lw=0.8, ls="--")

# Annotate the most informative points only — top by share (existing big
# players) and top by growth (rising stars). On large catalogues annotating
# everything would render the chart unreadable.
for _, r in plot_df.nlargest(n_top_share, "share").iterrows():
    ax.annotate(r.name, (r["share_t"], r["growth_t"]),
                xytext=(5, 5), textcoords="offset points", fontsize=7)
for _, r in plot_df.nlargest(n_top_growth, "growth").iterrows():
    ax.annotate(r.name, (r["share_t"], r["growth_t"]),
                xytext=(5, 5), textcoords="offset points", fontsize=7, color="darkgreen")

ax.set_xlabel("Market share (signed cube-root)")
ax.set_ylabel("Growth rate (signed cube-root)")
ax.legend(title="Cluster rank")
plt.tight_layout()
plt.show()

Figure 3: Products positioned by transformed share × growth. Quadrants split at the data-driven mean of each axis. Cluster numbers are ranked: 1 = top performers, 4 = lagging. On real data (~2,100 products) only the top performers are annotated to keep the plot readable; all points are still in the cluster fit.
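To see what the transform does to individual values, here is a small self-contained sketch using the same `signed_cbrt` definition as above. The input values are made up so that the cube roots are easy to verify by hand.

```python
import numpy as np


def signed_cbrt(x):
    """Cube root that keeps the sign of x (same definition as in the chapter)."""
    return np.sign(x) * np.cbrt(np.abs(x))


# Made-up inputs: perfect cubes, so the results are roughly -1, -0.5, 0, 0.5, 2
x = np.array([-1.0, -0.125, 0.0, 0.125, 8.0])
t = signed_cbrt(x)
print(t)

# The transform is monotonic: the ordering of the products is preserved
print(np.all(np.diff(t) > 0))
```

Values with magnitude below 1 are stretched away from zero and values above 1 are pulled in, which spreads out the crowded region near the origin. Since `np.cbrt` already returns the real cube root of negative inputs, `np.cbrt(x)` alone should give the same result; the explicit sign handling mainly makes the intent obvious.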

Who is in each cluster?

Code

# Sort by revenue so the per-cluster members list shows the most relevant
# items first instead of an alphabetical slice — important on real data
# where each cluster can hold hundreds of items.
period_rev_by_rev = period_rev.reset_index().sort_values("revenue", ascending=False)
summary = (
    period_rev_by_rev
    .groupby("rank")
    .agg(
        n_products=("article_name", "count"),
        mean_share=("share",  "mean"),
        mean_growth=("growth", "mean"),
        members=("article_name", lambda s: ", ".join(s.iloc[:6]) + ("…" if len(s) > 6 else "")),
    )
    .reset_index()
    .sort_values("rank")
)
summary

	rank	n_products	mean_share	mean_growth	members
0	1	4	0.124605	0.197968	sofa, bed, mattress, dining_table
1	2	5	0.005743	1.932112	ottoman, garden_table, garden_chair, parasol, ...
2	3	10	0.026046	0.398297	tv, armchair, coffee_table, dining_chair, ward...
3	4	21	0.010114	-0.058284	sideboard, office_chair, kitchen_table, desk, ...

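Reading the clusters in rank order, using the summary for the run shown above:

Rank 1 (far right): the four highest-share products (mean share ≈ 12.5%) with moderate growth (≈ +20%). These are the products to amplify.
Rank 2: low share (≈ 0.6%) but very high growth (≈ +193%). These are the rising bets.
Rank 3: mid-sized share (≈ 2.6%) with solid growth (≈ +40%).
Rank 4 (bottom left): low share (≈ 1.0%) and slightly shrinking (≈ -6%). End-of-life candidates.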

Deriving rules from the clusters

A shallow decision tree fitted on the raw (unstandardized) growth and share values, with the cluster rank as target, shows in plain language which thresholds approximate the clustering:

Code

tree = DecisionTreeClassifier(
    max_depth=3, random_state=RANDOM_STATE,
).fit(period_rev[["growth", "share"]], period_rev["rank"])

fig, ax = plt.subplots(figsize=(12, 6))
plot_tree(
    tree, feature_names=["growth", "share"], class_names=[str(i) for i in sorted(period_rev["rank"].unique())],
    filled=True, rounded=True, fontsize=9, ax=ax,
)
plt.tight_layout()
plt.show()

Figure 4: Decision tree learned to predict cluster rank from raw share and growth. Read each root-to-leaf path as a rule: ‘if growth ≤ X and share ≤ Y, then rank Z’.

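This tree could be deployed as a simple if/else in any system that needs to place a new product into the BCG framework, with no k-means re-fit required. Because the tree is capped at depth 3, it approximates the cluster boundaries and may not reproduce every assignment exactly.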
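A deployed version of such a tree might look like the sketch below. The function name `assign_rank` and every threshold in it are hypothetical placeholders chosen for illustration; they are not the splits of the fitted tree. The actual thresholds can be read off the plot, or printed as text with `sklearn.tree.export_text`.

```python
def assign_rank(share, growth):
    """Place a product into a rank using fixed thresholds.

    The thresholds below are HYPOTHETICAL placeholders, not the splits of
    the fitted tree; replace them with the values from the tree plot.
    """
    if share > 0.09:
        return 1  # large share
    if growth > 1.0:
        return 2  # small share, very fast growth
    if growth > 0.15:
        return 3  # moderate growth
    return 4      # flat or shrinking


# (share, growth) taken from the tables above: bed, ottoman, armchair
print(assign_rank(0.122885, 0.231800))  # 1
print(assign_rank(0.009584, 2.003204))  # 2
print(assign_rank(0.046346, 0.336979))  # 3
```

One caveat for deployment: share is a fraction of total revenue, so a new product's share has to be computed against the same revenue base that was used when the tree was fitted.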

Coarse view — BCG at Department level

The product-level (Family) plot above is the operational view: which products to amplify or rationalize. For executive reporting the more useful aggregation is Department: six store-floor sections instead of dozens of products. Six points are too few for k-means to be meaningful (an elbow plot over six points says little), so we simply compute share and growth per department and read the quadrants directly.

Code

df_dept = df[df["department"].astype(str) != ""].copy()
period_rev_dept = (
    df_dept.groupby(["department", "period"])["gross_price"].sum()
           .unstack(fill_value=0.0)
           .rename(columns={"p1": "rev_p1", "p2": "rev_p2"})
)
period_rev_dept["revenue"] = period_rev_dept["rev_p1"] + period_rev_dept["rev_p2"]
period_rev_dept["share"]   = period_rev_dept["revenue"] / period_rev_dept["revenue"].sum()
period_rev_dept["growth"]  = np.where(
    period_rev_dept["rev_p1"] > 0,
    (period_rev_dept["rev_p2"] - period_rev_dept["rev_p1"]) / period_rev_dept["rev_p1"],
    np.where(period_rev_dept["rev_p2"] > 0, 1.0, -1.0),
)
period_rev_dept = period_rev_dept[period_rev_dept["share"] > 0].copy()

fig, ax = plt.subplots(figsize=(8, 5.5))
ax.scatter(period_rev_dept["share"], period_rev_dept["growth"],
           s=180, color="#3498db", edgecolor="white", linewidth=2)
for dept, row in period_rev_dept.iterrows():
    ax.annotate(dept, (row["share"], row["growth"]),
                xytext=(8, 4), textcoords="offset points", fontsize=10, fontweight="bold")
ax.axhline(period_rev_dept["growth"].mean(), color="grey", lw=0.8, ls="--")
ax.axvline(period_rev_dept["share"].mean(),  color="grey", lw=0.8, ls="--")
ax.set_xlabel("Market share (linear)")
ax.set_ylabel("Growth rate (period-over-period)")
plt.tight_layout()
plt.show()

period_rev_dept[["revenue", "share", "growth"]].round({"revenue": 0, "share": 3, "growth": 3})

Figure 5: Same share × growth space, aggregated to Department. Quadrant lines at the cross-department mean. Useful for budget conversations: which sections of the floor are growing, which are mature.

period	revenue	share	growth
Bedroom	769246.0	0.303	0.276
Dining	540159.0	0.213	-0.045
Living	956087.0	0.377	0.301
Office	158775.0	0.063	-0.173
Outdoor	45717.0	0.018	1.921
Storage	68596.0	0.027	0.133

Department-level readings are coarser by design — you lose the per-product detail but gain a cleaner narrative for stakeholders who don’t want to parse a 40-point scatter plot.
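Reading the quadrants directly amounts to comparing each department with the two mean lines. A minimal sketch follows, using the rounded share and growth values from the table above; the helper name `quadrant` is an assumption made for illustration.

```python
from statistics import mean

# (share, growth) per department, copied from the table above
departments = {
    "Bedroom": (0.303, 0.276),
    "Dining": (0.213, -0.045),
    "Living": (0.377, 0.301),
    "Office": (0.063, -0.173),
    "Outdoor": (0.018, 1.921),
    "Storage": (0.027, 0.133),
}

# Quadrant lines at the cross-department mean, as in the plot
share_cut = mean(s for s, _ in departments.values())
growth_cut = mean(g for _, g in departments.values())


def quadrant(share, growth):
    """Name the quadrant relative to the two mean lines."""
    if share > share_cut:
        return "Star" if growth > growth_cut else "Cash Cow"
    return "Question Mark" if growth > growth_cut else "Dog"


labels = {dept: quadrant(s, g) for dept, (s, g) in departments.items()}
for dept, label in labels.items():
    print(f"{dept:8s} {label}")
```

With only six points the mean is sensitive to outliers: Outdoor's growth of about +192% pulls the mean growth line up to roughly 0.40, so Bedroom and Living land below it despite growing by about 28% and 30%. A median-based cut would place both above the line; which choice is more appropriate is a judgment call.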

Takeaway

The k = 4 clustering recovers a BCG-style segmentation directly from the data, and the assignments line up with the temporal trends we engineered into the synthetic data: growing products land in the top-right, declining ones in the bottom-left, and mature high-volume products sit as cash cows in the middle-right.

For a real retailer, the actionable read is: don’t spread marketing budget evenly across the catalog. Lean into Stars and rising Question Marks, milk Cash Cows for the cash they generate, and rationalize the Dogs (delist or replace).