Product Embeddings — PPMI × SVD at SKU Level

The association-rules chapter answered “if A is in the basket, what else tends to be there?” — but only for pairs that actually co-occur often enough to clear a support threshold. Two products that share similar shopping contexts without ever being bought together (think: a harmony sofa and a milano sofa) won’t surface as a rule, yet they’re functionally close — substitutes, not complements.

Embeddings give every product a position in a continuous space such that contextually similar items end up near each other. Once you have that, similarity queries, a deployable substitution table, and a 2D map of the catalog come essentially for free; all three are demonstrated below.

The classical approach is word2vec (skip-gram with negative sampling). We use a mathematically equivalent formulation: PPMI × Truncated SVD.

Reference: Levy & Goldberg (2014) showed that word2vec with negative sampling is an implicit factorization of a shifted PPMI matrix. PPMI + SVD makes that factorization explicit, which is more transparent and avoids the Python-3.14 wheel-build issues that kept us from gensim.
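Concretely, their result: skip-gram with k negative samples implicitly factorizes a matrix of PMI values shifted by log k, and clipping at zero yields the shifted PPMI matrix. We use the unshifted case (k = 1) below:

\[M_{ij} = \max\big(\text{PMI}(i, j) - \log k,\; 0\big)\]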

Why SKU level here?

Most of the other chapters operate at Family level (article_name: bed, mattress, sofa). For embeddings we deliberately step down to SKU level (article_id: B3001, B3002, L1001). Reason: the most useful application of product embeddings is out-of-stock substitution, and that’s a SKU-level decision.

Knowing that “if bed is out, suggest mattress” isn’t useful — the customer who wants a bed doesn’t want a mattress. Knowing that “if SKU B3001 (bed harmony, €1099) is out, the closest alternatives are B3002 (bed milano, €1599) and B3003 (bed kompakt, €749)” is the right answer.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity

sns.set_theme(style="whitegrid")
RANDOM_STATE = 42

from pathlib import Path

_raw = Path("data/raw/transactions.csv")  # prefer real data; fall back to synthetic
_data_path = _raw if _raw.exists() else Path("data/synthetic/transactions.csv")
df = pd.read_csv(_data_path, sep=";", parse_dates=["date"])

# Defensive: ensure article_id / article_name / model are strings without NaN
# so set() and string concat don't trip on real-data quirks (unmapped families
# leave article_name="", missing models leave model="").
df = df[df["article_id"].notna() & (df["article_id"].astype(str).str.strip() != "")].copy()
df["article_id"]   = df["article_id"].astype(str).str.strip()
df["article_name"] = df["article_name"].fillna("").astype(str).str.strip()
df["model"]        = df["model"].fillna("").astype(str).str.strip()

# Lookup: article_id -> human-readable "name (model)" for display
label_lookup = (
    df.drop_duplicates("article_id")
      .assign(label=lambda d: d["article_name"] + " (" + d["model"] + ")")
      .set_index("article_id")["label"].to_dict()
)
group_lookup = (
    df.drop_duplicates("article_id")
      .assign(product_group=lambda d: d["product_group"].fillna("").astype(str))
      .set_index("article_id")["product_group"].to_dict()
)
family_lookup = df.drop_duplicates("article_id").set_index("article_id")["article_name"].to_dict()

Step 1 — Co-occurrence matrix (at SKU level)

Each basket becomes a “sentence” of SKUs. If a customer buys two bed SKUs in the same basket, they’re treated as distinct items — different from family-level analysis.

Code
baskets = df.groupby("transaction_id")["article_id"].apply(lambda s: list(set(s))).tolist()

items = sorted({a for b in baskets for a in b})
ix = {a: i for i, a in enumerate(items)}
n = len(items)

co = np.zeros((n, n), dtype=np.int64)
item_count = np.zeros(n, dtype=np.int64)
for b in baskets:
    idxs = [ix[a] for a in b]
    for i in idxs:
        item_count[i] += 1
    for i in idxs:
        for j in idxs:
            if i != j:
                co[i, j] += 1

print(f"SKUs:    {n}")
print(f"Baskets: {len(baskets)}")
print(f"Pairwise co-occurrences (off-diagonal nonzero): {(co > 0).sum() // 2} pairs")
SKUs:    66
Baskets: 3515
Pairwise co-occurrences (off-diagonal nonzero): 1462 pairs
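The Python triple loop is fine at 66 SKUs but does quadratic work per basket. For larger catalogs, a minimal vectorized sketch (assuming scipy, which this chapter doesn’t otherwise use): build a sparse basket × SKU indicator matrix and let its Gram matrix do the counting.

Code
from scipy import sparse

rows = [bi for bi, b in enumerate(baskets) for _ in b]
cols = [ix[a] for b in baskets for a in b]
B = sparse.csr_matrix((np.ones(len(rows)), (rows, cols)),
                      shape=(len(baskets), n))
co_fast = (B.T @ B).toarray().astype(np.int64)
np.fill_diagonal(co_fast, 0)   # the diagonal is just item_count; drop it
assert (co_fast == co).all()   # identical to the loop version above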

Step 2 — PPMI (Positive Pointwise Mutual Information)

Raw counts overweight popular SKUs. PMI corrects for this by comparing observed joint probability with what we’d expect under independence:

\[\text{PMI}(i, j) = \log \frac{P(i, j)}{P(i)\, P(j)}\]

PPMI clips negative PMI values to zero: pairs that co-occur less often than chance carry mostly noise at this dataset size.
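For intuition, a worked example with made-up round numbers: if two SKUs each appear in 300 of the 3,515 baskets and share 50 of them,

\[\text{PMI} = \log \frac{50/3515}{(300/3515)^2} = \log \frac{0.0142}{0.0073} \approx 0.67,\]

i.e. the pair co-occurs roughly twice as often as independence would predict.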

Code
total_baskets = len(baskets)
P_ij = co / total_baskets
P_i  = item_count / total_baskets
expected = np.outer(P_i, P_i)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log(P_ij / expected)
    pmi[~np.isfinite(pmi)] = 0
ppmi = np.maximum(pmi, 0)

ppmi_df = pd.DataFrame(ppmi, index=items, columns=items)
print(f"PPMI matrix shape: {ppmi.shape}")
print(f"Nonzero entries:    {(ppmi > 0).sum()}")
print(f"PPMI range:         {ppmi.min():.2f}{ppmi.max():.2f}")
PPMI matrix shape: (66, 66)
Nonzero entries:    832
PPMI range:         0.00 — 2.86

Step 3 — Truncated SVD → low-rank embeddings

The 66×66 PPMI matrix is mostly zero. Truncated SVD compresses it to a low-rank approximation; each SKU gets a k-dimensional vector summarizing its co-occurrence neighborhood:

\[\text{PPMI} \approx U_k \Sigma_k V_k^\top, \qquad \text{embedding}_i = (U_k \Sigma_k)_i\]

(TruncatedSVD’s fit_transform returns the rows of U_k Σ_k; the symmetrically weighted U_k √Σ_k from Levy & Goldberg is a common alternative.)

We use k = 24 — slightly higher than the family-level analysis because we have more SKUs.

Code
svd = TruncatedSVD(n_components=24, random_state=RANDOM_STATE)
emb = svd.fit_transform(ppmi)
print(f"Embedding shape: {emb.shape}")
print(f"Variance kept by first 5 / 12 / 24 components: "
      f"{svd.explained_variance_ratio_[:5].sum():.0%} / "
      f"{svd.explained_variance_ratio_[:12].sum():.0%} / "
      f"{svd.explained_variance_ratio_.sum():.0%}")
Embedding shape: (66, 24)
Variance kept by first 5 / 12 / 24 components: 44% / 77% / 94%
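The print above only looks at the chosen k. To check whether components beyond 24 still add much, one quick probe, sketched here with a throwaway 48-component fit (svd_probe and the cap of 48 are ad-hoc choices):

Code
svd_probe = TruncatedSVD(n_components=48, random_state=RANDOM_STATE)
svd_probe.fit(ppmi)
cum = np.cumsum(svd_probe.explained_variance_ratio_)
for k in (24, 36, 48):
    print(f"k={k}: keeps {cum[k - 1]:.0%} of PPMI variance")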

Similarity queries — SKU level

Cosine similarity between two SKU vectors is our “how alike” score. With SKU-level embeddings, the queries get concrete:
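For two embedding vectors u and v it is the length-normalized dot product, so only a vector’s direction matters, not its magnitude:

\[\text{sim}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \in [-1, 1]\]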

Code
sim = cosine_similarity(emb)
sim_df = pd.DataFrame(sim, index=items, columns=items)

def neighbors(sku, top_k=5):
    s = sim_df[sku].drop(sku).sort_values(ascending=False).head(top_k)
    return pd.DataFrame({
        "sku":   s.index,
        "label": [label_lookup[i] for i in s.index],
        "cosine_sim": s.values.round(3),
    })

def query(sku):
    print(f"\n→ '{sku}' = {label_lookup[sku]}")
    print(neighbors(sku).to_string(index=False))

# Pick representative SKUs dynamically — one popular SKU from each of the
# top product families. Works on both synthetic (66 SKUs) and real (3,500+ SKUs).
top_families = (
    df[df["article_id"].isin(items)]
      .groupby("article_name")["article_id"]
      .apply(lambda s: s.value_counts().index[0])  # most popular SKU per family
      .reset_index()
      .merge(df.groupby("article_name").size().rename("n").reset_index(), on="article_name")
      .sort_values("n", ascending=False)
      .head(6)
)
sample_skus = top_families["article_id"].tolist()
for sku in sample_skus:
    query(sku)

→ 'D2010' = dining_chair (oak)
  sku                    label  cosine_sim
D2011    dining_chair (fabric)       0.984
D2012   dining_chair (leather)       0.900
D2020          sideboard (oak)       0.838
D2030    table_extension (oak)       0.826
D2031 table_extension (walnut)       0.781

→ 'B3010' = mattress (comfort)
  sku                label  cosine_sim
B3011   mattress (premium)       0.994
B3030 nightstand (harmony)       0.870
B3020  headboard (harmony)       0.793
B3031 nightstand (kompakt)       0.773
G5001    floor_lamp (luna)       0.565

→ 'L1001' = sofa (harmony)
  sku              label  cosine_sim
L1003     sofa (kompakt)       0.958
L1002      sofa (milano)       0.957
L1010 armchair (harmony)       0.472
L1050       rug (berber)       0.442
L1020 coffee_table (oak)       0.353

→ 'L1020' = coffee_table (oak)
  sku                label  cosine_sim
L1011      armchair (luna)       0.932
L1021 coffee_table (glass)       0.928
L1010   armchair (harmony)       0.902
L1050         rug (berber)       0.897
L1051         rug (shaggy)       0.814

→ 'B3001' = bed (harmony)
  sku                label  cosine_sim
B3003        bed (kompakt)       0.978
B3002         bed (milano)       0.966
B3030 nightstand (harmony)       0.440
B3020  headboard (harmony)       0.426
G5001    floor_lamp (luna)       0.422

→ 'B3030' = nightstand (harmony)
  sku                label  cosine_sim
B3031 nightstand (kompakt)       0.877
B3010   mattress (comfort)       0.870
B3011   mattress (premium)       0.853
B3020  headboard (harmony)       0.833
G5001    floor_lamp (luna)       0.517

Reading these:

  • L1001 (sofa harmony) — closest neighbors are other sofa SKUs and adjacent living-room items (armchair, coffee_table). The substitution candidate for a stocked-out harmony sofa is the milano or kompakt sofa — exactly the answer a real OOS UI needs.
  • B3001 (bed harmony) — neighbors include the bed system components (mattress, headboard, nightstand) AND alternative bed SKUs (B3002, B3003). Both axes — substitutes and complements — show up in the same neighborhood, ranked by similarity.
  • D2010 (dining_chair oak) — neighbors are the other dining_chair variants plus the engineered co-purchase partners (sideboard, table_extension). The substitution table below shows the same pattern for D2001 (dining_table oak): alternative dining-table SKUs plus the sideboard.

This is the killer use case for SKU-level embeddings that the family-level analysis can’t deliver: same-family SKU substitution.
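If the OOS panel should only ever offer true substitutes, a minimal sketch that reuses the neighbors helper from above and filters the ranked neighborhood down to the same family:

Code
def same_family_substitutes(sku, top_k=3):
    fam = family_lookup[sku]
    ranked = neighbors(sku, top_k=len(items) - 1)  # every other SKU, ranked by similarity
    return ranked[ranked["sku"].map(family_lookup) == fam].head(top_k)

# e.g. same_family_substitutes("L1001") → the other sofa SKUs, best match first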

2D visualization with t-SNE

We reduce the 24-dimensional embeddings to 2D with t-SNE and color each point by its SKU’s product_group:

Code
groups = [group_lookup[s] for s in items]
families = [family_lookup[s] for s in items]

# Slightly higher perplexity than the family-level chart since we have more points
tsne = TSNE(n_components=2, perplexity=12, random_state=RANDOM_STATE,
            init="pca", learning_rate="auto")
emb_2d = tsne.fit_transform(emb)

fig, ax = plt.subplots(figsize=(12, 8))
unique_groups = sorted(set(groups))
palette = sns.color_palette("tab10", n_colors=len(unique_groups))
group_color = dict(zip(unique_groups, palette))

for g in unique_groups:
    mask = np.array([gg == g for gg in groups])
    ax.scatter(emb_2d[mask, 0], emb_2d[mask, 1], s=110,
               color=group_color[g], label=g, edgecolor="white", linewidth=1.2)

# Annotate with family name (avoids cluttered "B3001"-style labels for 66 points)
for i in range(n):
    ax.annotate(families[i], (emb_2d[i, 0], emb_2d[i, 1]),
                fontsize=7, xytext=(4, 4), textcoords="offset points", alpha=0.75)

ax.set_xlabel("t-SNE 1")
ax.set_ylabel("t-SNE 2")
ax.legend(title="Product group", fontsize=8, loc="upper left", bbox_to_anchor=(1.0, 1.0))
plt.tight_layout()
plt.show()
Figure 1: t-SNE projection of the 66-SKU embeddings, colored by product group. Same-family SKUs cluster tightly; product groups form larger neighborhoods. The algorithm never saw the labels — it learned the structure from co-purchase context alone.

Two structures emerge:

  1. Same-family SKU clusters — the three bed SKUs sit close together; the three sofa SKUs do too. The embedding learned this from “buyers of these SKUs purchased similar other things,” not from any shared label.
  2. Category neighborhoods — bedroom-related SKUs occupy one region, dining items another. Even at SKU granularity, the catalog hierarchy emerges from pure co-purchase data.

Substitution table — deployable artifact

For every SKU, store its top-3 nearest neighbors. This is the table that goes straight into an OOS UI panel or an inventory substitution rule.

Code
def fmt_neighbor(sku, rank):
    n = neighbors(sku, 3).iloc[rank]
    return f"{n['sku']} {n['label']}"

substitutions = pd.DataFrame({
    "if_oos_sku":    items,
    "if_oos_label":  [label_lookup[s] for s in items],
    "try_1":         [fmt_neighbor(s, 0) for s in items],
    "try_2":         [fmt_neighbor(s, 1) for s in items],
    "try_3":         [fmt_neighbor(s, 2) for s in items],
})
print(substitutions.head(15).to_string(index=False))
if_oos_sku         if_oos_label                       try_1                        try_2                       try_3
     B3001        bed (harmony)         B3003 bed (kompakt)           B3002 bed (milano)  B3030 nightstand (harmony)
     B3002         bed (milano)         B3001 bed (harmony)          B3003 bed (kompakt)   B3020 headboard (harmony)
     B3003        bed (kompakt)         B3001 bed (harmony)           B3002 bed (milano)  B3030 nightstand (harmony)
     B3010   mattress (comfort)    B3011 mattress (premium)   B3030 nightstand (harmony)   B3020 headboard (harmony)
     B3011   mattress (premium)    B3010 mattress (comfort)   B3030 nightstand (harmony)   B3020 headboard (harmony)
     B3020  headboard (harmony)  B3031 nightstand (kompakt)   B3030 nightstand (harmony)    B3011 mattress (premium)
     B3030 nightstand (harmony)  B3031 nightstand (kompakt)     B3010 mattress (comfort)    B3011 mattress (premium)
     B3031 nightstand (kompakt)   B3020 headboard (harmony)   B3030 nightstand (harmony)    B3010 mattress (comfort)
     B3040   wardrobe (harmony)        L1040 ottoman (luna)     E6030 dvd_player (basic)     S8001 dresser (harmony)
     C7001       mirror (round)     G5031 led_strip (basic) G5020 ceiling_light (modern)   G5011 table_lamp (modern)
     C7002      mirror (framed)     G5030 led_strip (smart)         C7010 vase (ceramic)       C7030 curtain (linen)
     C7010       vase (ceramic)       K9020 bar_stool (oak)        C7002 mirror (framed)     G5030 led_strip (smart)
     C7020 picture_frame (wood)   G5011 table_lamp (modern)      G5030 led_strip (smart)       C7002 mirror (framed)
     C7030      curtain (linen)     G5030 led_strip (smart)        C7002 mirror (framed) U1010 garden_table (rattan)
     D2001   dining_table (oak) D2002 dining_table (walnut) D2003 dining_table (kompakt)       D2020 sideboard (oak)

A few rows make the value concrete. For SKU L1001 (sofa harmony, €1199), the top-3 substitutes are alternative sofa SKUs at different price points and other living-room flagships — exactly what an OOS dialog should surface. Same logic for beds, dining tables, etc.

In a real catalog with 1000+ SKUs you wouldn’t print this — you’d store the 24-dimensional vector per SKU and answer queries on demand with faiss or pgvector. The math is the same.
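As an illustration of that on-demand lookup, a minimal sketch assuming the faiss-cpu package (not used elsewhere in this chapter); normalizing the vectors once makes inner-product search equivalent to cosine search:

Code
import faiss  # ASSUMPTION: faiss-cpu installed; pgvector or any ANN index works the same way

vecs = emb.astype("float32")
faiss.normalize_L2(vecs)                   # in-place row normalization
index = faiss.IndexFlatIP(vecs.shape[1])   # exact inner-product index == cosine after normalizing
index.add(vecs)

def neighbors_fast(sku, top_k=3):
    scores, idxs = index.search(vecs[ix[sku]][None, :], top_k + 1)  # +1: the query matches itself
    return [(items[j], float(s))
            for j, s in zip(idxs[0], scores[0]) if items[j] != sku][:top_k]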

How does SKU-level differ from Family-level?

If we’d run this at Family level (40 items, like the rest of the chapters), the substitution lookup would only tell us “if bed is out, try mattress” — not useful. SKU level lets us answer “which bed model is closest to the out-of-stock one.” That’s the operational value.

Question                          Family level                              SKU level
Cross-sell at checkout            ✅ “with sofa, suggest coffee_table”      ✅ same
Same-family substitution          ❌ collapses all sofa SKUs to one point   ✅ “L1001 ≈ L1002 ≈ L1003”
t-SNE shows category structure?   ✅ yes, cleanly                           ✅ yes, with sub-clusters per family
Embedding density                 better (40 items)                         adequate (66 items)
Practical for OOS UI              no (too coarse)                           yes

The trade: SKU level has slightly less data per item, but it answers a question that family level structurally can’t.

How does this compare to association rules?

Question                                     Association rules            Embeddings (SKU)
Two items in the same basket                 direct (rule fires)          indirect (similar contexts)
Two SKUs, never co-bought, similar context   invisible                    similar embeddings
Threshold tuning needed?                     yes (support / confidence)   no (everything has a vector)
Output type                                  discrete rules               continuous geometry
Best use                                     “if A in cart, suggest B”    “given A, find substitutes”

Rules and embeddings answer different questions. Rules are for what to add to the cart; embeddings are for what to swap in when something’s missing.

Limitations

  • 66 SKUs is genuinely small. Word2vec-class methods shine with vocabularies of thousands. Here the structure is real but the geometry has wiggle room — t-SNE with different perplexity settings shifts the layout noticeably. With 1000+ SKUs the embeddings would be far more stable.
  • No customer dimension. These are product-side embeddings. Adding customer embeddings would let us recommend SKUs a specific customer is likely to want, not just substitutes for a given SKU. That’s the next step — collaborative filtering / two-tower models.
  • Similarity is functional, not perceptual. The embedding says bed harmony is similar to bed milano because they appear in similar baskets. It doesn’t capture why — same target customer, similar price point, same supplier, etc. Adding metadata (price, supplier, color) into the embedding (concatenate-then-PCA, or a metadata-aware model) would sharpen the substitution suggestions; a sketch follows this list.
  • PPMI × SVD vs. word2vec. Mathematically equivalent in expectation. For our 66 items either works; for a real catalog you’d use online word2vec, BPR matrix factorization, or graph neural networks on the customer-product bipartite graph.
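A minimal sketch of the concatenate-then-PCA idea. It assumes a numeric price column per SKU, which this chapter’s schema does not actually show; treat it as a placeholder for whatever metadata you have:

Code
# ASSUMPTION: df carries a numeric "price" column (not part of this chapter's schema).
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

price = df.drop_duplicates("article_id").set_index("article_id")["price"].reindex(items)
meta = StandardScaler().fit_transform(price.to_frame().fillna(price.median()))
emb_meta = PCA(n_components=24, random_state=RANDOM_STATE).fit_transform(
    np.hstack([emb, meta])  # 24 co-purchase dims + 1 metadata dim, re-mixed by PCA
)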