---
title: "Association Rules — Market Basket Analysis"
---
Given a pile of receipts, *association rule mining* discovers patterns of the form
*"customers who buy A and B also tend to buy C"*. The textbook algorithm is
[Apriori](https://en.wikipedia.org/wiki/Apriori_algorithm) — fast, deterministic, and the standard starting point.
A rule $A \Rightarrow B$ is characterized by three numbers:
- **Support**: the fraction of baskets that contain every item of $A \cup B$, i.e. $P(A \cap B)$ when $A$ and $B$ are read as the events "basket contains $A$" and "basket contains $B$". High support = the rule covers a meaningful share of customers.
- **Confidence**: given $A$ in the basket, how likely is $B$, i.e. $P(B \mid A) = \frac{P(A \cap B)}{P(A)}$. High confidence = the rule is reliable.
- **Lift**: how much more often $B$ shows up when $A$ is present compared to its baseline rate, i.e. $\frac{P(B \mid A)}{P(B)}$. Lift > 1 = positive association.
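To make the three numbers concrete, here is a minimal sketch that computes them by hand. The five baskets and the helper `rule_metrics` are invented for illustration; they are not part of the dataset or of either library.

```python
# Toy data, invented for illustration: five baskets, three products.
toy_baskets = [
    {"bed", "mattress"},
    {"bed", "mattress", "nightstand"},
    {"bed"},
    {"mattress"},
    {"nightstand"},
]

def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent => consequent."""
    n = len(baskets)
    n_a = sum(1 for b in baskets if antecedent <= b)   # baskets containing all of A
    n_b = sum(1 for b in baskets if consequent <= b)   # baskets containing all of B
    n_ab = sum(1 for b in baskets if (antecedent | consequent) <= b)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    lift = confidence / (n_b / n) if n_b else 0.0
    return support, confidence, lift

support, confidence, lift = rule_metrics(toy_baskets, {"bed"}, {"mattress"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# support=0.40 confidence=0.67 lift=1.11
```

Two of the five baskets hold both items (support 0.40), two of the three bed baskets also hold a mattress (confidence 0.67), and since mattresses appear in 60% of all baskets the lift is only 0.67 / 0.60 ≈ 1.11.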
We mine rules at minimum support 0.001 and minimum confidence 0.5, then sort by confidence. Two implementations run side by side: `arules` in R and `mlxtend` in Python.
## Data
For market basket analysis we only need two columns: the basket identifier (`transaction_id`) and the item identifier. We use `article_name`, not `article_id` — so a "sofa" rule isn't fragmented across three SKUs of the same product.
```{r}
#| label: load-data-r
library(dplyr)
library(readr)
.data_path <- if (file.exists("data/raw/transactions.csv")) "data/raw/transactions.csv" else "data/synthetic/transactions.csv"
raw <- read_delim(.data_path, delim = ";", show_col_types = FALSE)
cat("rows:", nrow(raw), " baskets:", n_distinct(raw$transaction_id), " items:", n_distinct(raw$article_name), "\n")
```
## R — `arules`
### Build the transaction object
```{r}
#| label: build-transactions
library(arules)
# C locale needed for stable sort of items across baskets; without it, real
# data with German characters (Kopfstütze, Eßtisch, …) hits a sparse-matrix
# invariant violation when arules constructs its internal indexing.
Sys.setlocale("LC_COLLATE", "C")
# Defensive: drop missing / empty article names before grouping. Real data
# typically has a long tail of unparseable rows (returns, miscoded items, ...);
# arules can't construct its sparse matrix if any items are NA.
clean <- raw[!is.na(raw$article_name) & nchar(trimws(raw$article_name)) > 0, ]
clean$article_name <- trimws(clean$article_name)
baskets <- split(clean$article_name, clean$transaction_id)
baskets <- lapply(baskets, function(b) sort(unique(b))) # sort + dedupe
trans <- as(baskets, "transactions")
summary(trans)
```
### Item frequency
The 15 most-purchased products:
```{r}
#| label: fig-item-frequency
#| fig-cap: "Top 15 articles by basket appearance"
itemFrequencyPlot(trans, topN = 15, type = "absolute",
col = "steelblue", main = "")
```
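`dining_chair` leads, as designed: chairs ride along on nearly every dining-table purchase and are also bought on their own. Set size plays no role in this chart: baskets are deduplicated, so a set of six chairs counts as one basket appearance.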
### Mining rules
```{r}
#| label: apriori-r
rules <- apriori(trans,
parameter = list(support = 0.001, confidence = 0.5),
control = list(verbose = FALSE))
cat("Total rules found:", length(rules), "\n")
```
Apriori returns hundreds of rules at this threshold. Sorting by confidence puts the most reliable ones first, but the very top is dominated by *complex* rules of the form `{A, B, C, D} ⇒ E`. These are technically high-confidence (often 100%) but rest on so few baskets that the confidence estimate is fragile.
The headline patterns are the **simple** rules: one item on the left, one on the right.
```{r}
#| label: simple-rules-r
simple <- subset(rules, size(rules) == 2)
simple_sorted <- sort(simple, by = "confidence", decreasing = TRUE)
inspect(head(simple_sorted, 10))
```
These are *all* simple rules sorted by confidence, including the definitional ones (`bed ⇒ mattress`, `headboard ⇒ bed`) that aren't really insights. The Python section below applies a **triage filter** (symmetry + accessory tag) that separates actionable rules from definitional plumbing. The raw output here is intentionally unfiltered for the R/Python comparison; the curated headline view follows in the Python section.
Confidence range across the 8 simple rules: ~53–81%. Lift range: 3–14×. Every one of these pairs co-occurs several times more often than chance, but as we'll see, "co-occurs more than chance" isn't the same as "is a useful cross-sell prompt".
### What about the complex rules?
Multi-item rules can hit 100% confidence because they're highly specific — `{bed, mattress, table_extension} ⇒ dining_table` triggers only on the rare baskets where someone happens to buy bedroom and dining furniture together. They're not noise, but they're more interesting as patterns to investigate than as rules to act on.
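A small arithmetic sketch shows why a 100% rule built on a handful of baskets deserves caution. The counts below are hypothetical, not taken from the data, and `confidence_after_one_miss` is an illustrative helper: it asks what the confidence would be if a single additional basket contained the left-hand side but not the right-hand side.

```python
def confidence_after_one_miss(n_antecedent, n_both):
    """Confidence if one more basket had the antecedent but not the consequent."""
    return n_both / (n_antecedent + 1)

# Hypothetical counts, chosen only to contrast a rare rule with a common one.
complex_rule = confidence_after_one_miss(n_antecedent=3, n_both=3)      # was 3/3 = 1.00
simple_rule = confidence_after_one_miss(n_antecedent=600, n_both=480)   # was 480/600 = 0.80
print(f"complex: 1.00 -> {complex_rule:.2f}, simple: 0.80 -> {simple_rule:.2f}")
# complex: 1.00 -> 0.75, simple: 0.80 -> 0.80
```

One counter-example drops the rare rule by 25 points and leaves the common rule essentially unchanged, which is why the multi-item rules are better treated as leads to investigate.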
### Visualizing rules
```{r}
#| label: fig-rules-scatter
#| fig-cap: "Rule landscape: each point is one rule. Top-right corner = high support **and** high confidence (the most actionable rules)."
library(arulesViz)
plot(rules, method = "scatterplot",
measure = c("support", "confidence"), shading = "lift",
engine = "ggplot2")
```
## Python — `mlxtend`
The same analysis with the standard Python stack:
```{python}
#| label: apriori-python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
from pathlib import Path
_data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv"
df = pd.read_csv(_data_path, sep=";")
# Drop rows whose article_name didn't survive the family-mapping (NaN or empty
# string). Mixing NaN floats with string items would crash apriori's internal
# sort with TypeError on real data.
df = df[df["article_name"].notna() & (df["article_name"].astype(str).str.strip() != "")].copy()
df["article_name"] = df["article_name"].astype(str).str.strip()
baskets = df.groupby("transaction_id")["article_name"].apply(lambda s: list(set(s))).tolist()
te = TransactionEncoder()
basket_matrix = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)
freq_items = apriori(basket_matrix, min_support=0.001, use_colnames=True)
rules = association_rules(
freq_items, num_itemsets=len(basket_matrix),
metric="confidence", min_threshold=0.5,
)
print(f"Total rules found: {len(rules)}")
```
### Triage before display — separating insight from plumbing
A naive sort by confidence puts rules like `bed → mattress` and `headboard → bed` at the top. Those are textbook *definitional* relationships — bed components sell as a system; customers don't really choose between buying-the-mattress-or-not. But not every within-bundle pair is plumbing: `dining_table → dining_chair` is also within-bundle yet behaviorally meaningful — chairs are bought without tables (existing tables, replacements, sets) but tables almost never without chairs.
The right filter combines two signals:
- **Symmetry score**: `min(conf(A→B), conf(B→A)) / max(conf(A→B), conf(B→A))`, computed from raw basket data so we can read both directions even when one falls below the apriori confidence threshold. Symmetric pairs (score ≥ 0.7) are items that rarely appear without each other (`bed ↔ mattress`).
- **Accessory marker**: a small curated set of items that exist *only* as accessories of a primary product (`headboard`, `nightstand`, `table_extension`). A rule from one of these into its bundle's primary product is structural plumbing.
Rules caught by neither filter (neither symmetric nor accessory-to-primary) are the ones a stakeholder should see.
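As a quick worked example of the symmetry score, the sketch below applies the min/max ratio to two pairs. The confidence values are hypothetical, chosen only to land on either side of the 0.7 cut-off, and `symmetry` is an illustrative helper.

```python
def symmetry(conf_ab, conf_ba):
    """min/max ratio of the two directional confidences; 0.0 if both are zero."""
    hi = max(conf_ab, conf_ba)
    return min(conf_ab, conf_ba) / hi if hi > 0 else 0.0

# Hypothetical confidences, not measured values.
bed_mattress = symmetry(0.80, 0.72)   # strong in both directions
table_chair = symmetry(0.78, 0.25)    # strong in one direction only
print(round(bed_mattress, 2), round(table_chair, 2))
# 0.9 0.32
```

With these numbers the first pair would be filed as a symmetric (definitional) pair, while the second stays a candidate for the actionable list.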
```{python}
#| label: simple-and-triage
is_simple = (rules["antecedents"].apply(len) == 1) & (rules["consequents"].apply(len) == 1)
simple = (
rules.loc[is_simple]
.assign(
antecedent=lambda d: d["antecedents"].apply(lambda s: next(iter(s))),
consequent=lambda d: d["consequents"].apply(lambda s: next(iter(s))),
)
[["antecedent", "consequent", "support", "confidence", "lift"]]
.reset_index(drop=True)
)
ACCESSORY_ITEMS = {"headboard", "nightstand", "table_extension"}
bundle_lookup = (
df.drop_duplicates("article_name").set_index("article_name")["bundle_group"]
.fillna("").to_dict()
)
basket_sets = df.groupby("transaction_id")["article_name"].apply(set).tolist()
def cond_prob(a, b):
"""P(a | b) — fraction of baskets containing b that also contain a."""
n_b = sum(1 for s in basket_sets if b in s)
if n_b == 0:
return 0.0
n_both = sum(1 for s in basket_sets if a in s and b in s)
return n_both / n_b
# Reverse direction: conf(consequent → antecedent) = P(antecedent | consequent).
simple["reverse_conf"] = simple.apply(
lambda r: cond_prob(r["antecedent"], r["consequent"]), axis=1
)
simple["symmetry"] = simple.apply(
lambda r: min(r["confidence"], r["reverse_conf"]) / max(r["confidence"], r["reverse_conf"])
if max(r["confidence"], r["reverse_conf"]) > 0 else 0.0,
axis=1,
)
simple["within_bundle"] = simple.apply(
lambda r: bundle_lookup.get(r["antecedent"], "") == bundle_lookup.get(r["consequent"], "")
and bundle_lookup.get(r["antecedent"], "") != "",
axis=1,
)
simple["accessory_to_primary"] = (
simple["antecedent"].isin(ACCESSORY_ITEMS) & simple["within_bundle"]
)
def classify(row):
if row["accessory_to_primary"]:
return "definitional (accessory → primary)"
if row["symmetry"] >= 0.7:
return "definitional (symmetric pair)"
return "actionable"
simple["triage"] = simple.apply(classify, axis=1)
print("Rules by triage class:")
for cls, n in simple["triage"].value_counts().items():
print(f" {cls:40s} {n}")
```
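`cond_prob` rescans every basket on each call, which is fine for a handful of simple rules. If the rule list grows, one possible alternative is to derive all pairwise confidences from a single matrix product over the one-hot frame. The sketch below assumes a boolean frame shaped like `basket_matrix` (rows = baskets, columns = items); `toy_matrix` is an invented stand-in, and the dense product is only practical for a moderately sized catalog.

```python
import numpy as np
import pandas as pd

# Toy one-hot frame standing in for `basket_matrix` (invented data).
toy_matrix = pd.DataFrame(
    {
        "bed":        [1, 1, 1, 0, 0],
        "mattress":   [1, 1, 0, 1, 0],
        "nightstand": [0, 1, 0, 0, 1],
    }
).astype(bool)

m = toy_matrix.to_numpy(dtype=np.int64)
co_counts = m.T @ m                  # [i, j] = baskets containing both item i and item j
item_counts = np.diag(co_counts)     # [i]    = baskets containing item i

# conf[i, j] = P(j | i) = count(i and j) / count(i); rows with zero count stay 0.
conf = np.divide(
    co_counts, item_counts[:, None],
    out=np.zeros(co_counts.shape, dtype=float),
    where=item_counts[:, None] > 0,
)
conf_table = pd.DataFrame(conf, index=toy_matrix.columns, columns=toy_matrix.columns)

print(conf_table.loc["bed", "mattress"])     # conf(bed -> mattress)
print(conf_table.loc["nightstand", "bed"])   # conf(nightstand -> bed)
```

Reading `conf_table.loc[a, b]` and `conf_table.loc[b, a]` gives both directions of a pair, so the forward and reverse confidence come from the same table.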
### Top *actionable* rules — the headline
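These are the rules a stakeholder should see: asymmetric, and not an accessory pointing at its primary product. They can sit within a bundle (`dining_table → dining_chair`) or across bundles; either way, buying the first product is a strong signal for the second: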
```{python}
#| label: actionable-rules
actionable = simple[simple["triage"] == "actionable"].sort_values("confidence", ascending=False)
actionable.head(10)[["antecedent", "consequent", "support", "confidence", "lift",
"reverse_conf", "symmetry"]]\
.round({"support": 4, "confidence": 3, "lift": 2, "reverse_conf": 3, "symmetry": 2})
```
These are deployable cross-sell triggers — exactly the rules to wire into "you might also need" prompts at checkout.
### Plumbing — for transparency only
For completeness, the rules we filtered out as definitional. They're correct rules; they're just not insights:
```{python}
#| label: plumbing-rules
plumbing = simple[simple["triage"] != "actionable"].sort_values("confidence", ascending=False)
plumbing.head(10)[["antecedent", "consequent", "confidence", "reverse_conf", "symmetry", "triage"]]\
.round({"confidence": 3, "reverse_conf": 3, "symmetry": 2})
```
Notice that the symmetric pair (`bed ↔ mattress`) and the accessory-to-primary rules (`headboard → bed`) are exactly what we'd expect: a bed needs a mattress, and customers who buy a headboard are buying a bed. The triage doesn't *throw out* this information; it just files it correctly as catalog plumbing rather than discovered insight.
```{python}
#| label: fig-rules-python
#| fig-cap: "Same scatter view in Python: support × confidence, colored by lift."
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=rules, x="support", y="confidence", hue="lift",
palette="viridis", ax=ax)
ax.set_xscale("log")
ax.set_xlabel("Support (log scale)")
ax.set_ylabel("Confidence")
plt.tight_layout()
plt.show()
```
## R vs. Python — same answers?
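Both implementations recover the same simple-rule headline patterns at identical support / confidence values: Apriori is deterministic, so for the same thresholds the result does not depend on the implementation. Total rule counts need not match, though. `arules` mines only rules with a single item on the right-hand side, while `mlxtend`'s `association_rules` also emits rules with multi-item consequents. The exact ordering of complex rules can also differ when many tie on confidence (1.0), but that's a tiebreaker artifact, not a difference in what the algorithms find.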
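One way to check the agreement rather than eyeball it is to bring both sets of simple rules into pandas and join them on the rule itself. The sketch below is an assumption-laden illustration: the two frames, their column names, and their values are hypothetical stand-ins for whatever each tool exports.

```python
import numpy as np
import pandas as pd

# Hypothetical exports: one row per simple rule from each tool.
r_rules = pd.DataFrame({
    "antecedent": ["bed", "dining_table"],
    "consequent": ["mattress", "dining_chair"],
    "confidence": [0.81, 0.78],
})
py_rules = pd.DataFrame({
    "antecedent": ["dining_table", "bed"],
    "consequent": ["dining_chair", "mattress"],
    "confidence": [0.78, 0.81],
})

# Outer join keeps rules found by only one tool; `_merge` says which side.
merged = r_rules.merge(
    py_rules, on=["antecedent", "consequent"],
    how="outer", suffixes=("_r", "_py"), indicator=True,
)
only_one_side = merged[merged["_merge"] != "both"]
both = merged[merged["_merge"] == "both"]
values_match = np.isclose(both["confidence_r"], both["confidence_py"]).all()

print(f"rules in one tool only: {len(only_one_side)}, values match: {values_match}")
# rules in one tool only: 0, values match: True
```

Comparing with `np.isclose` instead of `==` avoids false alarms from floating-point rounding between the two implementations.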
### Implementation differences worth knowing
| | `arules` (R) | `mlxtend` (Python) |
|---|---|---|
| Internal representation | sparse C-level transaction matrix | one-hot DataFrame |
| Performance on large data | very fast (10× ahead of mlxtend on 100k+ baskets) | acceptable up to ~10k baskets |
| Visualization | rich (`arulesViz`: scatter, graph, parallel coordinates, matrix) | DIY with matplotlib / networkx |
| Subsetting / pruning rules | first-class (`subset(rules, ...)`) | pandas filtering on the rules frame |
| Ecosystem fit | preferred when downstream is R / shiny | preferred when downstream is FastAPI / serving |
For exploratory work and visualization the R toolkit wins; for embedding into a Python service `mlxtend` is the easier deploy.
## Takeaway
The Apriori algorithm recovers the engineered co-purchase patterns cleanly in both languages. For a real retailer these rules drive concrete decisions: *which products to display together, what to bundle, what to suggest as upsell at checkout.* Practically: `dining_table` shoppers should never leave the store without seeing chairs, and the bed display should sit next to the mattress and nightstand sections.