Association Rules — Market Basket Analysis

Given a pile of receipts, association rule mining discovers patterns of the form “customers who buy A and B also tend to buy C”. The textbook algorithm is Apriori — fast, deterministic, and the standard starting point.

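A rule \(A \Rightarrow B\) is characterized by three numbers:

  • Support: \(P(A, B)\), the share of all baskets that contain every item of A and B.
  • Confidence: \(P(B \mid A)\), the share of baskets containing A that also contain B.
  • Lift: \(P(B \mid A) / P(B)\), how much more often B appears alongside A than it does overall; values above 1 indicate a positive association.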

We mine rules at minimum support 0.001 and minimum confidence 0.5, then sort by confidence. Two implementations run side by side: arules in R and mlxtend in Python.
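
To make support, confidence, and lift concrete, the sketch below recomputes them by hand from raw counts. It uses the dining_table ⇒ dining_chair counts that appear in the arules output later in this document (219 joint baskets, 580 dining_chair baskets, 3515 baskets in total); the 271 dining_table baskets are derived from the reported coverage, and the helper name rule_measures is illustrative only.

```python
def rule_measures(n_total, n_a, n_b, n_ab):
    """Support, confidence and lift of A => B from basket counts."""
    support = n_ab / n_total              # P(A and B)
    confidence = n_ab / n_a               # P(B | A)
    lift = confidence / (n_b / n_total)   # P(B | A) / P(B)
    return support, confidence, lift

# Counts taken from the arules output for dining_table => dining_chair.
support, confidence, lift = rule_measures(n_total=3515, n_a=271, n_b=580, n_ab=219)
print(f"support={support:.4f}  confidence={confidence:.4f}  lift={lift:.2f}")
```

The values agree with the first row of the arules table further down (support 0.0623, confidence 0.808, lift 4.90).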

Data

For market basket analysis we only need two columns: the basket identifier (transaction_id) and the item identifier. We use article_name, not article_id — so a “sofa” rule isn’t fragmented across three SKUs of the same product.
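
The effect of the item level can be seen on a toy example. The data below is invented for illustration (three hypothetical sofa SKUs and one coffee-table SKU), and pair_support is a made-up helper, not part of the pipeline.

```python
import pandas as pd

# Hypothetical toy data: three SKUs (article_id) that all map to "sofa".
toy = pd.DataFrame({
    "transaction_id": ["T1", "T1", "T2", "T2", "T3", "T3", "T4"],
    "article_id":     ["S-01", "C-01", "S-02", "C-01", "S-03", "C-01", "C-01"],
    "article_name":   ["sofa", "coffee_table", "sofa", "coffee_table",
                       "sofa", "coffee_table", "coffee_table"],
})

def pair_support(df, item_col, a, b):
    """Share of baskets that contain both a and b at the given item level."""
    sets = df.groupby("transaction_id")[item_col].apply(set)
    return sum(1 for s in sets if a in s and b in s) / len(sets)

by_name = pair_support(toy, "article_name", "sofa", "coffee_table")
by_sku = pair_support(toy, "article_id", "S-01", "C-01")
print(f"support by name: {by_name:.2f}   support by single SKU: {by_sku:.2f}")
```

At the name level the pair appears in three of four baskets; at the SKU level the same behavior is split into three weaker pairs, each of which is easier to lose below a support threshold.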

Code
library(dplyr)
library(readr)

.data_path <- if (file.exists("data/raw/transactions.csv")) "data/raw/transactions.csv" else "data/synthetic/transactions.csv"
raw <- read_delim(.data_path, delim = ";", show_col_types = FALSE)
cat("rows:", nrow(raw), "  baskets:", n_distinct(raw$transaction_id), "  items:", n_distinct(raw$article_name), "\n")
rows: 6392   baskets: 3515   items: 40 

R — arules

Build the transaction object

Code
library(arules)
# C locale needed for stable sort of items across baskets; without it, real
# data with German characters (Kopfstütze, Eßtisch, …) hits a sparse-matrix
# invariant violation when arules constructs its internal indexing.
Sys.setlocale("LC_COLLATE", "C")
[1] "C"
Code
# Defensive: drop missing / empty article names before grouping. Real data
# typically has a long tail of unparseable rows (returns, miscoded items, ...);
# arules can't construct its sparse matrix if any items are NA.
clean <- raw[!is.na(raw$article_name) & nchar(trimws(raw$article_name)) > 0, ]
clean$article_name <- trimws(clean$article_name)

baskets <- split(clean$article_name, clean$transaction_id)
baskets <- lapply(baskets, function(b) sort(unique(b)))  # sort + dedupe
trans   <- as(baskets, "transactions")
summary(trans)
transactions as itemMatrix in sparse format with
 3515 rows (elements/itemsets/transactions) and
 40 columns (items) and a density of 0.0454623 

most frequent items:
dining_chair     mattress         sofa coffee_table          bed      (Other) 
         580          398          337          322          285         4470 

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6 
1861  844  509  206   78   17 

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   1.000   1.000   1.818   2.000   6.000 

includes extended item information - examples:
     labels
1  armchair
2 bar_stool
3       bed

includes extended transaction information - examples:
  transactionID
1        T00001
2        T00002
3        T00003

Item frequency

The 15 most-purchased products:

Code
itemFrequencyPlot(trans, topN = 15, type = "absolute",
                  col = "steelblue", main = "")
Figure 1: Top 15 articles by basket appearance

dining_chair leads, exactly as designed: chairs are bought in sets of 2/4/6 and they ride along on most dining-table purchases.

Mining rules

Code
rules <- apriori(trans,
                 parameter = list(support = 0.001, confidence = 0.5),
                 control   = list(verbose = FALSE))
cat("Total rules found:", length(rules), "\n")
Total rules found: 214 

Apriori returns a couple of hundred rules at this threshold. Sorting by confidence puts the highest-confidence ones first, but the very top is dominated by complex rules of the form {A, B, C, D} ⇒ E. These are technically high-confidence (often 100%) but cover so few baskets that the confidence estimate is unreliable.

The headline patterns are the simple rules: one item on the left, one on the right.

Code
simple <- subset(rules, size(rules) == 2)
simple_sorted <- sort(simple, by = "confidence", decreasing = TRUE)
inspect(head(simple_sorted, 10))
    lhs                rhs             support    confidence coverage  
[1] {dining_table}  => {dining_chair}  0.06230441 0.8081181  0.07709815
[2] {kitchen_table} => {kitchen_chair} 0.02873400 0.7372263  0.03897582
[3] {bed}           => {mattress}      0.05604552 0.6912281  0.08108108
[4] {garden_table}  => {garden_chair}  0.01365576 0.6666667  0.02048364
[5] {headboard}     => {bed}           0.02702703 0.6089744  0.04438122
[6] {sideboard}     => {dining_table}  0.02105263 0.6016260  0.03499289
[7] {desk}          => {office_chair}  0.02759602 0.5914634  0.04665718
[8] {sideboard}     => {dining_chair}  0.01849218 0.5284553  0.03499289
    lift      count
[1]  4.897474 219  
[2]  9.492126 101  
[3]  6.104690 197  
[4] 14.202020  48  
[5]  7.510684  95  
[6]  7.803378  74  
[7]  8.315976  97  
[8]  3.202621  65  

These are all simple rules sorted by confidence, including the definitional ones (bed ⇒ mattress, headboard ⇒ bed) that aren’t really insights. The Python section below applies a triage filter (symmetry + accessory tag) that separates actionable rules from definitional plumbing. The raw output here is intentionally unfiltered for the R/Python comparison; the curated headline view is in the Python section.

Confidence across the 8 simple rules ranges from about 53% to 81%, and lift from about 3× to 14×. The three highest-confidence pairs co-occur roughly 5 to 9.5 times more often than chance, but as we’ll see, “co-occurs more than chance” isn’t the same as “is a useful cross-sell prompt”.

What about the complex rules?

Multi-item rules can hit 100% confidence because they’re highly specific: {bed, mattress, table_extension} ⇒ dining_table triggers only on the rare baskets where someone happens to buy bedroom and dining furniture together. The pattern may be real, but with so few supporting baskets such rules are more interesting as patterns to investigate than as rules to act on.
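
One way to make this caution concrete is to put an interval around a rule’s confidence. The sketch below uses a Wilson score interval, which is a choice made for this illustration and not part of the analysis itself; wilson_interval is a hypothetical helper. At minimum support 0.001 on 3515 baskets a rule needs only 4 supporting baskets (0.001 × 3515 ≈ 3.5), so the example compares a hypothetical rule that holds in 4 of 4 baskets with dining_table ⇒ dining_chair (219 of 271 baskets).

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for the proportion hits/n (z=1.96 is roughly 95%)."""
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# Smallest basket count that satisfies support >= 0.001 on 3515 baskets.
min_count = math.ceil(0.001 * 3515)

rare_lo, rare_hi = wilson_interval(4, 4)          # hypothetical 100%-confidence rule
common_lo, common_hi = wilson_interval(219, 271)  # dining_table => dining_chair

print(f"min count: {min_count}")
print(f"4/4 rule:     {rare_lo:.2f} to {rare_hi:.2f}")
print(f"219/271 rule: {common_lo:.2f} to {common_hi:.2f}")
```

Under these assumptions the 4-of-4 rule is compatible with a true confidence anywhere from about 0.51 upward, while the well-supported rule is pinned to a much narrower band around its observed 0.81.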

Visualizing rules

Code
library(arulesViz)
plot(rules, method = "scatterplot",
     measure = c("support", "confidence"), shading = "lift",
     engine = "ggplot2")
Figure 2: Rule landscape: each point is one rule. Top-right corner = high support and high confidence (the most actionable rules).

Python — mlxtend

The same analysis with the standard Python stack:

Code
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

from pathlib import Path
_data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv"
df = pd.read_csv(_data_path, sep=";")
# Drop rows whose article_name didn't survive the family-mapping (NaN or empty
# string). Mixing NaN floats with string items would crash apriori's internal
# sort with TypeError on real data.
df = df[df["article_name"].notna() & (df["article_name"].astype(str).str.strip() != "")].copy()
df["article_name"] = df["article_name"].astype(str).str.strip()
baskets = df.groupby("transaction_id")["article_name"].apply(lambda s: list(set(s))).tolist()

te = TransactionEncoder()
basket_matrix = pd.DataFrame(te.fit_transform(baskets), columns=te.columns_)

freq_items = apriori(basket_matrix, min_support=0.001, use_colnames=True)
rules = association_rules(
    freq_items, num_itemsets=len(basket_matrix),
    metric="confidence", min_threshold=0.5,
)
print(f"Total rules found: {len(rules)}")
Total rules found: 234

Triage before display — separating insight from plumbing

A naive sort by confidence puts rules like bed → mattress and headboard → bed near the top. Those are textbook definitional relationships: bed components sell as a system; customers don’t really choose between buying-the-mattress-or-not. But not every within-bundle pair is plumbing: dining_table → dining_chair is also within-bundle yet behaviorally meaningful. Chairs are often bought without tables (existing tables, replacements, sets), but tables are rarely bought without chairs: about 81% of dining_table baskets include dining_chair, while only about 38% of dining_chair baskets include dining_table.

The right filter combines two signals:

  • Symmetry score: min(conf(A→B), conf(B→A)) / max(conf(A→B), conf(B→A)), computed from raw basket data so we can read both directions even when one falls below the apriori threshold. Symmetric rules (score ≥ 0.7) are pairs where you rarely have one without the other (bed ↔︎ mattress).
  • Accessory marker: a small curated set of items that exist only as accessories of a primary product (headboard, nightstand, table_extension). A rule from one of these into its bundle’s primary product is structural plumbing.

Rules that fail both filters — neither symmetric nor accessory-to-primary — are the ones a stakeholder should see.

Code
is_simple = (rules["antecedents"].apply(len) == 1) & (rules["consequents"].apply(len) == 1)
simple = (
    rules.loc[is_simple]
    .assign(
        antecedent=lambda d: d["antecedents"].apply(lambda s: next(iter(s))),
        consequent=lambda d: d["consequents"].apply(lambda s: next(iter(s))),
    )
    [["antecedent", "consequent", "support", "confidence", "lift"]]
    .reset_index(drop=True)
)

ACCESSORY_ITEMS = {"headboard", "nightstand", "table_extension"}
bundle_lookup = (
    df.drop_duplicates("article_name").set_index("article_name")["bundle_group"]
      .fillna("").to_dict()
)
basket_sets = df.groupby("transaction_id")["article_name"].apply(set).tolist()

def cond_prob(a, b):
    """P(a | b) — fraction of baskets containing b that also contain a."""
    n_b = sum(1 for s in basket_sets if b in s)
    if n_b == 0:
        return 0.0
    n_both = sum(1 for s in basket_sets if a in s and b in s)
    return n_both / n_b

simple["reverse_conf"] = simple.apply(
    lambda r: cond_prob(r["antecedent"], r["consequent"]), axis=1
)
simple["symmetry"] = simple.apply(
    lambda r: min(r["confidence"], r["reverse_conf"]) / max(r["confidence"], r["reverse_conf"])
        if max(r["confidence"], r["reverse_conf"]) > 0 else 0.0,
    axis=1,
)
simple["within_bundle"] = simple.apply(
    lambda r: bundle_lookup.get(r["antecedent"], "") == bundle_lookup.get(r["consequent"], "")
              and bundle_lookup.get(r["antecedent"], "") != "",
    axis=1,
)
simple["accessory_to_primary"] = (
    simple["antecedent"].isin(ACCESSORY_ITEMS) & simple["within_bundle"]
)

def classify(row):
    if row["accessory_to_primary"]:
        return "definitional (accessory → primary)"
    if row["symmetry"] >= 0.7:
        return "definitional (symmetric pair)"
    return "actionable"

simple["triage"] = simple.apply(classify, axis=1)

print("Rules by triage class:")
Rules by triage class:
Code
for cls, n in simple["triage"].value_counts().items():
    print(f"  {cls:40s} {n}")
  actionable                               6
  definitional (accessory → primary)       1
  definitional (symmetric pair)            1
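
The cond_prob helper above rescans every basket for each rule, which is fine at this scale. For larger catalogs, one possible alternative (a sketch, not the code used for the results here) is to build a co-occurrence count matrix once and read both directions of every pair from it. The toy baskets and the names onehot, cooc, conf and symmetry below are made up for the illustration.

```python
import pandas as pd

# Hypothetical toy baskets, only to show the mechanics.
toy_baskets = [
    {"bed", "mattress"},
    {"bed", "mattress", "headboard"},
    {"mattress"},
    {"dining_table", "dining_chair"},
    {"dining_chair"},
    {"dining_chair"},
]
items = sorted(set().union(*toy_baskets))
onehot = pd.DataFrame([[int(i in b) for i in items] for b in toy_baskets],
                      columns=items)

cooc = onehot.T @ onehot   # cooc.loc[a, b] = number of baskets containing a and b
counts = pd.Series(cooc.values.diagonal(), index=items)  # diagonal = per-item counts
conf = cooc.div(counts, axis=0)   # conf.loc[a, b] = P(b | a)

def symmetry(a, b):
    """min/max ratio of the two directional confidences for the pair (a, b)."""
    lo, hi = sorted([conf.loc[a, b], conf.loc[b, a]])
    return lo / hi if hi > 0 else 0.0

print(round(symmetry("bed", "mattress"), 3))
print(round(symmetry("dining_table", "dining_chair"), 3))
```

The matrix costs one pass over the data and then answers any pair lookup directly, at the price of holding an items × items table in memory.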

Top actionable rules — the headline

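These are the rules a stakeholder should see: asymmetric, and not driven by an accessory item. Some are within-bundle (dining_table → dining_chair), but in every case the antecedent predicts the consequent clearly better than the reverse: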

Code
actionable = simple[simple["triage"] == "actionable"].sort_values("confidence", ascending=False)
actionable.head(10)[["antecedent", "consequent", "support", "confidence", "lift",
                     "reverse_conf", "symmetry"]]\
    .round({"support": 4, "confidence": 3, "lift": 2, "reverse_conf": 3, "symmetry": 2})
      antecedent     consequent  support  ...   lift  reverse_conf  symmetry
3   dining_table   dining_chair   0.0623  ...   4.90         0.378      0.47
7  kitchen_table  kitchen_chair   0.0287  ...   9.49         0.370      0.50
6   garden_table   garden_chair   0.0137  ...  14.20         0.291      0.44
5      sideboard   dining_table   0.0211  ...   7.80         0.273      0.45
2           desk   office_chair   0.0276  ...   8.32         0.388      0.66
4      sideboard   dining_chair   0.0185  ...   3.20         0.112      0.21

[6 rows x 7 columns]

These are deployable cross-sell triggers — exactly the rules to wire into “you might also need” prompts at checkout.
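
How such a prompt is wired up depends on the shop system; the following is only a minimal sketch. The rule list copies the six actionable rules with their confidences from the arules table earlier in this document, and the function name suggest and the min_conf parameter are invented for the illustration.

```python
# (antecedent, consequent, confidence) for the six actionable rules.
ACTIONABLE = [
    ("dining_table", "dining_chair", 0.808),
    ("kitchen_table", "kitchen_chair", 0.737),
    ("garden_table", "garden_chair", 0.667),
    ("sideboard", "dining_table", 0.602),
    ("desk", "office_chair", 0.591),
    ("sideboard", "dining_chair", 0.528),
]

def suggest(cart, rules=ACTIONABLE, min_conf=0.5):
    """Return items triggered by the cart contents, highest confidence first."""
    best = {}
    for antecedent, consequent, conf in rules:
        # Skip suggestions for items the customer already has in the cart.
        if antecedent in cart and consequent not in cart and conf >= min_conf:
            best[consequent] = max(conf, best.get(consequent, 0.0))
    return sorted(best, key=best.get, reverse=True)

print(suggest({"sideboard"}))
print(suggest({"dining_table", "sideboard"}))
```

Keeping only the best confidence per suggested item avoids showing the same product twice when several cart items point to it.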

Plumbing — for transparency only

For completeness, the rules we filtered out as definitional. They’re correct rules; they’re just not insights:

Code
plumbing = simple[simple["triage"] != "actionable"].sort_values("confidence", ascending=False)
plumbing.head(10)[["antecedent", "consequent", "confidence", "reverse_conf", "symmetry", "triage"]]\
    .round({"confidence": 3, "reverse_conf": 3, "symmetry": 2})
  antecedent consequent  ...  symmetry                              triage
1        bed   mattress  ...      0.72       definitional (symmetric pair)
0  headboard        bed  ...      0.55  definitional (accessory → primary)

[2 rows x 6 columns]

Notice that the filtered rules are exactly what we’d expect. bed ↔︎ mattress is a symmetric pair (symmetry 0.72) because beds and mattresses are usually bought together. headboard → bed is an accessory-to-primary rule: its symmetry is only 0.55, so it is caught by the accessory marker, not the symmetry threshold, because customers who buy a headboard are typically buying a bed. The triage doesn’t throw out this information; it just files it as catalog plumbing rather than discovered insight.

Code
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(8, 5))
sns.scatterplot(data=rules, x="support", y="confidence", hue="lift",
                palette="viridis", ax=ax)
ax.set_xscale("log")
ax.set_xlabel("Support (log scale)")
ax.set_ylabel("Confidence")
plt.tight_layout()
plt.show()
Figure 3: Same scatter view in Python: support × confidence, colored by lift.

R vs. Python — same answers?

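The two implementations do not report the same total: arules returns 214 rules and mlxtend 234. A likely reason is that arules’ apriori only generates rules with a single item in the consequent, while mlxtend’s association_rules also emits rules with multi-item consequents. On the simple one-to-one rules the two agree: both recover the same 8 headline patterns at identical support and confidence values, because the result is fixed by the algorithm and the thresholds, not by the implementation. The exact ordering of complex rules can differ when many tie on confidence (1.0), but that’s a tiebreaker artifact, not a difference in what the algorithms find.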
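
When comparing rule counts between the two libraries, note that arules’ apriori only generates rules with a single item in the consequent, whereas mlxtend’s association_rules also emits multi-item consequents. A like-for-like count therefore needs a filter on the mlxtend side. The sketch below shows one on a made-up rules frame (toy_rules and its confidence values are hypothetical); whether this filter fully reconciles the two totals on the real data is an assumption that would need checking.

```python
import pandas as pd

# Hypothetical stand-in for an mlxtend rules frame (frozenset columns).
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"bed"}),
                    frozenset({"bed", "headboard"}),
                    frozenset({"headboard"})],
    "consequents": [frozenset({"mattress"}),
                    frozenset({"mattress"}),
                    frozenset({"bed", "mattress"})],
    "confidence": [0.69, 0.80, 0.55],
})

# Keep only rules whose consequent is a single item, as arules would report them.
single_rhs = toy_rules[toy_rules["consequents"].apply(len) == 1]
print(len(toy_rules), "rules in total,", len(single_rhs), "with a single-item consequent")
```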

Implementation differences worth knowing

Aspect | arules (R) | mlxtend (Python)
------ | ---------- | ----------------
Internal representation | sparse C-level transaction matrix | one-hot DataFrame
Performance on large data | very fast (10× ahead of mlxtend on 100k+ baskets) | acceptable up to ~10k baskets
Visualization | rich (arulesViz: scatter, graph, parallel coordinates, matrix) | DIY with matplotlib / networkx
Subsetting / pruning rules | first-class (subset(rules, ...)) | pandas filtering on the rules frame
Ecosystem fit | preferred when downstream is R / shiny | preferred when downstream is FastAPI / serving

For exploratory work and visualization the R toolkit wins; for embedding into a Python service mlxtend is the easier deploy.

Takeaway

The Apriori algorithm recovers the engineered co-purchase patterns cleanly in both languages. For a real retailer these rules drive concrete decisions: which products to display together, what to bundle, what to suggest as upsell at checkout. Practically: dining_table shoppers should never leave the store without seeing chairs, and the bed display should sit next to the mattress and nightstand sections.