Data Audit — Catalog Hygiene & Bundle Tagging

This chapter runs before any analysis. Its job: surface the things that would otherwise show up as embarrassing “findings” further downstream — products that don’t belong in the catalog, items that are functional bundles where co-purchase is definitional rather than behavioral, and any structural issues a sound first pass would catch.

The methodology behind it is honest about its origin: the checks here were added after a first analytical pass surfaced exactly these issues: dvd_player flagged as a “delisting candidate” when it was really a catalog-membership question, and bed → mattress rules treated as insight when they’re plumbing. Adding the audit step turns those moments of “wait, that’s not insight” into a systematic upstream filter, while the rest of the chapters keep their analytical purity.

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_theme(style="whitegrid")

from pathlib import Path
_data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv"
df = pd.read_csv(_data_path, sep=";", parse_dates=["date"])
print(f"Raw transactions: {len(df):,d} line items, {df['transaction_id'].nunique():,d} baskets, "
      f"{df['customer_id'].nunique():,d} customers, "
      f"{df['article_name'].nunique()} unique product names")
Raw transactions: 6,392 line items, 3,515 baskets, 2,400 customers, 40 unique product names

1. Catalog hygiene — does every product belong?

A retail catalog has an intended category mix. For our furniture / home-goods retailer, valid product groups are documented in docs/DATA_SPEC.md. The first audit check: are there any products whose presence in the catalog warrants a question?

Heuristic checks:

  • Product-name vs. expected category list — does the name suggest a category?
  • Volume threshold — does the product carry meaningful weight in the business, or is it a long-tail leftover?
  • Last-sale recency — has it sold recently enough to still be “live”?
Code
EXPECTED_GROUPS = {
    "DINI": "Dining",   "LIVI": "Living room",
    "BEDR": "Bedroom",  "OFFI": "Office",
    "LIGH": "Lighting", "ELEC": "Electronics",
    "DECO": "Decor",    "STOR": "Storage",
    "KITC": "Kitchen",  "OUTD": "Outdoor",
}

article_summary = (
    df.groupby(["article_name", "product_group"])
      .agg(line_items=("line_item", "count"),
           revenue=("gross_price", "sum"),
           last_sold=("date", "max"),
           first_sold=("date", "min"))
      .reset_index()
)
total_rev = article_summary["revenue"].sum()
article_summary["revenue_share"] = article_summary["revenue"] / total_rev
article_summary["days_silent"] = (df["date"].max() - article_summary["last_sold"]).dt.days

# Three audit flags
LOW_VOLUME_THRESHOLD = 0.005   # < 0.5% of revenue → review
LONG_SILENT_THRESHOLD = 90     # > 90 days no sale → review
article_summary["flag_low_volume"]  = article_summary["revenue_share"] < LOW_VOLUME_THRESHOLD
article_summary["flag_long_silent"] = article_summary["days_silent"] > LONG_SILENT_THRESHOLD
article_summary["flag_unexpected_group"] = ~article_summary["product_group"].isin(EXPECTED_GROUPS)

flagged = article_summary[
    article_summary["flag_low_volume"]
    | article_summary["flag_long_silent"]
    | article_summary["flag_unexpected_group"]
]
flagged_view = flagged[
    ["article_name", "product_group", "revenue", "revenue_share",
     "days_silent", "flag_low_volume", "flag_long_silent", "flag_unexpected_group"]
].sort_values("revenue_share")

print(f"Flagged products: {len(flagged_view)} of {len(article_summary)} total\n")
flagged_view.round({"revenue": 0, "revenue_share": 4})
Flagged products: 14 of 40 total
article_name product_group revenue revenue_share days_silent flag_low_volume flag_long_silent flag_unexpected_group
27 picture_frame DECO 1828.0 0.0007 22 True False False
12 dvd_player ELEC 1940.0 0.0008 114 True True False
20 led_strip LIGH 2854.0 0.0011 0 True False False
38 vase DECO 3237.0 0.0013 1 True False False
7 curtain DECO 6084.0 0.0024 3 True False False
26 parasol OUTD 7057.0 0.0028 5 True False False
29 shoe_rack STOR 7873.0 0.0031 2 True False False
30 side_table LIVI 7891.0 0.0031 9 True False False
34 speakers ELEC 9685.0 0.0038 15 True False False
14 floor_lamp LIGH 10111.0 0.0040 1 True False False
5 ceiling_light LIGH 10465.0 0.0041 0 True False False
1 bar_stool KITC 11532.0 0.0045 2 True False False
36 table_lamp LIGH 12634.0 0.0050 0 True False False
13 filing_cabinet OFFI 13299.0 0.0052 93 False True False

Reading the table:

  • flag_unexpected_group=True would catch products whose product_group code is outside the documented set (e.g., a pet-supplies code in a furniture catalog). On this dataset all groups match — clean.
  • flag_low_volume=True flags products under 0.5% of revenue. These aren’t automatic delisting candidates — but they earn a manual review, particularly when combined with other signals.
  • flag_long_silent=True catches products with no sales in the last 90 days. These are typical end-of-life candidates.

Decision rule for this analysis: products flagged by both low_volume AND long_silent are excluded from headline product-side findings. They remain in the dataset (downstream calculations still see them) but the recommendations table won’t single them out as “delist this”.

Code
discontinued = flagged[
    flagged["flag_low_volume"] & flagged["flag_long_silent"]
]["article_name"].tolist()
print(f"Treated as effectively discontinued: {discontinued}")
Treated as effectively discontinued: ['dvd_player']

This is the difference between “the analysis says delist dvd_player” (presented as insight) and “the catalog audit identifies dvd_player as already end-of-life — delisting is operational, not analytical” (presented as plumbing). Both are true; the second framing is more honest.

2. Bundle groups — definitional co-purchase

Some product pairs are functionally inseparable: you can’t have a bed without a mattress, a dining table is useless without chairs. When association-rule mining surfaces bed → mattress, that’s not a discovered insight about customer behavior — it’s the definition of the products.

Best practice: maintain a product hierarchy that knows which items are components of the same product system, and use the hierarchy as a filter on co-purchase findings.

The synthetic dataset already carries this structure in the bundle_group column:

Code
bundle_summary = (
    df[df["bundle_group"].fillna("").ne("")]
      .groupby("bundle_group")
      .agg(items=("article_name", lambda s: ", ".join(sorted(set(s)))),
           n_items=("article_name", "nunique"),
           revenue=("gross_price", "sum"),
           transactions=("transaction_id", "nunique"))
      .reset_index()
      .sort_values("revenue", ascending=False)
)
bundle_summary["revenue_share"] = bundle_summary["revenue"] / total_rev
bundle_summary[["bundle_group", "n_items", "items", "revenue", "revenue_share", "transactions"]]\
    .round({"revenue": 0, "revenue_share": 3})
bundle_group n_items items revenue revenue_share transactions
0 bed_system 4 bed, headboard, mattress, nightstand 687431.0 0.271 713
1 dining_system 4 dining_chair, dining_table, sideboard, table_e... 445446.0 0.175 778
4 office_system 4 bookshelf, desk, filing_cabinet, office_chair 158775.0 0.063 423
3 kitchen_system 3 bar_stool, kitchen_chair, kitchen_table 94714.0 0.037 398
2 garden_system 3 garden_chair, garden_table, parasol 45717.0 0.018 227

Each bundle has 3–4 functional components. Rules that fire within a bundle (e.g., bed → mattress — both in bed_system) are tagged “definitional”; rules that fire across bundles (e.g., dining_table → wine_glasses) are tagged “behavioral / cross-category” and graded as more interesting.
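The tagging rule can be sketched as a small lookup. This is a minimal illustration with a hand-built bundle map (the real membership comes from the bundle_group column, not from a hardcoded dict):

```python
# Hypothetical subset of the bundle map; in the chapter this mapping is
# derived from the bundle_group column of the transactions frame.
BUNDLE_OF = {
    "bed": "bed_system", "mattress": "bed_system",
    "dining_table": "dining_system", "dining_chair": "dining_system",
    "wine_glasses": None,  # standalone item, no bundle membership
}

def tag_rule(antecedent: str, consequent: str) -> str:
    """Tag a rule A -> B as definitional (same bundle) or behavioral."""
    a, b = BUNDLE_OF.get(antecedent), BUNDLE_OF.get(consequent)
    if a is not None and a == b:
        return "definitional"
    return "behavioral / cross-category"

print(tag_rule("bed", "mattress"))               # definitional
print(tag_rule("dining_table", "wine_glasses"))  # behavioral / cross-category
```

Standalone items (bundle `None`) can never produce a definitional tag, which matches the intent: only within-bundle co-purchase is filtered.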

Code
standalone_count = df[df["bundle_group"].fillna("").eq("")]["article_name"].nunique()
bundle_count = df[df["bundle_group"].fillna("").ne("")]["article_name"].nunique()
print(f"Bundle members:    {bundle_count} products in {bundle_summary['bundle_group'].nunique()} bundles")
print(f"Standalone items:  {standalone_count} products (not bundle components)")
Bundle members:    18 products in 5 bundles
Standalone items:  22 products (not bundle components)

3. Symmetry detection — the “you can’t have one without the other” signal

A second mechanical filter: rules where A → B and B → A both have high confidence are almost always definitional. Real behavioral rules are asymmetric: dining_table → dining_chair is high-confidence, but dining_chair → dining_table is much lower (people buy spare chairs without buying a new table).

We compute this systematically in the association chapter (chapter 01). The audit step here just establishes the framework: any rule with both directions ≥ 60% confidence earns the “symmetric” tag and is filtered from the headline list.
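The framework itself is a one-line predicate. A sketch with illustrative confidence values (the actual confidences are computed in chapter 01, not here):

```python
SYMMETRY_THRESHOLD = 0.60  # both directions must clear this bar

def is_symmetric(conf_ab: float, conf_ba: float,
                 threshold: float = SYMMETRY_THRESHOLD) -> bool:
    """A rule pair earns the 'symmetric' tag when both directions
    have confidence at or above the threshold."""
    return conf_ab >= threshold and conf_ba >= threshold

# bed -> mattress and mattress -> bed: both high, tagged symmetric
print(is_symmetric(0.85, 0.78))  # True
# dining_table -> dining_chair high, reverse low: stays in the headline list
print(is_symmetric(0.82, 0.31))  # False
```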

4. Product hierarchy — picking the right granularity per analysis

The catalog isn’t flat. It’s a 5-level hierarchy:

Department  →  Category (product_group)  →  Family (article_name)  →  Model  →  SKU (article_id)
   6                10                            40                  ~20      66

Different analyses want different levels:

Code
hier = (
    df.groupby("department")
      .agg(
          categories=("product_group", "nunique"),
          families=("article_name", "nunique"),
          skus=("article_id", "nunique"),
          revenue=("gross_price", "sum"),
          line_items=("line_item", "count"),
      )
      .sort_values("revenue", ascending=False)
)
hier["revenue_share"] = hier["revenue"] / hier["revenue"].sum()
hier.round({"revenue": 0, "revenue_share": 3})
categories families skus revenue line_items revenue_share
department
Living 4 18 29 956087.0 2447 0.377
Bedroom 1 5 9 769246.0 1190 0.303
Dining 2 7 15 540159.0 1689 0.213
Office 1 4 6 158775.0 567 0.063
Storage 1 3 3 68596.0 203 0.027
Outdoor 1 3 4 45717.0 296 0.018

A few observations the analyses below will lean on:

  • Living dominates — over a third of revenue. Any catalog-wide finding will reflect Living patterns disproportionately.
  • Outdoor and Storage are small — fewer than 10 SKUs each. Aggregating to Department lets us forecast their revenue at all (per-Category Outdoor would be too sparse).
  • Bedroom is a single category but high-revenue, mostly because beds + mattresses are big-ticket items. Family-level analysis here matters more than within-Department drilldown.

For each downstream analysis, the granularity choice goes into the chapter intro: forecasting at Department level (more data per series), clustering at Family level (interpretable items), embeddings at SKU level (finest functional differences).
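The trade-off behind that choice is rows per series. A minimal sketch of the check, using a toy frame with the column names this chapter assumes (not the actual transactions data):

```python
import pandas as pd

def series_stats(df: pd.DataFrame, levels: list[str]) -> pd.DataFrame:
    """For each hierarchy level: how many series exist, and how many
    line items each series gets on average (denser = easier to model)."""
    return pd.DataFrame({
        "n_series": [df[level].nunique() for level in levels],
        "rows_per_series": [len(df) / df[level].nunique() for level in levels],
    }, index=levels)

# Tiny illustrative frame; values are hypothetical.
toy = pd.DataFrame({
    "department":   ["Living", "Living", "Dining", "Dining"],
    "article_name": ["armchair", "sofa", "dining_table", "dining_chair"],
})
print(series_stats(toy, ["department", "article_name"]))
```

Coarser levels give fewer, denser series (better for forecasting); finer levels give more, sparser series (better for interpretability and embeddings).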

5. Audit summary — what the downstream chapters now know

After this chapter, every analysis downstream has access to:

Field              Meaning
department         One of 6 store-floor sections (above category)
bundle_group       Functional product-system membership (or empty for standalone)
flag_low_volume    (computed) < 0.5% of revenue
flag_long_silent   (computed) > 90 days since last sale
flag_discontinued  (computed) Both low_volume AND long_silent, operationally end-of-life
Code
audit_table = article_summary.assign(
    flag_discontinued=lambda d: d["flag_low_volume"] & d["flag_long_silent"]
)[["article_name", "product_group", "revenue", "revenue_share",
   "days_silent", "flag_low_volume", "flag_long_silent", "flag_discontinued"]]

print(f"{len(audit_table)} articles audited, "
      f"{int(audit_table['flag_discontinued'].sum())} flagged as discontinued")
audit_table.head(10).round({"revenue": 0, "revenue_share": 4})
40 articles audited, 1 flagged as discontinued
article_name product_group revenue revenue_share days_silent flag_low_volume flag_long_silent flag_discontinued
0 armchair LIVI 117653.0 0.0463 0 False False False
1 bar_stool KITC 11532.0 0.0045 2 True False False
2 bed BEDR 311953.0 0.1229 3 False False False
3 bookshelf OFFI 25920.0 0.0102 5 False False False
4 cabinet STOR 35225.0 0.0139 36 False False False
5 ceiling_light LIGH 10465.0 0.0041 0 True False False
6 coffee_table LIVI 102915.0 0.0405 3 False False False
7 curtain DECO 6084.0 0.0024 3 True False False
8 desk OFFI 52936.0 0.0209 5 False False False
9 dining_chair DINI 88480.0 0.0349 1 False False False

A note on iteration

This audit chapter is a refactor, not pristine first-thought work. It was added after the first analytical pass surfaced findings that domain experts would have caught instantly — the dvd_player situation, the bed → mattress rules, products lurking in the bottom of every cluster. Surfacing those earlier as “data quality / catalog hygiene” issues, rather than letting them appear as analytical conclusions, is exactly the iterative refinement that’s standard in real analytical work.

Showing the iteration here, instead of pretending the audit was always there, is the more honest framing — and arguably the more useful one to readers learning analytical practice.