---
title: "Data Audit — Catalog Hygiene & Bundle Tagging"
---
This chapter runs *before* any analysis. Its job: surface the things that would otherwise show up as embarrassing "findings" further downstream — products that don't belong in the catalog, items that are functional bundles where co-purchase is definitional rather than behavioral, and any structural issues a sound first pass would catch.
The methodology behind it is honest about its origin: **the checks here were added after a first analytical pass surfaced exactly these issues** — `dvd_player` flagged as "delisting candidate" when it was really a catalog-membership question, `bed → mattress` rules treated as insight when they're plumbing. Adding the audit step turns those moments of "wait, that's not insight" into a systematic upstream filter, while the rest of the chapters keep their analytical purity.
```{python}
#| label: setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")
from pathlib import Path

# Prefer the real extract; fall back to the synthetic dataset when it's absent
_raw = Path("data/raw/transactions.csv")
_data_path = _raw if _raw.exists() else Path("data/synthetic/transactions.csv")
df = pd.read_csv(_data_path, sep=";", parse_dates=["date"])
print(f"Raw transactions: {len(df):,d} line items, {df['transaction_id'].nunique():,d} baskets, "
      f"{df['customer_id'].nunique():,d} customers, "
      f"{df['article_name'].nunique()} unique product names")
```
## 1. Catalog hygiene — does every product belong?
A retail catalog has an *intended* category mix. For our furniture / home-goods retailer, the valid product groups are documented in `docs/DATA_SPEC.md`. The first audit check: are there products whose presence in the catalog warrants a second look?
**Heuristic checks:**
- **Product-name vs. expected category list** — does the name suggest a category?
- **Volume threshold** — does the product carry meaningful weight in the business, or is it a long-tail leftover?
- **Last-sale recency** — has it sold recently enough to still be "live"?
```{python}
#| label: catalog-audit
EXPECTED_GROUPS = {
    "DINI": "Dining",   "LIVI": "Living room",
    "BEDR": "Bedroom",  "OFFI": "Office",
    "LIGH": "Lighting", "ELEC": "Electronics",
    "DECO": "Decor",    "STOR": "Storage",
    "KITC": "Kitchen",  "OUTD": "Outdoor",
}
article_summary = (
    df.groupby(["article_name", "product_group"])
      .agg(line_items=("line_item", "count"),
           revenue=("gross_price", "sum"),
           last_sold=("date", "max"),
           first_sold=("date", "min"))
      .reset_index()
)
total_rev = article_summary["revenue"].sum()
article_summary["revenue_share"] = article_summary["revenue"] / total_rev
article_summary["days_silent"] = (df["date"].max() - article_summary["last_sold"]).dt.days
# Three audit flags
LOW_VOLUME_THRESHOLD = 0.005 # < 0.5% of revenue → review
LONG_SILENT_THRESHOLD = 90 # > 90 days no sale → review
article_summary["flag_low_volume"] = article_summary["revenue_share"] < LOW_VOLUME_THRESHOLD
article_summary["flag_long_silent"] = article_summary["days_silent"] > LONG_SILENT_THRESHOLD
article_summary["flag_unexpected_group"] = ~article_summary["product_group"].isin(EXPECTED_GROUPS)
flagged = article_summary[
    article_summary["flag_low_volume"]
    | article_summary["flag_long_silent"]
    | article_summary["flag_unexpected_group"]
]
flagged_view = flagged[
    ["article_name", "product_group", "revenue", "revenue_share",
     "days_silent", "flag_low_volume", "flag_long_silent", "flag_unexpected_group"]
].sort_values("revenue_share")
print(f"Flagged products: {len(flagged_view)} of {len(article_summary)} total\n")
flagged_view.round({"revenue": 0, "revenue_share": 4})
```
**Reading the table:**
- `flag_unexpected_group=True` would catch products whose `product_group` code is outside the documented set (e.g., a pet-supplies code in a furniture catalog). On this dataset all groups match — clean.
- `flag_low_volume=True` flags products under 0.5% of revenue. These aren't automatic delisting candidates — but they earn a manual review, particularly when combined with other signals.
- `flag_long_silent=True` catches products with no sales in the last 90 days. These are typical end-of-life candidates.
**Decision rule for this analysis:** products flagged by *both* `low_volume` AND `long_silent` are excluded from headline product-side findings. They remain in the dataset (downstream calculations still see them) but the recommendations table won't single them out as "delist this".
```{python}
#| label: discontinued-set
discontinued = flagged[
    flagged["flag_low_volume"] & flagged["flag_long_silent"]
]["article_name"].tolist()
print(f"Treated as effectively discontinued: {discontinued}")
```
This is the difference between *"the analysis says delist `dvd_player`"* (presented as insight) and *"the catalog audit identifies `dvd_player` as already end-of-life — delisting is operational, not analytical"* (presented as plumbing). Both are true; the second framing is more honest.
## 2. Bundle groups — definitional co-purchase
Some product pairs are functionally inseparable: you can't have a bed without a mattress, and a dining table is useless without chairs. When association-rule mining surfaces `bed → mattress`, that's *not* a discovered insight about customer behavior — it's the definition of the products.
Best practice: maintain a **product hierarchy** that knows which items are components of the same product system, and use the hierarchy as a filter on co-purchase findings.
The synthetic dataset already carries this structure in the `bundle_group` column:
```{python}
#| label: bundle-overview
bundle_summary = (
    df[df["bundle_group"].fillna("").ne("")]
    .groupby("bundle_group")
    .agg(items=("article_name", lambda s: ", ".join(sorted(set(s)))),
         n_items=("article_name", "nunique"),
         revenue=("gross_price", "sum"),
         transactions=("transaction_id", "nunique"))
    .reset_index()
    .sort_values("revenue", ascending=False)
)
bundle_summary["revenue_share"] = bundle_summary["revenue"] / total_rev
bundle_summary[
    ["bundle_group", "n_items", "items", "revenue", "revenue_share", "transactions"]
].round({"revenue": 0, "revenue_share": 3})
```
Each bundle has 3–4 functional components. Rules that fire *within* a bundle (e.g., `bed → mattress` — both in `bed_system`) are tagged "definitional"; rules that fire *across* bundles (e.g., `dining_table → wine_glasses`) are tagged "behavioral / cross-category" and graded as more interesting.
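The tagging itself is mechanical once an article-to-bundle mapping exists. A minimal sketch — the `article_to_bundle` mapping and the toy `rules` frame below are invented for illustration; the real mapping would be derived from the `bundle_group` column, and the real rules come from chapter 01:

```{python}
#| label: bundle-rule-tagging
import pandas as pd

# Hypothetical article -> bundle mapping (standalone items map to None)
article_to_bundle = {
    "bed": "bed_system", "mattress": "bed_system",
    "dining_table": "dining_system", "dining_chair": "dining_system",
    "wine_glasses": None,
}

# Toy stand-in for the association rules computed in chapter 01
rules = pd.DataFrame({
    "antecedent": ["bed", "dining_table"],
    "consequent": ["mattress", "wine_glasses"],
})

def tag_rule(row):
    """Definitional if both sides share a bundle, behavioral otherwise."""
    a = article_to_bundle.get(row["antecedent"])
    b = article_to_bundle.get(row["consequent"])
    return "definitional" if a is not None and a == b else "behavioral"

rules["tag"] = rules.apply(tag_rule, axis=1)
rules
```

`bed → mattress` lands in the same bundle and is tagged definitional; `dining_table → wine_glasses` crosses bundles and keeps its behavioral grade.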
```{python}
#| label: standalone-vs-bundle
standalone_count = df[df["bundle_group"].fillna("").eq("")]["article_name"].nunique()
bundle_count = df[df["bundle_group"].fillna("").ne("")]["article_name"].nunique()
print(f"Bundle members: {bundle_count} products in {bundle_summary['bundle_group'].nunique()} bundles")
print(f"Standalone items: {standalone_count} products (not bundle components)")
```
## 3. Symmetry detection — the "you can't have one without the other" signal
A second mechanical filter: rules where `A → B` and `B → A` both have high confidence are almost always definitional. Real behavioral rules are *asymmetric* — `dining_table → dining_chair` is high-confidence, but `dining_chair → dining_table` is much lower (people buy spare chairs without buying a new table).
We compute this systematically in the association chapter (chapter 01). The audit step here just establishes the framework: any rule with **both directions ≥ 60% confidence** earns the "symmetric" tag and is filtered from the headline list.
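The mechanics of the tag can be sketched in a few lines — a toy illustration with invented rule names and confidence values (the real computation, on real rules, lives in chapter 01):

```{python}
#| label: symmetry-sketch
import pandas as pd

# Illustrative rules with both directions present; confidences are made up
rules = pd.DataFrame({
    "antecedent": ["bed", "mattress", "dining_table", "dining_chair"],
    "consequent": ["mattress", "bed", "dining_chair", "dining_table"],
    "confidence": [0.85, 0.78, 0.70, 0.25],
})

# Join each rule to its reverse direction by swapping the key columns
reverse = rules.rename(columns={
    "antecedent": "consequent", "consequent": "antecedent",
    "confidence": "reverse_confidence",
})
rules = rules.merge(reverse, on=["antecedent", "consequent"], how="left")

# Symmetric = both directions at >= 60% confidence -> almost always definitional
rules["symmetric"] = (rules["confidence"] >= 0.6) & (rules["reverse_confidence"] >= 0.6)
rules
```

Note the asymmetric pair behaves as the prose predicts: `dining_table → dining_chair` keeps a high confidence while its reverse falls well under the threshold, so neither direction earns the symmetric tag. (A missing reverse rule yields `NaN`, and `NaN >= 0.6` evaluates to `False`, so one-directional rules are never tagged symmetric.)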
## 4. Product hierarchy — picking the right granularity per analysis
The catalog isn't flat. It's a 5-level hierarchy:
```
Department → Category (product_group) → Family (article_name) → Model → SKU (article_id)
     6                10                        40                 ~20        66
```
Different analyses want different levels:
```{python}
#| label: hierarchy-summary
hier = (
    df.groupby("department")
      .agg(categories=("product_group", "nunique"),
           families=("article_name", "nunique"),
           skus=("article_id", "nunique"),
           revenue=("gross_price", "sum"),
           line_items=("line_item", "count"))
      .sort_values("revenue", ascending=False)
)
hier["revenue_share"] = hier["revenue"] / hier["revenue"].sum()
hier.round({"revenue": 0, "revenue_share": 3})
```
A few observations the analyses below will lean on:
- **Living dominates** — over a third of revenue. Any catalog-wide finding will reflect Living patterns disproportionately.
- **Outdoor and Storage are small** — fewer than 10 SKUs each. Aggregating to Department level is what makes their revenue forecastable at all (per-Category Outdoor series would be too sparse).
- **Bedroom is a single category** but high-revenue, mostly because beds + mattresses are big-ticket items. Family-level analysis here matters more than within-Department drilldown.
For each downstream analysis, the granularity choice goes into the chapter intro: forecasting at Department level (more data per series), clustering at Family level (interpretable items), embeddings at SKU level (finest functional differences).
## 5. Audit summary — what the downstream chapters now know
After this chapter, every analysis downstream has access to:
| Field | Meaning |
|---|---|
| `department` | One of 6 store-floor sections (above category) |
| `bundle_group` | Functional product-system membership (or empty for standalone) |
| `flag_low_volume` (computed) | < 0.5% of revenue |
| `flag_long_silent` (computed) | > 90 days since last sale |
| `flag_discontinued` | Both `low_volume` AND `long_silent` — operationally end-of-life |
```{python}
#| label: audit-table
audit_table = article_summary.assign(
    flag_discontinued=lambda d: d["flag_low_volume"] & d["flag_long_silent"]
)[["article_name", "product_group", "revenue", "revenue_share",
   "days_silent", "flag_low_volume", "flag_long_silent", "flag_discontinued"]]
print(f"{len(audit_table)} articles audited, "
f"{int(audit_table['flag_discontinued'].sum())} flagged as discontinued")
audit_table.head(10).round({"revenue": 0, "revenue_share": 4})
```
## A note on iteration
This audit chapter is a refactor, not pristine first-thought work. It was added after the first analytical pass surfaced findings that domain experts would have caught instantly — the `dvd_player` situation, the `bed → mattress` rules, products lurking at the bottom of every cluster. Surfacing those earlier as "data quality / catalog hygiene" issues, rather than letting them appear as analytical conclusions, is exactly the iterative refinement that's standard in real analytical work.
Showing the iteration here, instead of pretending the audit was always there, is the more honest framing — and arguably the more useful one to readers learning analytical practice.