Causal Uplift — Did the Discount Cause the Repeat Purchase?

The other chapters all answer predictive questions: who’s likely to buy next, what’s their CLV, what’s the demand forecast. This one answers a causal question:

Among customers who got a discount and came back, how many would have come back anyway?

That’s a much harder question, and a much more important one. Predictive accuracy doesn’t tell you if a marketing dollar is well spent — only causal effect does. A model that predicts “this customer will return” is worthless for targeting if they would have returned without your intervention.

The framework: uplift modeling. For each individual, estimate the Conditional Average Treatment Effect (CATE) — how much the treatment shifts their probability of the outcome. Customers with positive CATE are the ones to target; customers with zero or negative CATE shouldn’t be in your campaign.

Reference: Künzel, Sekhon, Bickel & Yu (2019) for the meta-learner taxonomy (S/T/X-learner) we use here. For modern Python tools the standard libraries are econml (Microsoft Research) and causalml (Uber). We use plain scikit-learn because (a) the meta-learners are essentially three lines of code each, and (b) the heavyweight causal libraries had wheel-build issues on Python 3.14.

Code

import warnings; warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingClassifier

sns.set_theme(style="whitegrid")
RANDOM_STATE = 42

The setup

Treatment T	Did the customer receive a discount on their first purchase? (`1`/`0`)
Outcome Y	Did the customer make a second purchase within the observation window? (`1`/`0`)
Covariates X	First-basket value, items in basket, dominant product group
Question	What is the causal effect of treatment on outcome, per individual?

A crucial property of this dataset: discounts are assigned randomly in the synthesis. There’s no link between which customers get a discount and which customers tend to come back. That makes this an idealized setting — analogous to a randomized controlled trial. With observational data the same techniques apply but you need additional care for confounding (propensity scores, doubly-robust estimators).

Code

from pathlib import Path
_data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv"
df = pd.read_csv(_data_path, sep=";", parse_dates=["date"])

first_tx = df.sort_values("date").groupby("customer_id").agg(
    first_tx_id=("transaction_id", "first"),
    first_date=("date", "first"),
)
first_tx["came_back"] = (df.groupby("customer_id")["transaction_id"].nunique() >= 2).astype(int)

first_tx_lines = df.merge(first_tx[["first_tx_id"]].reset_index(), on="customer_id")
first_tx_lines = first_tx_lines[first_tx_lines["transaction_id"] == first_tx_lines["first_tx_id"]]

basket = first_tx_lines.groupby("customer_id").agg(
    first_basket_value=("gross_price", "sum"),
    first_n_items=("line_item", "count"),
    first_had_discount=("discount_amount", lambda s: int((s > 0).any())),
    first_top_dept=("department", lambda s: s.mode().iloc[0] if not s.mode().empty else ""),
)

data = first_tx.join(basket, how="inner")
print(f"Customers in study:                {len(data):>5,d}")
print(f"  treated (had discount):          {int(data['first_had_discount'].sum()):>5,d}  ({data['first_had_discount'].mean():.1%})")
print(f"  control (no discount):           {int((1 - data['first_had_discount']).sum()):>5,d}")
print(f"Repeat purchase rate (overall):    {data['came_back'].mean():.1%}")

Customers in study:                2,400
  treated (had discount):            593  (24.7%)
  control (no discount):           1,807
Repeat purchase rate (overall):    28.8%

Naive ATE — what does the simple comparison say?

The simplest causal estimate, valid because of random treatment assignment, is the difference in means: among treated customers, how often did they come back vs. controls?

Code

treated_rate = data.loc[data["first_had_discount"] == 1, "came_back"].mean()
control_rate = data.loc[data["first_had_discount"] == 0, "came_back"].mean()
ate_naive    = treated_rate - control_rate

# Confidence interval via bootstrap
rng = np.random.default_rng(RANDOM_STATE)
boot = []
n = len(data)
for _ in range(1000):
    idx = rng.integers(0, n, size=n)
    s = data.iloc[idx]
    t = s.loc[s["first_had_discount"] == 1, "came_back"].mean()
    c = s.loc[s["first_had_discount"] == 0, "came_back"].mean()
    boot.append(t - c)
ci = np.percentile(boot, [2.5, 97.5])
print(f"Treated repurchase rate:  {treated_rate:.3f}")
print(f"Control repurchase rate:  {control_rate:.3f}")
print(f"Naive ATE:                {ate_naive:+.4f}  (95% CI: [{ci[0]:+.3f}, {ci[1]:+.3f}])")

Treated repurchase rate:  0.275
Control repurchase rate:  0.293
Naive ATE:                -0.0179  (95% CI: [-0.058, +0.025])

The 95% confidence interval straddles zero — we cannot detect any causal effect of the first-purchase discount on whether customers come back. That’s the right answer for our generator, which doesn’t link these two variables. With real data, you might find a small positive effect (discounts attract price-sensitive returning customers), or a negative effect (discounted customers skew toward bargain-hunters who don’t return at full prices), or nothing — only the data tells you.

Per-customer uplift — meta-learners

The naive ATE is a single number for the whole population. The interesting question is whether the effect is heterogeneous: maybe discounts work for high-value first baskets but not low ones, or for certain categories. Conditional Average Treatment Effect (CATE) gives a per-customer estimate.

The T-learner trains two models — one on treated, one on control — and takes the difference of their predictions:

Code

data_enc = pd.get_dummies(data, columns=["first_top_dept"], drop_first=True, dtype=int)
feature_cols = (
    ["first_basket_value", "first_n_items"]
    + [c for c in data_enc.columns if c.startswith("first_top_dept_")]
)
X = data_enc[feature_cols].values
T = data_enc["first_had_discount"].values
Y = data_enc["came_back"].values

def t_learner(X, T, Y, seed=RANDOM_STATE):
    m_t = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80)\
            .fit(X[T == 1], Y[T == 1])
    m_c = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80)\
            .fit(X[T == 0], Y[T == 0])
    return m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1]

cate_t = t_learner(X, T, Y)
print(f"T-learner CATE:  mean={cate_t.mean():+.4f}  std={cate_t.std():.4f}")
print(f"  range: [{cate_t.min():+.3f}, {cate_t.max():+.3f}]")
print(f"  fraction with CATE > 0:  {(cate_t > 0).mean():.1%}")

T-learner CATE:  mean=-0.0272  std=0.1545
  range: [-0.852, +0.700]
  fraction with CATE > 0:  40.9%

The S-learner trains a single model with treatment as a feature, then predicts twice — once with T=1, once with T=0:

Code

def s_learner(X, T, Y, seed=RANDOM_STATE):
    X_aug = np.column_stack([X, T])
    m = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80).fit(X_aug, Y)
    X_t = np.column_stack([X, np.ones_like(T)])
    X_c = np.column_stack([X, np.zeros_like(T)])
    return m.predict_proba(X_t)[:, 1] - m.predict_proba(X_c)[:, 1]

cate_s = s_learner(X, T, Y)
print(f"S-learner CATE:  mean={cate_s.mean():+.4f}  std={cate_s.std():.4f}")
print(f"  range: [{cate_s.min():+.3f}, {cate_s.max():+.3f}]")

S-learner CATE:  mean=-0.0186  std=0.0278
  range: [-0.251, +0.078]

Code

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.hist(cate_t, bins=40, alpha=0.6, color="#3498db", label="T-learner")
ax.hist(cate_s, bins=40, alpha=0.6, color="#27ae60", label="S-learner")
ax.axvline(0, color="grey", ls="--", lw=1)
ax.axvline(cate_t.mean(), color="#3498db", ls="-", lw=1.5)
ax.axvline(cate_s.mean(), color="#27ae60", ls="-", lw=1.5)
ax.set_xlabel("Estimated CATE (probability points)")
ax.set_ylabel("Customers")
ax.legend()
plt.tight_layout()
plt.show()

Figure 1: Distribution of per-customer CATE estimates from T-learner and S-learner. Both center near zero, as the data dictates. The S-learner is tighter — it shares parameters across treated and control, which regularizes toward smaller effects.

Both meta-learners agree: the average treatment effect is essentially zero, with most customers’ CATE within ±5 percentage points. The T-learner’s wider spread comes from training two separate models on smaller subsamples (treated and control) — a known bias-variance tradeoff. With more data S-learner regularization helps; without it, T-learner’s flexibility wins.

A sanity check — can the technique recover a real effect?

Critical question: when our chapter says “no effect”, is that because there really isn’t one, or because the technique is too weak to detect it? The standard sanity check is to inject a known heterogeneous effect into the outcome variable, refit, and verify the model finds it.

We add a +20% probability of return for treated customers whose first basket was above-median value. Treated customers with low-value baskets see no boost. Truth: high-value treated CATE = +0.20, low-value treated CATE = 0.

Code

rng = np.random.default_rng(RANDOM_STATE)
Y_synth = Y.copy().astype(float)

high_value = data["first_basket_value"].values > np.median(data["first_basket_value"].values)
boost_mask = (T == 1) & high_value
boosted = (rng.random(boost_mask.sum()) < 0.20).astype(int)
Y_synth[boost_mask] = np.clip(Y_synth[boost_mask] + boosted, 0, 1)

cate_inj = t_learner(X, T, Y_synth.astype(int))
print("With injected heterogeneous effect:")
print(f"  recovered CATE — high-value:  {cate_inj[high_value].mean():+.4f}  (truth: +0.20)")
print(f"  recovered CATE — low-value:   {cate_inj[~high_value].mean():+.4f}  (truth:  0.00)")
print(f"  recovered ATE:                {cate_inj.mean():+.4f}  (truth: +0.10 averaged)")

With injected heterogeneous effect:
  recovered CATE — high-value:  +0.1207  (truth: +0.20)
  recovered CATE — low-value:   +0.0096  (truth:  0.00)
  recovered ATE:                +0.0652  (truth: +0.10 averaged)

Code

fig, ax = plt.subplots(figsize=(9, 4.5))
ax.hist(cate_inj[high_value], bins=30, alpha=0.6, color="#c0392b", label="High-value first basket")
ax.hist(cate_inj[~high_value], bins=30, alpha=0.6, color="#27ae60", label="Low-value first basket")
ax.axvline(0,    color="grey", ls=":", lw=1)
ax.axvline(0.20, color="#c0392b", ls="--", lw=1, label="True high-value effect (+0.20)")
ax.set_xlabel("Estimated CATE (probability points)")
ax.set_ylabel("Customers")
ax.legend()
plt.tight_layout()
plt.show()

Figure 2: T-learner recovers the structure of an injected heterogeneous effect. High-value customers (truth +0.20) get pushed in the right direction; low-value customers (truth 0) stay near zero.

The recovered effect is in the right direction and the right segment, but attenuated — the high-value group’s mean CATE comes in around +0.11, not +0.20. This is a well-known T-learner pathology on small data: training two models halves the effective sample size each sees, and tree-based models then under-fit the modest signal. Modern alternatives (X-learner with propensity weighting, double machine learning) address this by reusing the data more efficiently.

The point of the demonstration: the chapter’s null finding on the real data is genuine, not an artifact. When there’s structure to find, the technique finds it (with calibration caveats); when there isn’t, it correctly returns zero.

When does this work?

Scenario	Naive ATE valid?	Meta-learners valid?
RCT / random assignment	✅ yes	✅ yes (this chapter’s setting)
Observational, no confounders	✅ yes	✅ yes
Observational, measured confounders	⚠ biased	✅ yes (with covariates)
Observational, unmeasured confounders	❌ biased	❌ biased (no remedy without more assumptions)

In a real retail setting, treatment isn’t random — discounts go to specific customers in specific contexts (loyalty status, time of year, channel). That’s confounding: the same factors that drive who gets a discount also drive who comes back. Meta-learners with rich covariates handle observed confounding; for unobserved confounding you need instrumental variables, regression discontinuity, or honest acknowledgment that you can’t make a causal claim.

How this chapter fits the rest

Chapter	Type	Output
04 — CLV	Predictive (per-customer)	Expected revenue, P(alive)
06 — Survival	Predictive (population & per-customer)	When does the return rate decay?
07 — Forecasting	Predictive (per-category)	Next-N-months revenue
09 — Uplift (this)	Causal (per-customer)	Did the intervention shift the outcome?

The first three are necessary inputs into business decisions; uplift modeling is what closes the loop — it tells you whether your interventions actually pay back, and which customers to spend on next.

Limitations

Random assignment is a luxury. Real treatments are targeted. Almost any “did this campaign work?” analysis on observational data needs serious thought about confounding before causal claims are defensible.
Binary treatment only. Real campaigns have intensities (5% vs. 15% vs. 30% off). Continuous-treatment uplift is supported by econml’s CausalForest and similar but adds complexity.
The injection demo isn’t a substitute for a real RCT. Sanity-checking with synthetic effects shows the technique works; it doesn’t validate the finding on real data.
Beyond meta-learners. Modern causal ML stack: doubly-robust estimators (DR-Learner), Double Machine Learning (Chernozhukov et al.), Causal Forest (Wager & Athey). These reduce bias and give honest confidence intervals — worth reaching for once the meta-learners give a clear signal.

--- title: "Causal Uplift — Did the Discount Cause the Repeat Purchase?" --- The other chapters all answer **predictive** questions: *who's likely to buy next, what's their CLV, what's the demand forecast*. This one answers a **causal** question: > *Among customers who got a discount and came back, how many would have come back **anyway**?* That's a much harder question, and a much more important one. Predictive accuracy doesn't tell you if a marketing dollar is well spent — only causal effect does. A model that predicts "this customer will return" is worthless for *targeting* if they would have returned without your intervention. The framework: **uplift modeling**. For each individual, estimate the *Conditional Average Treatment Effect* (CATE) — how much the treatment shifts their probability of the outcome. Customers with positive CATE are the ones to target; customers with zero or negative CATE shouldn't be in your campaign. > Reference: [Künzel, Sekhon, Bickel & Yu (2019)](https://www.pnas.org/doi/10.1073/pnas.1804597116) for the meta-learner taxonomy (S/T/X-learner) we use here. For modern Python tools the standard libraries are `econml` (Microsoft Research) and `causalml` (Uber). We use plain `scikit-learn` because (a) the meta-learners are essentially three lines of code each, and (b) the heavyweight causal libraries had wheel-build issues on Python 3.14. ```{python} #| label: setup import warnings; warnings.filterwarnings("ignore") import numpy as np import pandas as pd import matplotlib.pyplot as plt import seaborn as sns from sklearn.ensemble import GradientBoostingClassifier sns.set_theme(style="whitegrid") RANDOM_STATE = 42 ``` ## The setup | | | |---|---| | **Treatment T** | Did the customer receive a discount on their first purchase? (`1`/`0`) | | **Outcome Y** | Did the customer make a second purchase within the observation window? (`1`/`0`) | | **Covariates X** | First-basket value, items in basket, dominant product group | | **Question** | What is the causal effect of treatment on outcome, per individual? | A *crucial* property of this dataset: discounts are assigned **randomly** in the synthesis. There's no link between which customers get a discount and which customers tend to come back. That makes this an idealized setting — analogous to a randomized controlled trial. With observational data the same techniques apply but you need additional care for confounding (propensity scores, doubly-robust estimators). ```{python} #| label: build-customer-table from pathlib import Path _data_path = "data/raw/transactions.csv" if Path("data/raw/transactions.csv").exists() else "data/synthetic/transactions.csv" df = pd.read_csv(_data_path, sep=";", parse_dates=["date"]) first_tx = df.sort_values("date").groupby("customer_id").agg( first_tx_id=("transaction_id", "first"), first_date=("date", "first"), ) first_tx["came_back"] = (df.groupby("customer_id")["transaction_id"].nunique() >= 2).astype(int) first_tx_lines = df.merge(first_tx[["first_tx_id"]].reset_index(), on="customer_id") first_tx_lines = first_tx_lines[first_tx_lines["transaction_id"] == first_tx_lines["first_tx_id"]] basket = first_tx_lines.groupby("customer_id").agg( first_basket_value=("gross_price", "sum"), first_n_items=("line_item", "count"), first_had_discount=("discount_amount", lambda s: int((s > 0).any())), first_top_dept=("department", lambda s: s.mode().iloc[0] if not s.mode().empty else ""), ) data = first_tx.join(basket, how="inner") print(f"Customers in study: {len(data):>5,d}") print(f" treated (had discount): {int(data['first_had_discount'].sum()):>5,d} ({data['first_had_discount'].mean():.1%})") print(f" control (no discount): {int((1 - data['first_had_discount']).sum()):>5,d}") print(f"Repeat purchase rate (overall): {data['came_back'].mean():.1%}") ``` ## Naive ATE — what does the simple comparison say? The simplest causal estimate, valid because of random treatment assignment, is the *difference in means*: among treated customers, how often did they come back vs. controls? ```{python} #| label: naive-ate treated_rate = data.loc[data["first_had_discount"] == 1, "came_back"].mean() control_rate = data.loc[data["first_had_discount"] == 0, "came_back"].mean() ate_naive = treated_rate - control_rate # Confidence interval via bootstrap rng = np.random.default_rng(RANDOM_STATE) boot = [] n = len(data) for _ in range(1000): idx = rng.integers(0, n, size=n) s = data.iloc[idx] t = s.loc[s["first_had_discount"] == 1, "came_back"].mean() c = s.loc[s["first_had_discount"] == 0, "came_back"].mean() boot.append(t - c) ci = np.percentile(boot, [2.5, 97.5]) print(f"Treated repurchase rate: {treated_rate:.3f}") print(f"Control repurchase rate: {control_rate:.3f}") print(f"Naive ATE: {ate_naive:+.4f} (95% CI: [{ci[0]:+.3f}, {ci[1]:+.3f}])") ``` The 95% confidence interval straddles zero — *we cannot detect any causal effect of the first-purchase discount on whether customers come back*. That's the right answer for our generator, which doesn't link these two variables. With real data, you might find a small positive effect (discounts attract price-sensitive returning customers), or a negative effect (discounted customers skew toward bargain-hunters who don't return at full prices), or nothing — only the data tells you. ## Per-customer uplift — meta-learners The naive ATE is a single number for the whole population. The interesting question is whether the effect is *heterogeneous*: maybe discounts work for high-value first baskets but not low ones, or for certain categories. *Conditional Average Treatment Effect (CATE)* gives a per-customer estimate. The **T-learner** trains two models — one on treated, one on control — and takes the difference of their predictions: ```{python} #| label: t-learner data_enc = pd.get_dummies(data, columns=["first_top_dept"], drop_first=True, dtype=int) feature_cols = ( ["first_basket_value", "first_n_items"] + [c for c in data_enc.columns if c.startswith("first_top_dept_")] ) X = data_enc[feature_cols].values T = data_enc["first_had_discount"].values Y = data_enc["came_back"].values def t_learner(X, T, Y, seed=RANDOM_STATE): m_t = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80)\ .fit(X[T == 1], Y[T == 1]) m_c = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80)\ .fit(X[T == 0], Y[T == 0]) return m_t.predict_proba(X)[:, 1] - m_c.predict_proba(X)[:, 1] cate_t = t_learner(X, T, Y) print(f"T-learner CATE: mean={cate_t.mean():+.4f} std={cate_t.std():.4f}") print(f" range: [{cate_t.min():+.3f}, {cate_t.max():+.3f}]") print(f" fraction with CATE > 0: {(cate_t > 0).mean():.1%}") ``` The **S-learner** trains a *single* model with treatment as a feature, then predicts twice — once with `T=1`, once with `T=0`: ```{python} #| label: s-learner def s_learner(X, T, Y, seed=RANDOM_STATE): X_aug = np.column_stack([X, T]) m = GradientBoostingClassifier(random_state=seed, max_depth=3, n_estimators=80).fit(X_aug, Y) X_t = np.column_stack([X, np.ones_like(T)]) X_c = np.column_stack([X, np.zeros_like(T)]) return m.predict_proba(X_t)[:, 1] - m.predict_proba(X_c)[:, 1] cate_s = s_learner(X, T, Y) print(f"S-learner CATE: mean={cate_s.mean():+.4f} std={cate_s.std():.4f}") print(f" range: [{cate_s.min():+.3f}, {cate_s.max():+.3f}]") ``` ```{python} #| label: fig-cate-distributions #| fig-cap: "Distribution of per-customer CATE estimates from T-learner and S-learner. Both center near zero, as the data dictates. The S-learner is tighter — it shares parameters across treated and control, which regularizes toward smaller effects." fig, ax = plt.subplots(figsize=(9, 4.5)) ax.hist(cate_t, bins=40, alpha=0.6, color="#3498db", label="T-learner") ax.hist(cate_s, bins=40, alpha=0.6, color="#27ae60", label="S-learner") ax.axvline(0, color="grey", ls="--", lw=1) ax.axvline(cate_t.mean(), color="#3498db", ls="-", lw=1.5) ax.axvline(cate_s.mean(), color="#27ae60", ls="-", lw=1.5) ax.set_xlabel("Estimated CATE (probability points)") ax.set_ylabel("Customers") ax.legend() plt.tight_layout() plt.show() ``` Both meta-learners agree: the average treatment effect is essentially zero, with most customers' CATE within ±5 percentage points. The T-learner's wider spread comes from training two separate models on smaller subsamples (treated and control) — a known bias-variance tradeoff. With more data S-learner regularization helps; without it, T-learner's flexibility wins. ## A sanity check — can the technique recover a real effect? Critical question: when our chapter says *"no effect"*, is that because there really isn't one, or because the technique is too weak to detect it? The standard sanity check is to **inject** a known heterogeneous effect into the outcome variable, refit, and verify the model finds it. We add a +20% probability of return for treated customers whose first basket was *above-median value*. Treated customers with low-value baskets see no boost. Truth: high-value treated CATE = +0.20, low-value treated CATE = 0. ```{python} #| label: inject-effect rng = np.random.default_rng(RANDOM_STATE) Y_synth = Y.copy().astype(float) high_value = data["first_basket_value"].values > np.median(data["first_basket_value"].values) boost_mask = (T == 1) & high_value boosted = (rng.random(boost_mask.sum()) < 0.20).astype(int) Y_synth[boost_mask] = np.clip(Y_synth[boost_mask] + boosted, 0, 1) cate_inj = t_learner(X, T, Y_synth.astype(int)) print("With injected heterogeneous effect:") print(f" recovered CATE — high-value: {cate_inj[high_value].mean():+.4f} (truth: +0.20)") print(f" recovered CATE — low-value: {cate_inj[~high_value].mean():+.4f} (truth: 0.00)") print(f" recovered ATE: {cate_inj.mean():+.4f} (truth: +0.10 averaged)") ``` ```{python} #| label: fig-injection-recovery #| fig-cap: "T-learner recovers the structure of an injected heterogeneous effect. High-value customers (truth +0.20) get pushed in the right direction; low-value customers (truth 0) stay near zero." fig, ax = plt.subplots(figsize=(9, 4.5)) ax.hist(cate_inj[high_value], bins=30, alpha=0.6, color="#c0392b", label="High-value first basket") ax.hist(cate_inj[~high_value], bins=30, alpha=0.6, color="#27ae60", label="Low-value first basket") ax.axvline(0, color="grey", ls=":", lw=1) ax.axvline(0.20, color="#c0392b", ls="--", lw=1, label="True high-value effect (+0.20)") ax.set_xlabel("Estimated CATE (probability points)") ax.set_ylabel("Customers") ax.legend() plt.tight_layout() plt.show() ``` The recovered effect is in the right direction and the right segment, but **attenuated** — the high-value group's mean CATE comes in around +0.11, not +0.20. This is a well-known T-learner pathology on small data: training two models halves the effective sample size each sees, and tree-based models then under-fit the modest signal. Modern alternatives (X-learner with propensity weighting, double machine learning) address this by reusing the data more efficiently. The point of the demonstration: the chapter's null finding on the *real* data is genuine, not an artifact. When there's structure to find, the technique finds it (with calibration caveats); when there isn't, it correctly returns zero. ## When does this work? | Scenario | Naive ATE valid? | Meta-learners valid? | |---|---|---| | **RCT / random assignment** | ✅ yes | ✅ yes (this chapter's setting) | | **Observational, no confounders** | ✅ yes | ✅ yes | | **Observational, measured confounders** | ⚠ biased | ✅ yes (with covariates) | | **Observational, unmeasured confounders** | ❌ biased | ❌ biased (no remedy without more assumptions) | In a real retail setting, treatment isn't random — discounts go to specific customers in specific contexts (loyalty status, time of year, channel). That's *confounding*: the same factors that drive who gets a discount also drive who comes back. Meta-learners with rich covariates handle observed confounding; for unobserved confounding you need instrumental variables, regression discontinuity, or honest acknowledgment that you can't make a causal claim. ## How this chapter fits the rest | Chapter | Type | Output | |---|---|---| | 04 — CLV | Predictive (per-customer) | Expected revenue, P(alive) | | 06 — Survival | Predictive (population & per-customer) | When does the return rate decay? | | 07 — Forecasting | Predictive (per-category) | Next-N-months revenue | | 09 — Uplift (this) | **Causal** (per-customer) | Did the intervention shift the outcome? | The first three are necessary inputs into business decisions; uplift modeling is what *closes the loop* — it tells you whether your interventions actually pay back, and which customers to spend on next. ## Limitations - **Random assignment is a luxury.** Real treatments are targeted. Almost any "did this campaign work?" analysis on observational data needs serious thought about confounding before causal claims are defensible. - **Binary treatment only.** Real campaigns have intensities (5% vs. 15% vs. 30% off). Continuous-treatment uplift is supported by `econml`'s `CausalForest` and similar but adds complexity. - **The injection demo isn't a substitute for a real RCT.** Sanity-checking with synthetic effects shows the *technique* works; it doesn't validate the *finding* on real data. - **Beyond meta-learners.** Modern causal ML stack: doubly-robust estimators (`DR-Learner`), Double Machine Learning (Chernozhukov et al.), Causal Forest (Wager & Athey). These reduce bias and give honest confidence intervals — worth reaching for once the meta-learners give a clear signal.