Microsoft observational/experimental product-release benchmark

Self-contained exposition of Young and Dillon’s data, estimators, and replication results

Author

Krabbs

Published

May 27, 2026

Show code
import sys
from pathlib import Path
import numpy as np
import pandas as pd
from IPython.display import display, Markdown

from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeCV
from sklearn.neighbors import NearestNeighbors
from lightgbm import LGBMClassifier, LGBMRegressor
import statsmodels.api as sm
import matplotlib.pyplot as plt

sys.path.append("/Users/alal/Desktop/code/aipyw")
sys.path.append("/Users/alal/tmp/krabbs_the_koder/fastmatch")
from aipyw import AIPyW
from fastmatch import Matching

ROOT = Path.cwd()
DATA = ROOT / "Data"
OUT = ROOT / "outputs"
OUT.mkdir(exist_ok=True)
pd.set_option("display.max_colwidth", 120)

One-paragraph summary

Young and Dillon’s paper, Reevaluating Causal Estimation Methods with Data from a Product Release, releases a rare validation dataset: a large randomized product experiment and a simultaneous observational adoption sample for the same masked Microsoft feature, measured with the same outcomes and covariates. The experimental sample gives a ground-truth benchmark; the observational sample mimics the empirical problem researchers usually face, where treatment adoption is endogenous. The paper’s central message is nuanced: for the continuous outcome, modern observational estimators can recover the experimental benchmark, but only after careful design choices—especially overlap trimming, well-tuned nuisance models, cross-fitting, and ensembling. For the binary outcome, the same best practices do not recover the benchmark, showing that flexible estimators cannot solve omitted-confounder problems when the measured covariates do not contain the right signal.

This memo is based on the local arXiv source at:

/Users/alal/tmp/krabbs_the_koder/rcem_arxiv_2601_11845/source/paper_1225.tex

and the replication archive at:

/Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods

All estimator tables below are computed inside this document. The code is folded, not replaced by precomputed estimator CSVs.

Load the public data

The public archive contains two parquet files with the same 44-column schema:

  • FINAL_PUBLIC_experimental.parquet: randomized A/B experiment.
  • FINAL_PUBLIC_observed.parquet: observational sample where users/devices endogenously adopted the feature.

The treatment is New_Feature. The two outcomes are Outcome_Continuous and Outcome_Binary. This memo focuses on the continuous outcome, because that is where the paper’s main positive result appears.

Show code
TREATMENT = "New_Feature"
OUTCOME = "Outcome_Continuous"
BINARY_OUTCOME = "Outcome_Binary"

COVARIATES = [
    "Device_Spec_A", "Total_Device_Usage",
    "Browser_1_Usage", "Browser_2_Usage", "Browser_3_Usage", "Browser_4_Usage", "Other_Browser_Usage",
    "EngagementSegment_General", "EngagementSegment_Highly Engaged", "EngagementSegment_Inactive", "EngagementSegment_Low Engaged",
    "A14Region_APAC", "A14Region_CEE", "A14Region_Canada", "A14Region_France", "A14Region_Germany",
    "A14Region_Greater China", "A14Region_India", "A14Region_Japan", "A14Region_Korea", "A14Region_Latam",
    "A14Region_MEA", "A14Region_UK", "A14Region_United States", "A14Region_Western Europe",
    "AppCategoryCohort_Developer", "AppCategoryCohort_Gamer", "AppCategoryCohort_General", "AppCategoryCohort_Media", "AppCategoryCohort_Productivity",
    "Manufacturer_1", "Manufacturer_2", "Manufacturer_3", "Manufacturer_4", "Manufacturer_5", "Manufacturer_6",
    "Device_Spec_B_1", "Device_Spec_B_2", "Device_Spec_C_1", "Device_Spec_C_2", "Device_Spec_C_3",
]


def prep_public(df):
    out = df[[OUTCOME, BINARY_OUTCOME, TREATMENT] + COVARIATES].copy()
    out = out.rename(columns={OUTCOME: "y", BINARY_OUTCOME: "yb", TREATMENT: "D"})
    out["D"] = out["D"].astype(int)
    return out.reset_index(drop=True)

obs_raw = pd.read_parquet(DATA / "FINAL_PUBLIC_observed.parquet")
exp_raw = pd.read_parquet(DATA / "FINAL_PUBLIC_experimental.parquet")
obs_df = prep_public(obs_raw)
exp_df = prep_public(exp_raw)

summary_rows = []
for name, df in [("Experimental", exp_df), ("Observed", obs_df)]:
    y1, y0 = df.loc[df.D == 1, "y"], df.loc[df.D == 0, "y"]
    b1, b0 = df.loc[df.D == 1, "yb"], df.loc[df.D == 0, "yb"]
    summary_rows.append({
        "dataset": name,
        "n": len(df),
        "treated": int(df.D.sum()),
        "controls": int((1 - df.D).sum()),
        "treated_share": df.D.mean(),
        "Outcome_Continuous_mean_treated": y1.mean(),
        "Outcome_Continuous_mean_control": y0.mean(),
        "Outcome_Continuous_diff_t_minus_c": y1.mean() - y0.mean(),
        "Outcome_Binary_mean_treated": b1.mean(),
        "Outcome_Binary_mean_control": b0.mean(),
        "Outcome_Binary_diff_t_minus_c": b1.mean() - b0.mean(),
    })
summary = pd.DataFrame(summary_rows)
display(summary.round(4))
dataset n treated controls treated_share Outcome_Continuous_mean_treated Outcome_Continuous_mean_control Outcome_Continuous_diff_t_minus_c Outcome_Binary_mean_treated Outcome_Binary_mean_control Outcome_Binary_diff_t_minus_c
0 Experimental 435170 11475 423695 0.0264 4.4378 4.0052 0.4326 0.6010 0.9858 -0.3847
1 Observed 445286 20189 425097 0.0453 5.2691 4.0614 1.2077 0.4537 0.9858 -0.5320

The naive observational difference in means has the correct qualitative sign for the continuous outcome but differs meaningfully from the experimental benchmark. The whole point of the paper is to ask whether design-conscious observational estimators can remove that selection bias.

Why this is not a LaLonde clone

The design differs from the classic LaLonde/Dehejia-Wahba exercise. In those benchmarks, the experimental treated units are often reused and paired with nonexperimental controls from a different source such as CPS or PSID. Here, the experimental and observational samples are parallel product-release samples. There are treated and untreated units in both samples, and the treated observational units are not simply the experimental treated units reused.

Show code
import pandas.util


def row_hashes(df, cols):
    return set(pd.util.hash_pandas_object(df[cols], index=False).astype("uint64"))

checks = []
for label, cols in [
    ("full rows, treated only", ["y", "yb", "D"] + COVARIATES),
    ("covariates+outcomes, treated only", ["y", "yb"] + COVARIATES),
    ("covariates only, treated only", COVARIATES),
    ("full rows, all units", ["y", "yb", "D"] + COVARIATES),
    ("covariates+outcomes, all units", ["y", "yb"] + COVARIATES),
    ("covariates only, all units", COVARIATES),
]:
    ex = exp_df.loc[exp_df.D == 1] if "treated only" in label else exp_df
    ob = obs_df.loc[obs_df.D == 1] if "treated only" in label else obs_df
    hx, ho = row_hashes(ex, cols), row_hashes(ob, cols)
    checks.append({
        "check": label,
        "exp_rows": len(ex),
        "obs_rows": len(ob),
        "unique_exp_hashes": len(hx),
        "unique_obs_hashes": len(ho),
        "intersection_hashes": len(hx & ho),
        "intersection_over_exp_unique": len(hx & ho) / len(hx),
    })
overlap = pd.DataFrame(checks)
display(overlap.round(6))
check exp_rows obs_rows unique_exp_hashes unique_obs_hashes intersection_hashes intersection_over_exp_unique
0 full rows, treated only 11475 20189 11473 20189 1 0.000087
1 covariates+outcomes, treated only 11475 20189 11473 20189 1 0.000087
2 covariates only, treated only 11475 20189 11471 20189 2 0.000174
3 full rows, all units 435170 445286 433980 444031 505 0.001164
4 covariates+outcomes, all units 435170 445286 433948 444027 510 0.001175
5 covariates only, all units 435170 445286 433717 443807 518 0.001194

Exact row overlap between the experimental and observational treated units is essentially zero.

Balance and selection

Randomization in the experimental sample should make treatment close to independent of covariates. Endogenous adoption in the observational sample should not. Standardized mean differences show that pattern.

Show code
def smd_table(df, label):
    rows = []
    for c in COVARIATES:
        x1 = df.loc[df.D == 1, c].astype(float)
        x0 = df.loc[df.D == 0, c].astype(float)
        pooled = np.sqrt((x1.var() + x0.var()) / 2)
        smd = np.nan if pooled == 0 else (x1.mean() - x0.mean()) / pooled
        rows.append({"dataset": label, "covariate": c, "smd": smd, "abs_smd": abs(smd)})
    return pd.DataFrame(rows)

balance = pd.concat([smd_table(exp_df, "Experimental"), smd_table(obs_df, "Observed")], ignore_index=True)
bal_summary = balance.groupby("dataset")["abs_smd"].agg(["mean", "median", "max"])
bal_summary["n_abs_smd_gt_0.1"] = balance.assign(big=balance.abs_smd > 0.1).groupby("dataset")["big"].sum()
display(bal_summary.round(4))

top_covars = (
    balance.loc[balance.dataset.eq("Observed")]
    .sort_values("abs_smd", ascending=False)
    .head(25)["covariate"]
    .tolist()
)
plot_df = balance.loc[balance.covariate.isin(top_covars)].copy()
plot_df["covariate"] = pd.Categorical(plot_df["covariate"], categories=list(reversed(top_covars)), ordered=True)

fig, ax = plt.subplots(figsize=(9, 8))
for dataset, marker in [("Experimental", "o"), ("Observed", "s")]:
    sub = plot_df.loc[plot_df.dataset.eq(dataset)].sort_values("covariate")
    ax.scatter(sub["abs_smd"], sub["covariate"], label=dataset, marker=marker, s=55)
ax.axvline(0.1, color="tab:red", linestyle="--", linewidth=1, label="0.1 rule of thumb")
ax.set_xlabel("Absolute standardized mean difference")
ax.set_ylabel("")
ax.set_title("Top 25 observational covariate imbalances")
ax.grid(axis="x", alpha=0.25)
ax.legend(loc="lower right")
fig.tight_layout()
fig.savefig(OUT / "balance_smd_top25_doc.png", dpi=180, bbox_inches="tight")
plt.close(fig)

display(balance.sort_values("abs_smd", ascending=False).head(12).round(4))
mean median max n_abs_smd_gt_0.1
dataset
Experimental 0.0250 0.0218 0.1097 1
Observed 0.0894 0.0391 0.5108 9
dataset covariate smd abs_smd
49 Observed EngagementSegment_Highly Engaged 0.5108 0.5108
51 Observed EngagementSegment_Low Engaged -0.4875 0.4875
42 Observed Total_Device_Usage 0.4116 0.4116
80 Observed Device_Spec_C_2 0.3307 0.3307
44 Observed Browser_2_Usage 0.2196 0.2196
41 Observed Device_Spec_A -0.1557 0.1557
43 Observed Browser_1_Usage 0.1556 0.1556
78 Observed Device_Spec_B_2 -0.1436 0.1436
75 Observed Manufacturer_5 -0.1285 0.1285
10 Experimental EngagementSegment_Low Engaged -0.1097 0.1097
70 Observed AppCategoryCohort_Productivity -0.0970 0.0970
8 Experimental EngagementSegment_Highly Engaged 0.0934 0.0934

The largest observational imbalances are in variables that plausibly predict voluntary feature adoption: engagement intensity, total usage, browser usage, and device specs.

Estimands and assumptions

Let \(D_i \in \{0,1\}\) indicate feature adoption and \(Y_i(1),Y_i(0)\) denote potential outcomes. The observed outcome is

\[ Y_i = D_iY_i(1)+(1-D_i)Y_i(0). \]

The main product-release analysis targets the average treatment effect (ATE):

\[ \tau_{ATE}=\mathbb{E}[Y_i(1)-Y_i(0)]. \]

The observational estimators rely on unconfoundedness,

\[ (Y_i(1),Y_i(0))\perp D_i\mid X_i, \]

and overlap,

\[ 0 < e(X_i) < 1, \qquad e(X_i)=\mathbb{P}(D_i=1\mid X_i). \]

The experimental sample supplies a validation benchmark. The observational sample tells us whether methods based on measured \(X_i\) can recover that benchmark.

Nuisance models and trimming

Most estimators below use the outcome nuisance functions

\[ \mu_d(x)=\mathbb{E}[Y_i\mid D_i=d,X_i=x] \]

and/or the propensity score

\[ e(x)=\mathbb{P}(D_i=1\mid X_i=x). \]

The paper’s preferred workflow is: tune flexible nuisance learners, cross-fit them, ensemble across fits, and trim to common support. The code below uses a lighter but explicit in-document version: LightGBM nuisance fits with fixed hyperparameters and 3-fold cross-fitting.

Crump-style trimming keeps observations satisfying

\[ \alpha\leq \hat e(X_i)\leq 1-\alpha. \]

Show code
def make_X(df):
    X = df[COVARIATES].copy()
    for col in X.columns:
        if X[col].dtype == bool:
            X[col] = X[col].astype(np.float32)
    return X.astype(np.float32)


def lgbm_propensity_model(seed=1):
    return LGBMClassifier(
        n_estimators=60, learning_rate=0.05, num_leaves=31,
        min_child_samples=200, subsample=0.85, colsample_bytree=0.85,
        random_state=seed, n_jobs=-1, verbose=-1,
    )


def lgbm_outcome_model(seed=1):
    return LGBMRegressor(
        n_estimators=60, learning_rate=0.05, num_leaves=31,
        min_child_samples=200, subsample=0.85, colsample_bytree=0.85,
        random_state=seed, n_jobs=-1, verbose=-1,
    )


def fit_propensity_full(df):
    model = lgbm_propensity_model(seed=11)
    X = make_X(df)
    model.fit(X, df["D"].to_numpy())
    return model, model.predict_proba(X)[:, 1]


def crump_alpha(e):
    e = np.asarray(e, dtype=float)
    v = 1.0 / (e * (1 - e))
    grid = np.linspace(0.0, 0.2, 1001)
    # Direct grid minimization of the asymptotic variance proxy over symmetric cutoffs.
    vals = []
    for a in grid:
        m = (e >= a) & (e <= 1 - a)
        vals.append(np.inf if m.sum() == 0 else v[m].mean() / m.mean())
    return float(grid[int(np.argmin(vals))])

trim_propensity_model, obs_propensity_for_trim = fit_propensity_full(obs_df)
trim_alpha = crump_alpha(obs_propensity_for_trim)
obs_trim_mask = (obs_propensity_for_trim >= trim_alpha) & (obs_propensity_for_trim <= 1 - trim_alpha)
obs_trimmed_df = obs_df.loc[obs_trim_mask].reset_index(drop=True)

# Apply the same observational support rule to experimental covariates for the trimmed benchmark.
exp_propensity_for_support = trim_propensity_model.predict_proba(make_X(exp_df))[:, 1]
exp_trim_mask = (exp_propensity_for_support >= trim_alpha) & (exp_propensity_for_support <= 1 - trim_alpha)
exp_trimmed_df = exp_df.loc[exp_trim_mask].reset_index(drop=True)

Markdown(f"Estimated in-document trim threshold: `{trim_alpha:.4f}`; retained `{len(obs_trimmed_df):,}` of `{len(obs_df):,}` observational rows (`{len(obs_trimmed_df)/len(obs_df):.1%}`) and `{len(exp_trimmed_df):,}` of `{len(exp_df):,}` experimental rows (`{len(exp_trimmed_df)/len(exp_df):.1%}`).")

Estimated in-document trim threshold: 0.0254; retained 312,504 of 445,286 observational rows (70.2%) and 301,195 of 435,170 experimental rows (69.2%).

Estimator code

This section defines and runs the estimators directly in the document.

Cross-fitted nuisance functions

Show code
def crossfit_nuisance(df, outcome_col="y", n_splits=2, seed=123):
    # Fast, deterministic linear nuisance specification used for all in-document estimates.
    # The same specification is applied to continuous and binary outcomes.
    X_raw = make_X(df)
    X_np = StandardScaler().fit_transform(X_raw).astype(float)
    X = pd.DataFrame(X_np, columns=COVARIATES)
    d = df["D"].to_numpy(dtype=int)
    y = df[outcome_col].to_numpy(dtype=float)
    e_hat = np.zeros(len(df))
    mu0_hat = np.zeros(len(df))
    mu1_hat = np.zeros(len(df))

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        X_train, X_test = X.iloc[train], X.iloc[test]
        d_train, y_train = d[train], y[train]

        # Linear probability propensity nuisance: fast and applied identically to both outcomes.
        prop = LinearRegression()
        prop.fit(X_train, d_train)
        e_hat[test] = prop.predict(X_test)

        mu0 = LinearRegression()
        mu1 = LinearRegression()
        mu0.fit(X_train[d_train == 0], y_train[d_train == 0])
        mu1.fit(X_train[d_train == 1], y_train[d_train == 1])
        mu0_hat[test] = mu0.predict(X_test)
        mu1_hat[test] = mu1.predict(X_test)

    eps = 1e-3
    e_hat = np.clip(e_hat, eps, 1 - eps)
    return y, d, X, e_hat, mu0_hat, mu1_hat

Regression, OM, IPW, PSM, and DR/AIPW

Show code
def estimate_core_methods(df, sample_name, target, outcome_col="y"):
    y, d, X, e_hat, mu0_hat, mu1_hat = crossfit_nuisance(df, outcome_col=outcome_col)
    rows = []

    # 1. Regression adjustment: Y ~ D + X.
    X_reg = np.column_stack([d, X.to_numpy(dtype=float)])
    reg = LinearRegression().fit(X_reg, y)
    reg_pred = reg.predict(X_reg)
    reg_score = reg.coef_[0] + (y - reg_pred) * (d - d.mean()) / np.mean((d - d.mean()) ** 2)
    rows.append({"sample": sample_name, "estimator": "Reg", "estimate": reg.coef_[0], "se": reg_score.std(ddof=1) / np.sqrt(len(df))})

    # 2. Outcome modeling / Oaxaca-Blinder.
    om_score = mu1_hat - mu0_hat
    rows.append({"sample": sample_name, "estimator": "OM", "estimate": om_score.mean(), "se": om_score.std(ddof=1) / np.sqrt(len(df))})

    # 3. IPW.
    ipw_score = d * y / e_hat - (1 - d) * y / (1 - e_hat)
    rows.append({"sample": sample_name, "estimator": "IPW", "estimate": ipw_score.mean(), "se": ipw_score.std(ddof=1) / np.sqrt(len(df))})

    # 4. Propensity-score matching, ATE form, k=1. Fast exact 1D nearest-neighbor search on e_hat.
    treated = np.where(d == 1)[0]
    control = np.where(d == 0)[0]

    def nearest_values(query, ref_score, ref_y):
        order = np.argsort(ref_score)
        rs = ref_score[order]
        ry = ref_y[order]
        pos = np.searchsorted(rs, query)
        left = np.clip(pos - 1, 0, len(rs) - 1)
        right = np.clip(pos, 0, len(rs) - 1)
        choose_right = np.abs(rs[right] - query) < np.abs(rs[left] - query)
        idx = np.where(choose_right, right, left)
        return ry[idx]

    y0_t = nearest_values(e_hat[treated], e_hat[control], y[control])
    y1_c = nearest_values(e_hat[control], e_hat[treated], y[treated])
    psm_scores = np.r_[y[treated] - y0_t, y1_c - y[control]]
    rows.append({"sample": sample_name, "estimator": "PSM", "estimate": psm_scores.mean(), "se": psm_scores.std(ddof=1) / np.sqrt(len(psm_scores))})

    # 5. Doubly robust / AIPW.
    mu_d = np.where(d == 1, mu1_hat, mu0_hat)
    dr_score = (mu1_hat - mu0_hat) + (d / e_hat - (1 - d) / (1 - e_hat)) * (y - mu_d)
    rows.append({"sample": sample_name, "estimator": "DR/AIPW", "estimate": dr_score.mean(), "se": dr_score.std(ddof=1) / np.sqrt(len(df))})

    out = pd.DataFrame(rows)
    out["experimental_target"] = target
    out["abs_error"] = (out["estimate"] - target).abs()
    return out

# Experimental DR benchmarks are fit with the same code path, not read from a CSV.
exp_target_untrimmed = estimate_core_methods(exp_df, "Exp Untrimmed", np.nan).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
exp_target_trimmed = estimate_core_methods(exp_trimmed_df, "Exp Trimmed", np.nan).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
experimental_targets = pd.DataFrame([
    {"sample": "Exp Untrimmed", "estimate": exp_target_untrimmed},
    {"sample": "Exp Trimmed", "estimate": exp_target_trimmed},
])
display(experimental_targets.round(4))

core_untrimmed = estimate_core_methods(obs_df, "Obs Untrimmed", exp_target_untrimmed)
core_trimmed = estimate_core_methods(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed)
core_estimates = pd.concat([core_untrimmed, core_trimmed], ignore_index=True).sort_values(["sample", "abs_error"])
display(core_estimates.round(4))
sample estimate
0 Exp Untrimmed 0.1555
1 Exp Trimmed 0.2183
sample estimator estimate se experimental_target abs_error
6 Obs Trimmed OM 0.2727 0.0011 0.2183 0.0544
5 Obs Trimmed Reg 0.2875 0.0572 0.2183 0.0692
9 Obs Trimmed DR/AIPW 0.2928 0.0600 0.2183 0.0745
8 Obs Trimmed PSM 0.2982 0.0192 0.2183 0.0799
7 Obs Trimmed IPW 0.3343 0.0734 0.2183 0.1159
3 Obs Untrimmed PSM 0.2749 0.0148 0.1555 0.1194
1 Obs Untrimmed OM 0.2925 0.0010 0.1555 0.1370
0 Obs Untrimmed Reg 0.3255 0.0540 0.1555 0.1700
4 Obs Untrimmed DR/AIPW 0.5671 0.1711 0.1555 0.4116
2 Obs Untrimmed IPW 0.7955 0.1885 0.1555 0.6400

The code above is deliberately explicit: each estimator is fit in this document. It is not reading outputs/trimmed_estimates.csv or outputs/untrimmed_estimates.csv.

AIPyW estimators, including balancing Riesz variants

The balancing Riesz estimators are not separate from aipyw; they are part of the same AIPyW surface through riesz_method="balancing" and bal_obj. I now report them together with the ordinary AIPyW IPW-Riesz version.

Show code
def scaled_numpy(df, outcome_col="y"):
    X = make_X(df)
    Xs = StandardScaler().fit_transform(X).astype(np.float32)
    d = df["D"].to_numpy(dtype=int)
    y = df[outcome_col].to_numpy(dtype=float)
    return Xs, d, y


def fit_aipyw_variants(df, sample_name, target, outcome_col="y"):
    X, d, y = scaled_numpy(df, outcome_col=outcome_col)
    variants = [
        ("aipyw_ipw_riesz", dict(riesz_method="ipw", bal_obj="quadratic")),
        ("aipyw_quadratic_balancing_riesz", dict(riesz_method="balancing", bal_obj="quadratic")),
        ("aipyw_entropy_balancing_riesz", dict(riesz_method="balancing", bal_obj="entropy")),
    ]
    rows = []
    for label, kwargs in variants:
        fit = AIPyW(
            propensity_model=LogisticRegression(max_iter=1000),
            outcome_model=LinearRegression(),
            n_splits=2,
            **kwargs,
        ).fit(X, d, y)
        psi = fit.influence_functions[:, 1] - fit.influence_functions[:, 0]
        rows.append({
            "sample": sample_name,
            "estimator": label,
            "estimate": psi.mean(),
            "se": psi.std(ddof=1) / np.sqrt(len(psi)),
            "experimental_target": target,
            "abs_error": abs(psi.mean() - target),
        })
    return pd.DataFrame(rows)

aipyw_estimates = pd.concat([
    fit_aipyw_variants(obs_df, "Obs Untrimmed", exp_target_untrimmed),
    fit_aipyw_variants(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(aipyw_estimates.round(4))
sample estimator estimate se experimental_target abs_error
4 Obs Trimmed aipyw_quadratic_balancing_riesz 0.2800 0.0011 0.2183 0.0617
5 Obs Trimmed aipyw_entropy_balancing_riesz 0.2800 0.0011 0.2183 0.0617
3 Obs Trimmed aipyw_ipw_riesz 0.2812 0.0578 0.2183 0.0629
2 Obs Untrimmed aipyw_entropy_balancing_riesz 0.2881 0.0010 0.1555 0.1326
1 Obs Untrimmed aipyw_quadratic_balancing_riesz 0.2881 0.0010 0.1555 0.1326
0 Obs Untrimmed aipyw_ipw_riesz 0.3294 0.0669 0.1555 0.1740

The ordinary IPW-Riesz variant estimates the Riesz representer through a propensity model. The quadratic and entropy balancing variants instead estimate the Riesz representer by directly balancing covariate moments in each treatment arm against the target sample. Conceptually, they are alternatives for the same orthogonal AIPW score, not a different family of estimators outside aipyw.

py-econometrics/fastmatch

fastmatch is also fit directly below. I use the exact scikit backend because the FAISS backend segfaulted in this environment; scikit is exact and fast enough here.

Show code
def fit_fastmatch_variants(df, sample_name, target, outcome_col="y"):
    X, d, y = scaled_numpy(df, outcome_col=outcome_col)
    rows = []
    for k in [1, 5]:
        for bias_corr in [False, True]:
            label = f"fastmatch_scikit_ate_k{k}" + ("_ridge_bc" if bias_corr else "")
            mod = RidgeCV(alphas=np.logspace(-4, 4, 9)) if bias_corr else None
            est, se = Matching("ATE", k=k, backend="scikit", bias_corr_mod=mod).fit(y, d, X)
            rows.append({
                "sample": sample_name,
                "estimator": label,
                "estimate": est,
                "se": se,
                "experimental_target": target,
                "abs_error": abs(est - target),
            })
    return pd.DataFrame(rows)

fastmatch_estimates = pd.concat([
    fit_fastmatch_variants(obs_df, "Obs Untrimmed", exp_target_untrimmed),
    fit_fastmatch_variants(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(fastmatch_estimates.round(4))
sample estimator estimate se experimental_target abs_error
7 Obs Trimmed fastmatch_scikit_ate_k5_ridge_bc 0.2211 0.0646 0.2183 0.0028
6 Obs Trimmed fastmatch_scikit_ate_k5 0.2327 NaN 0.2183 0.0143
4 Obs Trimmed fastmatch_scikit_ate_k1 0.1872 NaN 0.2183 0.0311
5 Obs Trimmed fastmatch_scikit_ate_k1_ridge_bc 0.1443 0.0706 0.2183 0.0740
1 Obs Untrimmed fastmatch_scikit_ate_k1_ridge_bc 0.2536 0.0734 0.1555 0.0981
0 Obs Untrimmed fastmatch_scikit_ate_k1 0.3097 NaN 0.1555 0.1543
3 Obs Untrimmed fastmatch_scikit_ate_k5_ridge_bc 0.4033 0.0633 0.1555 0.2478
2 Obs Untrimmed fastmatch_scikit_ate_k5 0.4205 NaN 0.1555 0.2650

Combined in-document comparison

Show code
combined = pd.concat([core_estimates, aipyw_estimates, fastmatch_estimates], ignore_index=True)
combined = combined.sort_values(["sample", "abs_error"])
display(combined.round(4))
sample estimator estimate se experimental_target abs_error
16 Obs Trimmed fastmatch_scikit_ate_k5_ridge_bc 0.2211 0.0646 0.2183 0.0028
17 Obs Trimmed fastmatch_scikit_ate_k5 0.2327 NaN 0.2183 0.0143
18 Obs Trimmed fastmatch_scikit_ate_k1 0.1872 NaN 0.2183 0.0311
0 Obs Trimmed OM 0.2727 0.0011 0.2183 0.0544
10 Obs Trimmed aipyw_quadratic_balancing_riesz 0.2800 0.0011 0.2183 0.0617
11 Obs Trimmed aipyw_entropy_balancing_riesz 0.2800 0.0011 0.2183 0.0617
12 Obs Trimmed aipyw_ipw_riesz 0.2812 0.0578 0.2183 0.0629
1 Obs Trimmed Reg 0.2875 0.0572 0.2183 0.0692
19 Obs Trimmed fastmatch_scikit_ate_k1_ridge_bc 0.1443 0.0706 0.2183 0.0740
2 Obs Trimmed DR/AIPW 0.2928 0.0600 0.2183 0.0745
3 Obs Trimmed PSM 0.2982 0.0192 0.2183 0.0799
4 Obs Trimmed IPW 0.3343 0.0734 0.2183 0.1159
20 Obs Untrimmed fastmatch_scikit_ate_k1_ridge_bc 0.2536 0.0734 0.1555 0.0981
5 Obs Untrimmed PSM 0.2749 0.0148 0.1555 0.1194
13 Obs Untrimmed aipyw_entropy_balancing_riesz 0.2881 0.0010 0.1555 0.1326
14 Obs Untrimmed aipyw_quadratic_balancing_riesz 0.2881 0.0010 0.1555 0.1326
6 Obs Untrimmed OM 0.2925 0.0010 0.1555 0.1370
21 Obs Untrimmed fastmatch_scikit_ate_k1 0.3097 NaN 0.1555 0.1543
7 Obs Untrimmed Reg 0.3255 0.0540 0.1555 0.1700
15 Obs Untrimmed aipyw_ipw_riesz 0.3294 0.0669 0.1555 0.1740
22 Obs Untrimmed fastmatch_scikit_ate_k5_ridge_bc 0.4033 0.0633 0.1555 0.2478
23 Obs Untrimmed fastmatch_scikit_ate_k5 0.4205 NaN 0.1555 0.2650
8 Obs Untrimmed DR/AIPW 0.5671 0.1711 0.1555 0.4116
9 Obs Untrimmed IPW 0.7955 0.1885 0.1555 0.6400

Because everything in this table is fit inside the document, the exact numbers can differ from the paper’s cached AutoML/ensemble notebook outputs. That is expected. The memo’s qualitative lessons are the same: trimming matters; nuisance choice matters; matching is much more credible in the overlap sample than untrimmed; and AIPW/Riesz estimators are only as good as the measured confounding information and nuisance fits they are given.

Binary outcome: same specifications

The arXiv source emphasizes that the continuous-outcome success is not universal. To make that visible, this section reruns the same estimator specifications on Outcome_Binary: same covariates, same trimming rule, same LGBM nuisance specifications for the core estimators, same AIPyW Riesz choices, and same fastmatch variants. The outcome is binary, but the estimand is still a mean difference / risk difference.

Show code
# Experimental binary benchmarks, fit in-document with the same DR/AIPW code path.
exp_binary_target_untrimmed = estimate_core_methods(
    exp_df, "Exp Untrimmed", np.nan, outcome_col="yb"
).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
exp_binary_target_trimmed = estimate_core_methods(
    exp_trimmed_df, "Exp Trimmed", np.nan, outcome_col="yb"
).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]

binary_targets = pd.DataFrame([
    {"sample": "Exp Untrimmed", "estimate": exp_binary_target_untrimmed},
    {"sample": "Exp Trimmed", "estimate": exp_binary_target_trimmed},
])
display(binary_targets.round(4))

binary_core = pd.concat([
    estimate_core_methods(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    estimate_core_methods(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_core.round(4))
sample estimate
0 Exp Untrimmed -0.3734
1 Exp Trimmed -0.4162
sample estimator estimate se experimental_target abs_error
7 Obs Trimmed IPW -0.5321 0.0055 -0.4162 0.1159
8 Obs Trimmed PSM -0.5324 0.0009 -0.4162 0.1162
6 Obs Trimmed OM -0.5345 0.0001 -0.4162 0.1182
9 Obs Trimmed DR/AIPW -0.5375 0.0043 -0.4162 0.1213
5 Obs Trimmed Reg -0.5434 0.0036 -0.4162 0.1272
2 Obs Untrimmed IPW -0.4351 0.0167 -0.3734 0.0618
3 Obs Untrimmed PSM -0.4717 0.0008 -0.3734 0.0984
4 Obs Untrimmed DR/AIPW -0.4845 0.0102 -0.3734 0.1111
1 Obs Untrimmed OM -0.4976 0.0001 -0.3734 0.1243
0 Obs Untrimmed Reg -0.5338 0.0035 -0.3734 0.1605
Show code
binary_aipyw = pd.concat([
    fit_aipyw_variants(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    fit_aipyw_variants(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_aipyw.round(4))
sample estimator estimate se experimental_target abs_error
4 Obs Trimmed aipyw_quadratic_balancing_riesz -0.5346 0.0001 -0.4162 0.1184
5 Obs Trimmed aipyw_entropy_balancing_riesz -0.5346 0.0001 -0.4162 0.1184
3 Obs Trimmed aipyw_ipw_riesz -0.5348 0.0038 -0.4162 0.1186
0 Obs Untrimmed aipyw_ipw_riesz -0.4960 0.0048 -0.3734 0.1227
2 Obs Untrimmed aipyw_entropy_balancing_riesz -0.4970 0.0001 -0.3734 0.1236
1 Obs Untrimmed aipyw_quadratic_balancing_riesz -0.4970 0.0001 -0.3734 0.1236
Show code
binary_fastmatch = pd.concat([
    fit_fastmatch_variants(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    fit_fastmatch_variants(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_fastmatch.round(4))
sample estimator estimate se experimental_target abs_error
5 Obs Trimmed fastmatch_scikit_ate_k1_ridge_bc -0.5302 0.0048 -0.4162 0.1139
7 Obs Trimmed fastmatch_scikit_ate_k5_ridge_bc -0.5308 0.0042 -0.4162 0.1146
4 Obs Trimmed fastmatch_scikit_ate_k1 -0.5332 NaN -0.4162 0.1170
6 Obs Trimmed fastmatch_scikit_ate_k5 -0.5336 NaN -0.4162 0.1174
1 Obs Untrimmed fastmatch_scikit_ate_k1_ridge_bc -0.4761 0.0065 -0.3734 0.1028
0 Obs Untrimmed fastmatch_scikit_ate_k1 -0.4900 NaN -0.3734 0.1166
3 Obs Untrimmed fastmatch_scikit_ate_k5_ridge_bc -0.4938 0.0051 -0.3734 0.1204
2 Obs Untrimmed fastmatch_scikit_ate_k5 -0.5006 NaN -0.3734 0.1272
Show code
binary_combined = pd.concat([binary_core, binary_aipyw, binary_fastmatch], ignore_index=True)
binary_combined = binary_combined.sort_values(["sample", "abs_error"])
display(binary_combined.round(4))
sample estimator estimate se experimental_target abs_error
16 Obs Trimmed fastmatch_scikit_ate_k1_ridge_bc -0.5302 0.0048 -0.4162 0.1139
17 Obs Trimmed fastmatch_scikit_ate_k5_ridge_bc -0.5308 0.0042 -0.4162 0.1146
0 Obs Trimmed IPW -0.5321 0.0055 -0.4162 0.1159
1 Obs Trimmed PSM -0.5324 0.0009 -0.4162 0.1162
18 Obs Trimmed fastmatch_scikit_ate_k1 -0.5332 NaN -0.4162 0.1170
19 Obs Trimmed fastmatch_scikit_ate_k5 -0.5336 NaN -0.4162 0.1174
2 Obs Trimmed OM -0.5345 0.0001 -0.4162 0.1182
10 Obs Trimmed aipyw_quadratic_balancing_riesz -0.5346 0.0001 -0.4162 0.1184
11 Obs Trimmed aipyw_entropy_balancing_riesz -0.5346 0.0001 -0.4162 0.1184
12 Obs Trimmed aipyw_ipw_riesz -0.5348 0.0038 -0.4162 0.1186
3 Obs Trimmed DR/AIPW -0.5375 0.0043 -0.4162 0.1213
4 Obs Trimmed Reg -0.5434 0.0036 -0.4162 0.1272
5 Obs Untrimmed IPW -0.4351 0.0167 -0.3734 0.0618
6 Obs Untrimmed PSM -0.4717 0.0008 -0.3734 0.0984
20 Obs Untrimmed fastmatch_scikit_ate_k1_ridge_bc -0.4761 0.0065 -0.3734 0.1028
7 Obs Untrimmed DR/AIPW -0.4845 0.0102 -0.3734 0.1111
21 Obs Untrimmed fastmatch_scikit_ate_k1 -0.4900 NaN -0.3734 0.1166
22 Obs Untrimmed fastmatch_scikit_ate_k5_ridge_bc -0.4938 0.0051 -0.3734 0.1204
13 Obs Untrimmed aipyw_ipw_riesz -0.4960 0.0048 -0.3734 0.1227
14 Obs Untrimmed aipyw_entropy_balancing_riesz -0.4970 0.0001 -0.3734 0.1236
15 Obs Untrimmed aipyw_quadratic_balancing_riesz -0.4970 0.0001 -0.3734 0.1236
8 Obs Untrimmed OM -0.4976 0.0001 -0.3734 0.1243
23 Obs Untrimmed fastmatch_scikit_ate_k5 -0.5006 NaN -0.3734 0.1272
9 Obs Untrimmed Reg -0.5338 0.0035 -0.3734 0.1605

This is the binary-outcome analogue of the continuous table above. The substantive read is the paper’s negative result: applying the same machinery to the binary outcome does not make the observational estimates line up as cleanly with the experiment. Flexible ML and double robustness extract signal from observed covariates; they do not create unobserved confounders. If observed \(X\) predicts treatment but not the relevant potential-outcome surfaces, causal adjustment can still fail.

CATE and sensitivity analysis

For CATE, the paper uses DR-style pseudo-outcomes,

\[ Y_i^{DR}=\hat\mu_1(X_i)-\hat\mu_0(X_i)+(Y_i-\hat\mu_{D_i}(X_i))\frac{D_i-\hat e(X_i)}{\hat e(X_i)(1-\hat e(X_i))}, \]

then regresses \(Y_i^{DR}\) on covariates. The paper finds little useful heterogeneity for the continuous outcome and more evidence for binary-outcome heterogeneity, though the binary average effect remains biased in the observational sample.

For sensitivity analysis, the paper uses Chernozhukov-style omitted-confounder diagnostics. These ask how much an unobserved confounder would need to explain treatment and residual outcome variation to move the estimate. A key lesson is that robustness-to-zero is not the same as accuracy of magnitude.

Bottom line

  1. The dataset is a valuable validation design: randomized experiment plus simultaneous observational adoption sample.
  2. The data are not a LaLonde-style reuse of identical treated units.
  3. For the continuous outcome, observational methods can get close when overlap trimming, tuned nuisance models, cross-fitting, and ensembling are used.
  4. For the binary outcome, the same machinery fails, showing the limit of unconfoundedness when relevant confounders are missing.
  5. The balancing Riesz estimators belong under aipyw; they are alternative Riesz-representer choices inside the AIPW score.
  6. The code in this memo now directly fits each estimator in folded Quarto cells rather than reading precomputed estimator CSVs.