Microsoft observational/experimental product-release benchmark

Self-contained exposition of Young and Dillon’s data, estimators, and replication results

Author

Krabbs

Published

May 27, 2026

Show code

import sys
from pathlib import Path
import numpy as np
import pandas as pd
from IPython.display import display, Markdown

from sklearn.base import clone
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeCV
from sklearn.neighbors import NearestNeighbors
from lightgbm import LGBMClassifier, LGBMRegressor
import statsmodels.api as sm
import matplotlib.pyplot as plt

sys.path.append("/Users/alal/Desktop/code/aipyw")
sys.path.append("/Users/alal/tmp/krabbs_the_koder/fastmatch")
from aipyw import AIPyW
from fastmatch import Matching

ROOT = Path.cwd()
DATA = ROOT / "Data"
OUT = ROOT / "outputs"
OUT.mkdir(exist_ok=True)
pd.set_option("display.max_colwidth", 120)

One-paragraph summary

Young and Dillon’s paper, Reevaluating Causal Estimation Methods with Data from a Product Release, releases a rare validation dataset: a large randomized product experiment and a simultaneous observational adoption sample for the same masked Microsoft feature, measured with the same outcomes and covariates. The experimental sample gives a ground-truth benchmark; the observational sample mimics the empirical problem researchers usually face, where treatment adoption is endogenous. The paper’s central message is nuanced: for the continuous outcome, modern observational estimators can recover the experimental benchmark, but only after careful design choices—especially overlap trimming, well-tuned nuisance models, cross-fitting, and ensembling. For the binary outcome, the same best practices do not recover the benchmark, showing that flexible estimators cannot solve omitted-confounder problems when the measured covariates do not contain the right signal.

This memo is based on the local arXiv source at:

/Users/alal/tmp/krabbs_the_koder/rcem_arxiv_2601_11845/source/paper_1225.tex

and the replication archive at:

/Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods

All estimator tables below are computed inside this document. The code is folded, not replaced by precomputed estimator CSVs.

Load the public data

The public archive contains two parquet files with the same 44-column schema:

FINAL_PUBLIC_experimental.parquet: randomized A/B experiment.
FINAL_PUBLIC_observed.parquet: observational sample where users/devices endogenously adopted the feature.

The treatment is New_Feature. The two outcomes are Outcome_Continuous and Outcome_Binary. This memo focuses on the continuous outcome, because that is where the paper’s main positive result appears.

Show code

TREATMENT = "New_Feature"
OUTCOME = "Outcome_Continuous"
BINARY_OUTCOME = "Outcome_Binary"

COVARIATES = [
    "Device_Spec_A", "Total_Device_Usage",
    "Browser_1_Usage", "Browser_2_Usage", "Browser_3_Usage", "Browser_4_Usage", "Other_Browser_Usage",
    "EngagementSegment_General", "EngagementSegment_Highly Engaged", "EngagementSegment_Inactive", "EngagementSegment_Low Engaged",
    "A14Region_APAC", "A14Region_CEE", "A14Region_Canada", "A14Region_France", "A14Region_Germany",
    "A14Region_Greater China", "A14Region_India", "A14Region_Japan", "A14Region_Korea", "A14Region_Latam",
    "A14Region_MEA", "A14Region_UK", "A14Region_United States", "A14Region_Western Europe",
    "AppCategoryCohort_Developer", "AppCategoryCohort_Gamer", "AppCategoryCohort_General", "AppCategoryCohort_Media", "AppCategoryCohort_Productivity",
    "Manufacturer_1", "Manufacturer_2", "Manufacturer_3", "Manufacturer_4", "Manufacturer_5", "Manufacturer_6",
    "Device_Spec_B_1", "Device_Spec_B_2", "Device_Spec_C_1", "Device_Spec_C_2", "Device_Spec_C_3",
]


def prep_public(df):
    out = df[[OUTCOME, BINARY_OUTCOME, TREATMENT] + COVARIATES].copy()
    out = out.rename(columns={OUTCOME: "y", BINARY_OUTCOME: "yb", TREATMENT: "D"})
    out["D"] = out["D"].astype(int)
    return out.reset_index(drop=True)

obs_raw = pd.read_parquet(DATA / "FINAL_PUBLIC_observed.parquet")
exp_raw = pd.read_parquet(DATA / "FINAL_PUBLIC_experimental.parquet")
obs_df = prep_public(obs_raw)
exp_df = prep_public(exp_raw)

summary_rows = []
for name, df in [("Experimental", exp_df), ("Observed", obs_df)]:
    y1, y0 = df.loc[df.D == 1, "y"], df.loc[df.D == 0, "y"]
    b1, b0 = df.loc[df.D == 1, "yb"], df.loc[df.D == 0, "yb"]
    summary_rows.append({
        "dataset": name,
        "n": len(df),
        "treated": int(df.D.sum()),
        "controls": int((1 - df.D).sum()),
        "treated_share": df.D.mean(),
        "Outcome_Continuous_mean_treated": y1.mean(),
        "Outcome_Continuous_mean_control": y0.mean(),
        "Outcome_Continuous_diff_t_minus_c": y1.mean() - y0.mean(),
        "Outcome_Binary_mean_treated": b1.mean(),
        "Outcome_Binary_mean_control": b0.mean(),
        "Outcome_Binary_diff_t_minus_c": b1.mean() - b0.mean(),
    })
summary = pd.DataFrame(summary_rows)
display(summary.round(4))

	dataset	n	treated	controls	treated_share	Outcome_Continuous_mean_treated	Outcome_Continuous_mean_control	Outcome_Continuous_diff_t_minus_c	Outcome_Binary_mean_treated	Outcome_Binary_mean_control	Outcome_Binary_diff_t_minus_c
0	Experimental	435170	11475	423695	0.0264	4.4378	4.0052	0.4326	0.6010	0.9858	-0.3847
1	Observed	445286	20189	425097	0.0453	5.2691	4.0614	1.2077	0.4537	0.9858	-0.5320

The naive observational difference in means has the correct qualitative sign for the continuous outcome but differs meaningfully from the experimental benchmark. The whole point of the paper is to ask whether design-conscious observational estimators can remove that selection bias.

Why this is not a LaLonde clone

The design differs from the classic LaLonde/Dehejia-Wahba exercise. In those benchmarks, the experimental treated units are often reused and paired with nonexperimental controls from a different source such as CPS or PSID. Here, the experimental and observational samples are parallel product-release samples. There are treated and untreated units in both samples, and the treated observational units are not simply the experimental treated units reused.

Show code

import pandas.util


def row_hashes(df, cols):
    return set(pd.util.hash_pandas_object(df[cols], index=False).astype("uint64"))

checks = []
for label, cols in [
    ("full rows, treated only", ["y", "yb", "D"] + COVARIATES),
    ("covariates+outcomes, treated only", ["y", "yb"] + COVARIATES),
    ("covariates only, treated only", COVARIATES),
    ("full rows, all units", ["y", "yb", "D"] + COVARIATES),
    ("covariates+outcomes, all units", ["y", "yb"] + COVARIATES),
    ("covariates only, all units", COVARIATES),
]:
    ex = exp_df.loc[exp_df.D == 1] if "treated only" in label else exp_df
    ob = obs_df.loc[obs_df.D == 1] if "treated only" in label else obs_df
    hx, ho = row_hashes(ex, cols), row_hashes(ob, cols)
    checks.append({
        "check": label,
        "exp_rows": len(ex),
        "obs_rows": len(ob),
        "unique_exp_hashes": len(hx),
        "unique_obs_hashes": len(ho),
        "intersection_hashes": len(hx & ho),
        "intersection_over_exp_unique": len(hx & ho) / len(hx),
    })
overlap = pd.DataFrame(checks)
display(overlap.round(6))

	check	exp_rows	obs_rows	unique_exp_hashes	unique_obs_hashes	intersection_hashes	intersection_over_exp_unique
0	full rows, treated only	11475	20189	11473	20189	1	0.000087
1	covariates+outcomes, treated only	11475	20189	11473	20189	1	0.000087
2	covariates only, treated only	11475	20189	11471	20189	2	0.000174
3	full rows, all units	435170	445286	433980	444031	505	0.001164
4	covariates+outcomes, all units	435170	445286	433948	444027	510	0.001175
5	covariates only, all units	435170	445286	433717	443807	518	0.001194

Exact row overlap between the experimental and observational treated units is essentially zero.

Balance and selection

Randomization in the experimental sample should make treatment close to independent of covariates. Endogenous adoption in the observational sample should not. Standardized mean differences show that pattern.

Show code

def smd_table(df, label):
    rows = []
    for c in COVARIATES:
        x1 = df.loc[df.D == 1, c].astype(float)
        x0 = df.loc[df.D == 0, c].astype(float)
        pooled = np.sqrt((x1.var() + x0.var()) / 2)
        smd = np.nan if pooled == 0 else (x1.mean() - x0.mean()) / pooled
        rows.append({"dataset": label, "covariate": c, "smd": smd, "abs_smd": abs(smd)})
    return pd.DataFrame(rows)

balance = pd.concat([smd_table(exp_df, "Experimental"), smd_table(obs_df, "Observed")], ignore_index=True)
bal_summary = balance.groupby("dataset")["abs_smd"].agg(["mean", "median", "max"])
bal_summary["n_abs_smd_gt_0.1"] = balance.assign(big=balance.abs_smd > 0.1).groupby("dataset")["big"].sum()
display(bal_summary.round(4))

top_covars = (
    balance.loc[balance.dataset.eq("Observed")]
    .sort_values("abs_smd", ascending=False)
    .head(25)["covariate"]
    .tolist()
)
plot_df = balance.loc[balance.covariate.isin(top_covars)].copy()
plot_df["covariate"] = pd.Categorical(plot_df["covariate"], categories=list(reversed(top_covars)), ordered=True)

fig, ax = plt.subplots(figsize=(9, 8))
for dataset, marker in [("Experimental", "o"), ("Observed", "s")]:
    sub = plot_df.loc[plot_df.dataset.eq(dataset)].sort_values("covariate")
    ax.scatter(sub["abs_smd"], sub["covariate"], label=dataset, marker=marker, s=55)
ax.axvline(0.1, color="tab:red", linestyle="--", linewidth=1, label="0.1 rule of thumb")
ax.set_xlabel("Absolute standardized mean difference")
ax.set_ylabel("")
ax.set_title("Top 25 observational covariate imbalances")
ax.grid(axis="x", alpha=0.25)
ax.legend(loc="lower right")
fig.tight_layout()
fig.savefig(OUT / "balance_smd_top25_doc.png", dpi=180, bbox_inches="tight")
plt.close(fig)

display(balance.sort_values("abs_smd", ascending=False).head(12).round(4))

	mean	median	max	n_abs_smd_gt_0.1
dataset
Experimental	0.0250	0.0218	0.1097	1
Observed	0.0894	0.0391	0.5108	9

	dataset	covariate	smd	abs_smd
49	Observed	EngagementSegment_Highly Engaged	0.5108	0.5108
51	Observed	EngagementSegment_Low Engaged	-0.4875	0.4875
42	Observed	Total_Device_Usage	0.4116	0.4116
80	Observed	Device_Spec_C_2	0.3307	0.3307
44	Observed	Browser_2_Usage	0.2196	0.2196
41	Observed	Device_Spec_A	-0.1557	0.1557
43	Observed	Browser_1_Usage	0.1556	0.1556
78	Observed	Device_Spec_B_2	-0.1436	0.1436
75	Observed	Manufacturer_5	-0.1285	0.1285
10	Experimental	EngagementSegment_Low Engaged	-0.1097	0.1097
70	Observed	AppCategoryCohort_Productivity	-0.0970	0.0970
8	Experimental	EngagementSegment_Highly Engaged	0.0934	0.0934

The largest observational imbalances are in variables that plausibly predict voluntary feature adoption: engagement intensity, total usage, browser usage, and device specs.

Estimands and assumptions

Let \(D_i \in \{0,1\}\) indicate feature adoption and \(Y_i(1),Y_i(0)\) denote potential outcomes. The observed outcome is

\[ Y_i = D_iY_i(1)+(1-D_i)Y_i(0). \]

The main product-release analysis targets the average treatment effect (ATE):

\[ \tau_{ATE}=\mathbb{E}[Y_i(1)-Y_i(0)]. \]

The observational estimators rely on unconfoundedness,

\[ (Y_i(1),Y_i(0))\perp D_i\mid X_i, \]

and overlap,

\[ 0 < e(X_i) < 1, \qquad e(X_i)=\mathbb{P}(D_i=1\mid X_i). \]

The experimental sample supplies a validation benchmark. The observational sample tells us whether methods based on measured \(X_i\) can recover that benchmark.

Nuisance models and trimming

Most estimators below use the outcome nuisance functions

\[ \mu_d(x)=\mathbb{E}[Y_i\mid D_i=d,X_i=x] \]

and/or the propensity score

\[ e(x)=\mathbb{P}(D_i=1\mid X_i=x). \]

The paper’s preferred workflow is: tune flexible nuisance learners, cross-fit them, ensemble across fits, and trim to common support. The code below uses a lighter but explicit in-document version: LightGBM nuisance fits with fixed hyperparameters and 3-fold cross-fitting.

Crump-style trimming keeps observations satisfying

\[ \alpha\leq \hat e(X_i)\leq 1-\alpha. \]

Show code

def make_X(df):
    X = df[COVARIATES].copy()
    for col in X.columns:
        if X[col].dtype == bool:
            X[col] = X[col].astype(np.float32)
    return X.astype(np.float32)


def lgbm_propensity_model(seed=1):
    return LGBMClassifier(
        n_estimators=60, learning_rate=0.05, num_leaves=31,
        min_child_samples=200, subsample=0.85, colsample_bytree=0.85,
        random_state=seed, n_jobs=-1, verbose=-1,
    )


def lgbm_outcome_model(seed=1):
    return LGBMRegressor(
        n_estimators=60, learning_rate=0.05, num_leaves=31,
        min_child_samples=200, subsample=0.85, colsample_bytree=0.85,
        random_state=seed, n_jobs=-1, verbose=-1,
    )


def fit_propensity_full(df):
    model = lgbm_propensity_model(seed=11)
    X = make_X(df)
    model.fit(X, df["D"].to_numpy())
    return model, model.predict_proba(X)[:, 1]


def crump_alpha(e):
    e = np.asarray(e, dtype=float)
    v = 1.0 / (e * (1 - e))
    grid = np.linspace(0.0, 0.2, 1001)
    # Direct grid minimization of the asymptotic variance proxy over symmetric cutoffs.
    vals = []
    for a in grid:
        m = (e >= a) & (e <= 1 - a)
        vals.append(np.inf if m.sum() == 0 else v[m].mean() / m.mean())
    return float(grid[int(np.argmin(vals))])

trim_propensity_model, obs_propensity_for_trim = fit_propensity_full(obs_df)
trim_alpha = crump_alpha(obs_propensity_for_trim)
obs_trim_mask = (obs_propensity_for_trim >= trim_alpha) & (obs_propensity_for_trim <= 1 - trim_alpha)
obs_trimmed_df = obs_df.loc[obs_trim_mask].reset_index(drop=True)

# Apply the same observational support rule to experimental covariates for the trimmed benchmark.
exp_propensity_for_support = trim_propensity_model.predict_proba(make_X(exp_df))[:, 1]
exp_trim_mask = (exp_propensity_for_support >= trim_alpha) & (exp_propensity_for_support <= 1 - trim_alpha)
exp_trimmed_df = exp_df.loc[exp_trim_mask].reset_index(drop=True)

Markdown(f"Estimated in-document trim threshold: `{trim_alpha:.4f}`; retained `{len(obs_trimmed_df):,}` of `{len(obs_df):,}` observational rows (`{len(obs_trimmed_df)/len(obs_df):.1%}`) and `{len(exp_trimmed_df):,}` of `{len(exp_df):,}` experimental rows (`{len(exp_trimmed_df)/len(exp_df):.1%}`).")

Estimated in-document trim threshold: 0.0254; retained 312,504 of 445,286 observational rows (70.2%) and 301,195 of 435,170 experimental rows (69.2%).

Estimator code

This section defines and runs the estimators directly in the document.

Cross-fitted nuisance functions

Show code

def crossfit_nuisance(df, outcome_col="y", n_splits=2, seed=123):
    # Fast, deterministic linear nuisance specification used for all in-document estimates.
    # The same specification is applied to continuous and binary outcomes.
    X_raw = make_X(df)
    X_np = StandardScaler().fit_transform(X_raw).astype(float)
    X = pd.DataFrame(X_np, columns=COVARIATES)
    d = df["D"].to_numpy(dtype=int)
    y = df[outcome_col].to_numpy(dtype=float)
    e_hat = np.zeros(len(df))
    mu0_hat = np.zeros(len(df))
    mu1_hat = np.zeros(len(df))

    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train, test in kf.split(X):
        X_train, X_test = X.iloc[train], X.iloc[test]
        d_train, y_train = d[train], y[train]

        # Linear probability propensity nuisance: fast and applied identically to both outcomes.
        prop = LinearRegression()
        prop.fit(X_train, d_train)
        e_hat[test] = prop.predict(X_test)

        mu0 = LinearRegression()
        mu1 = LinearRegression()
        mu0.fit(X_train[d_train == 0], y_train[d_train == 0])
        mu1.fit(X_train[d_train == 1], y_train[d_train == 1])
        mu0_hat[test] = mu0.predict(X_test)
        mu1_hat[test] = mu1.predict(X_test)

    eps = 1e-3
    e_hat = np.clip(e_hat, eps, 1 - eps)
    return y, d, X, e_hat, mu0_hat, mu1_hat

Regression, OM, IPW, PSM, and DR/AIPW

Show code

def estimate_core_methods(df, sample_name, target, outcome_col="y"):
    y, d, X, e_hat, mu0_hat, mu1_hat = crossfit_nuisance(df, outcome_col=outcome_col)
    rows = []

    # 1. Regression adjustment: Y ~ D + X.
    X_reg = np.column_stack([d, X.to_numpy(dtype=float)])
    reg = LinearRegression().fit(X_reg, y)
    reg_pred = reg.predict(X_reg)
    reg_score = reg.coef_[0] + (y - reg_pred) * (d - d.mean()) / np.mean((d - d.mean()) ** 2)
    rows.append({"sample": sample_name, "estimator": "Reg", "estimate": reg.coef_[0], "se": reg_score.std(ddof=1) / np.sqrt(len(df))})

    # 2. Outcome modeling / Oaxaca-Blinder.
    om_score = mu1_hat - mu0_hat
    rows.append({"sample": sample_name, "estimator": "OM", "estimate": om_score.mean(), "se": om_score.std(ddof=1) / np.sqrt(len(df))})

    # 3. IPW.
    ipw_score = d * y / e_hat - (1 - d) * y / (1 - e_hat)
    rows.append({"sample": sample_name, "estimator": "IPW", "estimate": ipw_score.mean(), "se": ipw_score.std(ddof=1) / np.sqrt(len(df))})

    # 4. Propensity-score matching, ATE form, k=1. Fast exact 1D nearest-neighbor search on e_hat.
    treated = np.where(d == 1)[0]
    control = np.where(d == 0)[0]

    def nearest_values(query, ref_score, ref_y):
        order = np.argsort(ref_score)
        rs = ref_score[order]
        ry = ref_y[order]
        pos = np.searchsorted(rs, query)
        left = np.clip(pos - 1, 0, len(rs) - 1)
        right = np.clip(pos, 0, len(rs) - 1)
        choose_right = np.abs(rs[right] - query) < np.abs(rs[left] - query)
        idx = np.where(choose_right, right, left)
        return ry[idx]

    y0_t = nearest_values(e_hat[treated], e_hat[control], y[control])
    y1_c = nearest_values(e_hat[control], e_hat[treated], y[treated])
    psm_scores = np.r_[y[treated] - y0_t, y1_c - y[control]]
    rows.append({"sample": sample_name, "estimator": "PSM", "estimate": psm_scores.mean(), "se": psm_scores.std(ddof=1) / np.sqrt(len(psm_scores))})

    # 5. Doubly robust / AIPW.
    mu_d = np.where(d == 1, mu1_hat, mu0_hat)
    dr_score = (mu1_hat - mu0_hat) + (d / e_hat - (1 - d) / (1 - e_hat)) * (y - mu_d)
    rows.append({"sample": sample_name, "estimator": "DR/AIPW", "estimate": dr_score.mean(), "se": dr_score.std(ddof=1) / np.sqrt(len(df))})

    out = pd.DataFrame(rows)
    out["experimental_target"] = target
    out["abs_error"] = (out["estimate"] - target).abs()
    return out

# Experimental DR benchmarks are fit with the same code path, not read from a CSV.
exp_target_untrimmed = estimate_core_methods(exp_df, "Exp Untrimmed", np.nan).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
exp_target_trimmed = estimate_core_methods(exp_trimmed_df, "Exp Trimmed", np.nan).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
experimental_targets = pd.DataFrame([
    {"sample": "Exp Untrimmed", "estimate": exp_target_untrimmed},
    {"sample": "Exp Trimmed", "estimate": exp_target_trimmed},
])
display(experimental_targets.round(4))

core_untrimmed = estimate_core_methods(obs_df, "Obs Untrimmed", exp_target_untrimmed)
core_trimmed = estimate_core_methods(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed)
core_estimates = pd.concat([core_untrimmed, core_trimmed], ignore_index=True).sort_values(["sample", "abs_error"])
display(core_estimates.round(4))

	sample	estimate
0	Exp Untrimmed	0.1555
1	Exp Trimmed	0.2183

	sample	estimator	estimate	se	experimental_target	abs_error
6	Obs Trimmed	OM	0.2727	0.0011	0.2183	0.0544
5	Obs Trimmed	Reg	0.2875	0.0572	0.2183	0.0692
9	Obs Trimmed	DR/AIPW	0.2928	0.0600	0.2183	0.0745
8	Obs Trimmed	PSM	0.2982	0.0192	0.2183	0.0799
7	Obs Trimmed	IPW	0.3343	0.0734	0.2183	0.1159
3	Obs Untrimmed	PSM	0.2749	0.0148	0.1555	0.1194
1	Obs Untrimmed	OM	0.2925	0.0010	0.1555	0.1370
0	Obs Untrimmed	Reg	0.3255	0.0540	0.1555	0.1700
4	Obs Untrimmed	DR/AIPW	0.5671	0.1711	0.1555	0.4116
2	Obs Untrimmed	IPW	0.7955	0.1885	0.1555	0.6400

The code above is deliberately explicit: each estimator is fit in this document. It is not reading outputs/trimmed_estimates.csv or outputs/untrimmed_estimates.csv.

AIPyW estimators, including balancing Riesz variants

The balancing Riesz estimators are not separate from aipyw; they are part of the same AIPyW surface through riesz_method="balancing" and bal_obj. I now report them together with the ordinary AIPyW IPW-Riesz version.

Show code

def scaled_numpy(df, outcome_col="y"):
    X = make_X(df)
    Xs = StandardScaler().fit_transform(X).astype(np.float32)
    d = df["D"].to_numpy(dtype=int)
    y = df[outcome_col].to_numpy(dtype=float)
    return Xs, d, y


def fit_aipyw_variants(df, sample_name, target, outcome_col="y"):
    X, d, y = scaled_numpy(df, outcome_col=outcome_col)
    variants = [
        ("aipyw_ipw_riesz", dict(riesz_method="ipw", bal_obj="quadratic")),
        ("aipyw_quadratic_balancing_riesz", dict(riesz_method="balancing", bal_obj="quadratic")),
        ("aipyw_entropy_balancing_riesz", dict(riesz_method="balancing", bal_obj="entropy")),
    ]
    rows = []
    for label, kwargs in variants:
        fit = AIPyW(
            propensity_model=LogisticRegression(max_iter=1000),
            outcome_model=LinearRegression(),
            n_splits=2,
            **kwargs,
        ).fit(X, d, y)
        psi = fit.influence_functions[:, 1] - fit.influence_functions[:, 0]
        rows.append({
            "sample": sample_name,
            "estimator": label,
            "estimate": psi.mean(),
            "se": psi.std(ddof=1) / np.sqrt(len(psi)),
            "experimental_target": target,
            "abs_error": abs(psi.mean() - target),
        })
    return pd.DataFrame(rows)

aipyw_estimates = pd.concat([
    fit_aipyw_variants(obs_df, "Obs Untrimmed", exp_target_untrimmed),
    fit_aipyw_variants(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(aipyw_estimates.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
4	Obs Trimmed	aipyw_quadratic_balancing_riesz	0.2800	0.0011	0.2183	0.0617
5	Obs Trimmed	aipyw_entropy_balancing_riesz	0.2800	0.0011	0.2183	0.0617
3	Obs Trimmed	aipyw_ipw_riesz	0.2812	0.0578	0.2183	0.0629
2	Obs Untrimmed	aipyw_entropy_balancing_riesz	0.2881	0.0010	0.1555	0.1326
1	Obs Untrimmed	aipyw_quadratic_balancing_riesz	0.2881	0.0010	0.1555	0.1326
0	Obs Untrimmed	aipyw_ipw_riesz	0.3294	0.0669	0.1555	0.1740

The ordinary IPW-Riesz variant estimates the Riesz representer through a propensity model. The quadratic and entropy balancing variants instead estimate the Riesz representer by directly balancing covariate moments in each treatment arm against the target sample. Conceptually, they are alternatives for the same orthogonal AIPW score, not a different family of estimators outside aipyw.

`py-econometrics/fastmatch`

fastmatch is also fit directly below. I use the exact scikit backend because the FAISS backend segfaulted in this environment; scikit is exact and fast enough here.

Show code

def fit_fastmatch_variants(df, sample_name, target, outcome_col="y"):
    X, d, y = scaled_numpy(df, outcome_col=outcome_col)
    rows = []
    for k in [1, 5]:
        for bias_corr in [False, True]:
            label = f"fastmatch_scikit_ate_k{k}" + ("_ridge_bc" if bias_corr else "")
            mod = RidgeCV(alphas=np.logspace(-4, 4, 9)) if bias_corr else None
            est, se = Matching("ATE", k=k, backend="scikit", bias_corr_mod=mod).fit(y, d, X)
            rows.append({
                "sample": sample_name,
                "estimator": label,
                "estimate": est,
                "se": se,
                "experimental_target": target,
                "abs_error": abs(est - target),
            })
    return pd.DataFrame(rows)

fastmatch_estimates = pd.concat([
    fit_fastmatch_variants(obs_df, "Obs Untrimmed", exp_target_untrimmed),
    fit_fastmatch_variants(obs_trimmed_df, "Obs Trimmed", exp_target_trimmed),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(fastmatch_estimates.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
7	Obs Trimmed	fastmatch_scikit_ate_k5_ridge_bc	0.2211	0.0646	0.2183	0.0028
6	Obs Trimmed	fastmatch_scikit_ate_k5	0.2327	NaN	0.2183	0.0143
4	Obs Trimmed	fastmatch_scikit_ate_k1	0.1872	NaN	0.2183	0.0311
5	Obs Trimmed	fastmatch_scikit_ate_k1_ridge_bc	0.1443	0.0706	0.2183	0.0740
1	Obs Untrimmed	fastmatch_scikit_ate_k1_ridge_bc	0.2536	0.0734	0.1555	0.0981
0	Obs Untrimmed	fastmatch_scikit_ate_k1	0.3097	NaN	0.1555	0.1543
3	Obs Untrimmed	fastmatch_scikit_ate_k5_ridge_bc	0.4033	0.0633	0.1555	0.2478
2	Obs Untrimmed	fastmatch_scikit_ate_k5	0.4205	NaN	0.1555	0.2650

Combined in-document comparison

Show code

combined = pd.concat([core_estimates, aipyw_estimates, fastmatch_estimates], ignore_index=True)
combined = combined.sort_values(["sample", "abs_error"])
display(combined.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
16	Obs Trimmed	fastmatch_scikit_ate_k5_ridge_bc	0.2211	0.0646	0.2183	0.0028
17	Obs Trimmed	fastmatch_scikit_ate_k5	0.2327	NaN	0.2183	0.0143
18	Obs Trimmed	fastmatch_scikit_ate_k1	0.1872	NaN	0.2183	0.0311
0	Obs Trimmed	OM	0.2727	0.0011	0.2183	0.0544
10	Obs Trimmed	aipyw_quadratic_balancing_riesz	0.2800	0.0011	0.2183	0.0617
11	Obs Trimmed	aipyw_entropy_balancing_riesz	0.2800	0.0011	0.2183	0.0617
12	Obs Trimmed	aipyw_ipw_riesz	0.2812	0.0578	0.2183	0.0629
1	Obs Trimmed	Reg	0.2875	0.0572	0.2183	0.0692
19	Obs Trimmed	fastmatch_scikit_ate_k1_ridge_bc	0.1443	0.0706	0.2183	0.0740
2	Obs Trimmed	DR/AIPW	0.2928	0.0600	0.2183	0.0745
3	Obs Trimmed	PSM	0.2982	0.0192	0.2183	0.0799
4	Obs Trimmed	IPW	0.3343	0.0734	0.2183	0.1159
20	Obs Untrimmed	fastmatch_scikit_ate_k1_ridge_bc	0.2536	0.0734	0.1555	0.0981
5	Obs Untrimmed	PSM	0.2749	0.0148	0.1555	0.1194
13	Obs Untrimmed	aipyw_entropy_balancing_riesz	0.2881	0.0010	0.1555	0.1326
14	Obs Untrimmed	aipyw_quadratic_balancing_riesz	0.2881	0.0010	0.1555	0.1326
6	Obs Untrimmed	OM	0.2925	0.0010	0.1555	0.1370
21	Obs Untrimmed	fastmatch_scikit_ate_k1	0.3097	NaN	0.1555	0.1543
7	Obs Untrimmed	Reg	0.3255	0.0540	0.1555	0.1700
15	Obs Untrimmed	aipyw_ipw_riesz	0.3294	0.0669	0.1555	0.1740
22	Obs Untrimmed	fastmatch_scikit_ate_k5_ridge_bc	0.4033	0.0633	0.1555	0.2478
23	Obs Untrimmed	fastmatch_scikit_ate_k5	0.4205	NaN	0.1555	0.2650
8	Obs Untrimmed	DR/AIPW	0.5671	0.1711	0.1555	0.4116
9	Obs Untrimmed	IPW	0.7955	0.1885	0.1555	0.6400

Because everything in this table is fit inside the document, the exact numbers can differ from the paper’s cached AutoML/ensemble notebook outputs. That is expected. The memo’s qualitative lessons are the same: trimming matters; nuisance choice matters; matching is much more credible in the overlap sample than untrimmed; and AIPW/Riesz estimators are only as good as the measured confounding information and nuisance fits they are given.

Binary outcome: same specifications

The arXiv source emphasizes that the continuous-outcome success is not universal. To make that visible, this section reruns the same estimator specifications on Outcome_Binary: same covariates, same trimming rule, same LGBM nuisance specifications for the core estimators, same AIPyW Riesz choices, and same fastmatch variants. The outcome is binary, but the estimand is still a mean difference / risk difference.

Show code

# Experimental binary benchmarks, fit in-document with the same DR/AIPW code path.
exp_binary_target_untrimmed = estimate_core_methods(
    exp_df, "Exp Untrimmed", np.nan, outcome_col="yb"
).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]
exp_binary_target_trimmed = estimate_core_methods(
    exp_trimmed_df, "Exp Trimmed", np.nan, outcome_col="yb"
).query("estimator == 'DR/AIPW'")["estimate"].iloc[0]

binary_targets = pd.DataFrame([
    {"sample": "Exp Untrimmed", "estimate": exp_binary_target_untrimmed},
    {"sample": "Exp Trimmed", "estimate": exp_binary_target_trimmed},
])
display(binary_targets.round(4))

binary_core = pd.concat([
    estimate_core_methods(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    estimate_core_methods(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_core.round(4))

	sample	estimate
0	Exp Untrimmed	-0.3734
1	Exp Trimmed	-0.4162

	sample	estimator	estimate	se	experimental_target	abs_error
7	Obs Trimmed	IPW	-0.5321	0.0055	-0.4162	0.1159
8	Obs Trimmed	PSM	-0.5324	0.0009	-0.4162	0.1162
6	Obs Trimmed	OM	-0.5345	0.0001	-0.4162	0.1182
9	Obs Trimmed	DR/AIPW	-0.5375	0.0043	-0.4162	0.1213
5	Obs Trimmed	Reg	-0.5434	0.0036	-0.4162	0.1272
2	Obs Untrimmed	IPW	-0.4351	0.0167	-0.3734	0.0618
3	Obs Untrimmed	PSM	-0.4717	0.0008	-0.3734	0.0984
4	Obs Untrimmed	DR/AIPW	-0.4845	0.0102	-0.3734	0.1111
1	Obs Untrimmed	OM	-0.4976	0.0001	-0.3734	0.1243
0	Obs Untrimmed	Reg	-0.5338	0.0035	-0.3734	0.1605

Show code

binary_aipyw = pd.concat([
    fit_aipyw_variants(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    fit_aipyw_variants(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_aipyw.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
4	Obs Trimmed	aipyw_quadratic_balancing_riesz	-0.5346	0.0001	-0.4162	0.1184
5	Obs Trimmed	aipyw_entropy_balancing_riesz	-0.5346	0.0001	-0.4162	0.1184
3	Obs Trimmed	aipyw_ipw_riesz	-0.5348	0.0038	-0.4162	0.1186
0	Obs Untrimmed	aipyw_ipw_riesz	-0.4960	0.0048	-0.3734	0.1227
2	Obs Untrimmed	aipyw_entropy_balancing_riesz	-0.4970	0.0001	-0.3734	0.1236
1	Obs Untrimmed	aipyw_quadratic_balancing_riesz	-0.4970	0.0001	-0.3734	0.1236

Show code

binary_fastmatch = pd.concat([
    fit_fastmatch_variants(obs_df, "Obs Untrimmed", exp_binary_target_untrimmed, outcome_col="yb"),
    fit_fastmatch_variants(obs_trimmed_df, "Obs Trimmed", exp_binary_target_trimmed, outcome_col="yb"),
], ignore_index=True).sort_values(["sample", "abs_error"])
display(binary_fastmatch.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
5	Obs Trimmed	fastmatch_scikit_ate_k1_ridge_bc	-0.5302	0.0048	-0.4162	0.1139
7	Obs Trimmed	fastmatch_scikit_ate_k5_ridge_bc	-0.5308	0.0042	-0.4162	0.1146
4	Obs Trimmed	fastmatch_scikit_ate_k1	-0.5332	NaN	-0.4162	0.1170
6	Obs Trimmed	fastmatch_scikit_ate_k5	-0.5336	NaN	-0.4162	0.1174
1	Obs Untrimmed	fastmatch_scikit_ate_k1_ridge_bc	-0.4761	0.0065	-0.3734	0.1028
0	Obs Untrimmed	fastmatch_scikit_ate_k1	-0.4900	NaN	-0.3734	0.1166
3	Obs Untrimmed	fastmatch_scikit_ate_k5_ridge_bc	-0.4938	0.0051	-0.3734	0.1204
2	Obs Untrimmed	fastmatch_scikit_ate_k5	-0.5006	NaN	-0.3734	0.1272

Show code

binary_combined = pd.concat([binary_core, binary_aipyw, binary_fastmatch], ignore_index=True)
binary_combined = binary_combined.sort_values(["sample", "abs_error"])
display(binary_combined.round(4))

	sample	estimator	estimate	se	experimental_target	abs_error
16	Obs Trimmed	fastmatch_scikit_ate_k1_ridge_bc	-0.5302	0.0048	-0.4162	0.1139
17	Obs Trimmed	fastmatch_scikit_ate_k5_ridge_bc	-0.5308	0.0042	-0.4162	0.1146
0	Obs Trimmed	IPW	-0.5321	0.0055	-0.4162	0.1159
1	Obs Trimmed	PSM	-0.5324	0.0009	-0.4162	0.1162
18	Obs Trimmed	fastmatch_scikit_ate_k1	-0.5332	NaN	-0.4162	0.1170
19	Obs Trimmed	fastmatch_scikit_ate_k5	-0.5336	NaN	-0.4162	0.1174
2	Obs Trimmed	OM	-0.5345	0.0001	-0.4162	0.1182
10	Obs Trimmed	aipyw_quadratic_balancing_riesz	-0.5346	0.0001	-0.4162	0.1184
11	Obs Trimmed	aipyw_entropy_balancing_riesz	-0.5346	0.0001	-0.4162	0.1184
12	Obs Trimmed	aipyw_ipw_riesz	-0.5348	0.0038	-0.4162	0.1186
3	Obs Trimmed	DR/AIPW	-0.5375	0.0043	-0.4162	0.1213
4	Obs Trimmed	Reg	-0.5434	0.0036	-0.4162	0.1272
5	Obs Untrimmed	IPW	-0.4351	0.0167	-0.3734	0.0618
6	Obs Untrimmed	PSM	-0.4717	0.0008	-0.3734	0.0984
20	Obs Untrimmed	fastmatch_scikit_ate_k1_ridge_bc	-0.4761	0.0065	-0.3734	0.1028
7	Obs Untrimmed	DR/AIPW	-0.4845	0.0102	-0.3734	0.1111
21	Obs Untrimmed	fastmatch_scikit_ate_k1	-0.4900	NaN	-0.3734	0.1166
22	Obs Untrimmed	fastmatch_scikit_ate_k5_ridge_bc	-0.4938	0.0051	-0.3734	0.1204
13	Obs Untrimmed	aipyw_ipw_riesz	-0.4960	0.0048	-0.3734	0.1227
14	Obs Untrimmed	aipyw_entropy_balancing_riesz	-0.4970	0.0001	-0.3734	0.1236
15	Obs Untrimmed	aipyw_quadratic_balancing_riesz	-0.4970	0.0001	-0.3734	0.1236
8	Obs Untrimmed	OM	-0.4976	0.0001	-0.3734	0.1243
23	Obs Untrimmed	fastmatch_scikit_ate_k5	-0.5006	NaN	-0.3734	0.1272
9	Obs Untrimmed	Reg	-0.5338	0.0035	-0.3734	0.1605

This is the binary-outcome analogue of the continuous table above. The substantive read is the paper’s negative result: applying the same machinery to the binary outcome does not make the observational estimates line up as cleanly with the experiment. Flexible ML and double robustness extract signal from observed covariates; they do not create unobserved confounders. If observed \(X\) predicts treatment but not the relevant potential-outcome surfaces, causal adjustment can still fail.

CATE and sensitivity analysis

For CATE, the paper uses DR-style pseudo-outcomes,

\[ Y_i^{DR}=\hat\mu_1(X_i)-\hat\mu_0(X_i)+(Y_i-\hat\mu_{D_i}(X_i))\frac{D_i-\hat e(X_i)}{\hat e(X_i)(1-\hat e(X_i))}, \]

then regresses \(Y_i^{DR}\) on covariates. The paper finds little useful heterogeneity for the continuous outcome and more evidence for binary-outcome heterogeneity, though the binary average effect remains biased in the observational sample.

For sensitivity analysis, the paper uses Chernozhukov-style omitted-confounder diagnostics. These ask how much an unobserved confounder would need to explain treatment and residual outcome variation to move the estimate. A key lesson is that robustness-to-zero is not the same as accuracy of magnitude.

Bottom line

The dataset is a valuable validation design: randomized experiment plus simultaneous observational adoption sample.
The data are not a LaLonde-style reuse of identical treated units.
For the continuous outcome, observational methods can get close when overlap trimming, tuned nuisance models, cross-fitting, and ensembling are used.
For the binary outcome, the same machinery fails, showing the limit of unconfoundedness when relevant confounders are missing.
The balancing Riesz estimators belong under aipyw; they are alternative Riesz-representer choices inside the AIPW score.
The code in this memo now directly fits each estimator in folded Quarto cells rather than reading precomputed estimator CSVs.