Reevaluating Causal Estimation Methods: data and replication sitrep

Author

Krabbs

Published

May 26, 2026

What I pulled

GitHub replication archive: https://github.com/microsoft/Reevaluating-Causal-Estimation-Methods
Local clone: /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
arXiv source: https://arxiv.org/abs/2601.11845
Local arXiv source TeX: /Users/alal/tmp/krabbs_the_koder/rcem_arxiv_2601_11845/source/paper_1225.tex

The arXiv source describes the paper’s design as a product-release validation: a randomized experiment and a simultaneous observational adoption sample for the same masked Windows feature.

What the two datasets are

The archive contains two parquet files with the same 44-column schema:

FINAL_PUBLIC_experimental.parquet: randomized A/B experiment.
FINAL_PUBLIC_observed.parquet: observational adoption sample in which each user either had or had not already manually opted into the feature.

Both measure the same treatment (New_Feature), two outcomes (Outcome_Binary, Outcome_Continuous), and 41 pre-treatment/device-use covariates: device specs, total/browser usage, engagement segment, anonymized/scrambled region, app cohort, manufacturer, and additional device specs.
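
The column split described above (1 treatment + 2 outcomes + 41 covariates = 44) can be checked with a small helper. This is a minimal sketch, not code from the archive: the function name check_shared_schema is hypothetical, it compares column names only (not dtypes), and the data directory in the usage comment is a placeholder because the archive layout is not listed here.

```python
import pandas as pd

TREATMENT = "New_Feature"
OUTCOMES = ["Outcome_Binary", "Outcome_Continuous"]


def check_shared_schema(exp: pd.DataFrame, obs: pd.DataFrame) -> list:
    """Verify both frames share one column layout and return the covariate names."""
    if list(exp.columns) != list(obs.columns):
        raise ValueError("experimental and observed frames have different columns")
    missing = [c for c in [TREATMENT, *OUTCOMES] if c not in exp.columns]
    if missing:
        raise ValueError(f"missing expected columns: {missing}")
    # Everything that is not the treatment or an outcome is counted as a covariate.
    reserved = {TREATMENT, *OUTCOMES}
    return [c for c in exp.columns if c not in reserved]


# Usage (DATA_DIR is a placeholder, not a path taken from the archive):
# exp = pd.read_parquet(f"{DATA_DIR}/FINAL_PUBLIC_experimental.parquet")
# obs = pd.read_parquet(f"{DATA_DIR}/FINAL_PUBLIC_observed.parquet")
# covariates = check_shared_schema(exp, obs)
# print(len(exp.columns), len(covariates))
```

If the description above holds, the last line should report 44 columns and 41 covariates.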

The paper’s key design point is that this is not the classic LaLonde construction of identical experimental treated units paired with arbitrary observational controls. It is a paired product setting with both treated and untreated units in both samples: randomized assignment in the experimental sample, endogenous adoption in the observational sample.

	dataset	n	treated	controls	treated_share	Outcome_Continuous_diff_t_minus_c	Outcome_Binary_diff_t_minus_c
0	Experimental	435170	11475	423695	0.0264	0.4326	-0.3847
1	Observed	445286	20189	425097	0.0453	1.2077	-0.5320
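
A table like the one above can be produced with a few lines of pandas. This is a sketch under one assumption that is not confirmed here: New_Feature is coded 0/1. The helper name summarize_sample is hypothetical. The differences are raw treated-minus-control means with no adjustment, so in the observational sample they mix any treatment effect with selection.

```python
import pandas as pd


def summarize_sample(df: pd.DataFrame, name: str, treatment: str = "New_Feature") -> dict:
    """Counts and raw treated-minus-control mean differences for one sample."""
    treated = df[df[treatment] == 1]   # assumes 1 = has the feature
    controls = df[df[treatment] == 0]  # assumes 0 = does not
    row = {
        "dataset": name,
        "n": len(df),
        "treated": len(treated),
        "controls": len(controls),
        "treated_share": len(treated) / len(df),
    }
    for outcome in ("Outcome_Continuous", "Outcome_Binary"):
        row[f"{outcome}_diff_t_minus_c"] = treated[outcome].mean() - controls[outcome].mean()
    return row


# Usage, given the two loaded frames `exp` and `obs`:
# table = pd.DataFrame([summarize_sample(exp, "Experimental"),
#                       summarize_sample(obs, "Observed")])
```

As a sanity check on the reported shares: 11475 / 435170 ≈ 0.0264 and 20189 / 445286 ≈ 0.0453.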

LaLonde-structure check

Exact-row overlap between experimental and observational treated units is essentially zero, so the data do not mimic the LDW-CPS/PSID structure where experimental treated units are reused and only the controls change.

	check	exp_rows	obs_rows	unique_exp_hashes	unique_obs_hashes	intersection_hashes	intersection_over_exp_unique
0	full rows, treated only	11475	20189	11473	20189	1	0.000087
1	covariates+outcomes, treated only	11475	20189	11473	20189	1	0.000087
2	covariates only, treated only	11475	20189	11471	20189	2	0.000174
3	full rows, all units	435170	445286	433980	444031	505	0.001164
4	covariates+outcomes, all units	435170	445286	433948	444027	510	0.001175
5	covariates only, all units	435170	445286	433717	443807	518	0.001194

Interpretation: the treated samples are different devices. Only 1 of the 11,473 unique experimental treated full-row hashes also appears among the observational treated units, and only 2 of the 11,471 unique covariates-only hashes do.
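
The overlap check can be sketched with pandas row hashing. This illustrates the idea and is not necessarily how the table above was produced: hash_overlap is a hypothetical name, and the sketch assumes both frames use the same dtypes for the hashed columns, since the integer 1 and the float 1.0 hash differently.

```python
import pandas as pd
from pandas.util import hash_pandas_object


def hash_overlap(exp: pd.DataFrame, obs: pd.DataFrame, columns: list) -> dict:
    """Hash each row over `columns` and count hashes that occur in both samples."""
    # index=False so that only the cell values, not the row labels, enter the hash.
    exp_hashes = set(hash_pandas_object(exp[columns], index=False).tolist())
    obs_hashes = set(hash_pandas_object(obs[columns], index=False).tolist())
    shared = exp_hashes & obs_hashes
    return {
        "exp_rows": len(exp),
        "obs_rows": len(obs),
        "unique_exp_hashes": len(exp_hashes),
        "unique_obs_hashes": len(obs_hashes),
        "intersection_hashes": len(shared),
        "intersection_over_exp_unique": len(shared) / len(exp_hashes),
    }


# Usage for the "covariates only, treated only" row, given a covariate name list:
# hash_overlap(exp[exp["New_Feature"] == 1], obs[obs["New_Feature"] == 1], covariates)
```

Comparing sets of unique hashes rather than raw rows is why the unique counts can sit slightly below the row counts: duplicate rows within a sample collapse to one hash.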

Balance

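Covariates are close to balanced under experimental assignment, while observational adoption is clearly selected. The table summarizes absolute standardized mean differences (SMDs) across covariates.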

dataset	mean_abs_smd	median_abs_smd	max_abs_smd	n_abs_smd_gt_0.1
Experimental	0.0250	0.0218	0.1097	1
Observed	0.0894	0.0391	0.5108	9

The largest observational imbalances are exactly the kind of variables one would expect to predict organic adoption: high engagement, low engagement, total device usage, browser usage, and device specs.
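
For reference, one common definition of the absolute standardized mean difference divides the treated-minus-control mean gap by the square root of the average of the two group variances. The sketch below uses that definition as an assumption; the exact variant behind the table above is not stated here. The names abs_smd and balance_summary are hypothetical, and categorical covariates would need to be one-hot encoded first.

```python
import numpy as np


def abs_smd(x, treated) -> float:
    """Absolute SMD: |mean_t - mean_c| / sqrt((var_t + var_c) / 2)."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(treated).astype(bool)
    x_t, x_c = x[t], x[~t]
    diff = abs(x_t.mean() - x_c.mean())
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    if pooled_sd == 0:
        # Constant within both groups: balanced if the means agree, else unbounded.
        return 0.0 if diff == 0 else float("inf")
    return float(diff / pooled_sd)


def balance_summary(covariates: dict, treated) -> dict:
    """Summarize absolute SMDs over a mapping of covariate name -> numeric values."""
    smds = np.array([abs_smd(values, treated) for values in covariates.values()])
    return {
        "mean": float(smds.mean()),
        "median": float(np.median(smds)),
        "max": float(smds.max()),
        "n_abs_smd_gt_0.1": int((smds > 0.1).sum()),
    }
```

The 0.1 cutoff matches the threshold named in the table's last column.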

End-to-end replication status

I created a clean Python 3.12 virtual environment in the repo and installed the pinned requirements.txt successfully. A Python 3.14 venv failed because pip tried to build heavy packages from source and the machine was low on disk; Python 3.12 used wheels and worked.

I then executed the provided notebook end-to-end with the repository’s default settings:

RERUN_ALL = False
RUN_PSM   = False
N_MODELS  = 10
TIME_BUDGET = 100

Execution command used:

cd /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
. .venv/bin/activate
jupyter nbconvert --to notebook --execute notebooks/01_main_results.ipynb \
  --output executed_main_results.ipynb \
  --ExecutePreprocessor.kernel_name=rcem-repl \
  --ExecutePreprocessor.timeout=1800

Result: the notebook completed successfully and wrote executed_main_results.ipynb.

Important caveats:

This is end-to-end for the archive’s default fast replication path, not a full from-scratch paper rerun.
RUN_PSM=False, so PSM estimates use the cached PSM artifacts in saved_outputs/; recomputing PSM requires R, rpy2, and the R Matching package.
RERUN_ALL=False, so the notebook loads cached hyperparameters/artifacts where provided. It still recomputed the experimental propensity model because prop_exp_FLAML_FINAL_LGBM.pkl is not shipped and load_or_compute(..., save=False) does not save it.
The notebook reports that R is unavailable and therefore uses cached PSM results, which is consistent with its default documented path.
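
The caching behavior in the caveats above follows a familiar load-or-compute pattern. The sketch below is illustrative only and is not the repository's load_or_compute: the name load_or_compute_sketch, its signature, and the use of pickle are all assumptions. It shows why an artifact that is neither shipped nor saved gets recomputed on every run.

```python
import pickle
from pathlib import Path
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_compute_sketch(path: Path, compute: Callable[[], T], save: bool = True) -> T:
    """Illustrative cache helper; NOT the repository's implementation."""
    if path.exists():
        # Cached artifact present: skip the expensive step entirely.
        with path.open("rb") as fh:
            return pickle.load(fh)
    result = compute()
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
    # With save=False nothing is written, so the next call recomputes.
    return result
```

Under this pattern, a missing file combined with save=False means the compute step runs on every execution, which matches the recomputation of the experimental propensity model described above.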

So: yes, I can run the replication archive end-to-end in the intended default mode. A strict from-scratch rerun including PSM is not self-contained on Python alone; it needs R-side dependencies and considerably more runtime.
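
Before setting RUN_PSM=True, it may help to check the R-side prerequisites from Python. This is a hedged sketch using only the standard library: the function names are hypothetical, and the notebook's own R detection may work differently.

```python
import importlib.util
import shutil
import subprocess


def r_matching_installed(timeout_s: int = 120) -> bool:
    """True if Rscript is on PATH and the R package 'Matching' can be loaded."""
    rscript = shutil.which("Rscript")
    if rscript is None:
        return False
    # Exit status 0 when requireNamespace succeeds, 1 otherwise.
    expr = 'quit(status = as.integer(!requireNamespace("Matching", quietly = TRUE)))'
    try:
        proc = subprocess.run([rscript, "-e", expr], capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0


def psm_rerun_prerequisites() -> dict:
    """Report which of the three PSM dependencies named above are available."""
    return {
        "Rscript_on_path": shutil.which("Rscript") is not None,
        "rpy2_importable": importlib.util.find_spec("rpy2") is not None,
        "r_matching_installed": r_matching_installed(),
    }


if __name__ == "__main__":
    for name, ok in psm_rerun_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

All three entries would need to be ok before a PSM rerun is worth attempting. Even then, the longer runtime noted above still applies.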

Main replicated continuous-outcome results

Experimental benchmark from cached artifact:

	sample	estimate	stderr	lower	upper
0	Exp Untrimmed	0.2046	0.0587	0.0896	0.3195
1	Exp Trimmed Off	0.1652	0.0826	0.0032	0.3272
2	Exp Trimmed	0.2099	0.0774	0.0583	0.3615

Trimmed observational estimates:

	Estimator	Estimate	SE	CI_Lower	CI_Upper
0	Reg	0.2823	0.0578	0.1689	0.3956
1	OM	0.2336	0.0569	0.1221	0.3452
2	IPW	0.2060	0.0712	0.0664	0.3456
3	PSM	0.3391	0.0571	0.2271	0.4510
4	DR	0.1989	0.0561	0.0890	0.3088

Untrimmed observational estimates:

	Estimator	Estimate	SE	CI_Lower	CI_Upper
0	Reg	0.3255	0.0545	0.2187	0.4323
1	OM	0.3311	0.0532	0.2268	0.4354
2	IPW	0.1917	0.0720	0.0505	0.3329
3	PSM	0.7433	0.0528	0.6399	0.8468
4	DR	0.2215	0.0502	0.1230	0.3199

The trimmed estimates line up with the paper’s stated message: after trimming and using the preferred nuisance setup, the observational estimates range from 0.1989 (DR) to 0.3391 (PSM), against an experimental benchmark of about 0.21. The untrimmed sample shows more dispersion (0.1917 to 0.7433), with PSM at 0.7433 especially far from the experimental benchmark.
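
The comparison can be made concrete with a few lines of arithmetic on the numbers in the tables above. Two assumptions are mine rather than the notebook's: the intervals are rebuilt as estimate ± 1.96 × SE, and each observational sample is paired with the like-named experimental row ("Exp Trimmed" with trimmed, "Exp Untrimmed" with untrimmed). The check asks only whether each observational point estimate lies inside the benchmark interval, so it ignores the observational estimates' own uncertainty.

```python
Z_95 = 1.96  # assumed normal critical value

# (estimate, stderr) for the experimental benchmark, copied from the tables above.
BENCHMARK = {"trimmed": (0.2099, 0.0774), "untrimmed": (0.2046, 0.0587)}

# Observational point estimates, copied from the tables above.
OBSERVATIONAL = {
    "trimmed": {"Reg": 0.2823, "OM": 0.2336, "IPW": 0.2060, "PSM": 0.3391, "DR": 0.1989},
    "untrimmed": {"Reg": 0.3255, "OM": 0.3311, "IPW": 0.1917, "PSM": 0.7433, "DR": 0.2215},
}


def normal_ci(estimate: float, stderr: float, z: float = Z_95) -> tuple:
    """Symmetric normal-approximation interval."""
    return estimate - z * stderr, estimate + z * stderr


def outside_benchmark_ci(sample: str) -> list:
    """Estimators whose point estimate falls outside the benchmark interval."""
    lower, upper = normal_ci(*BENCHMARK[sample])
    return [name for name, value in OBSERVATIONAL[sample].items()
            if not lower <= value <= upper]


print(outside_benchmark_ci("trimmed"))    # []
print(outside_benchmark_ci("untrimmed"))  # ['Reg', 'OM', 'PSM']
```

Under these assumptions every trimmed estimate sits inside the trimmed benchmark interval. In the untrimmed sample, Reg and OM fall just above the upper bound and PSM falls well above it.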

Bottom line

The two released datasets are parallel product-release samples: one randomized, one observational/self-selected.
They are not a LaLonde-style reuse of the same treated units with a different observational control pool.
The observational sample has real selection into treatment, but the experimental and observational samples are designed to cover similar covariate regions.
The archive is runnable in its default mode from the released repo/data. Full from-scratch replication is only partially self-contained because PSM depends on external R tooling, and some expensive AutoML pieces are intentionally cached rather than rerun by default.