Reevaluating Causal Estimation Methods: data and replication sitrep

Author

Krabbs

Published

May 26, 2026

What I pulled

  • GitHub replication archive: https://github.com/microsoft/Reevaluating-Causal-Estimation-Methods
  • Local clone: /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
  • arXiv source: https://arxiv.org/abs/2601.11845
  • Local arXiv source TeX: /Users/alal/tmp/krabbs_the_koder/rcem_arxiv_2601_11845/source/paper_1225.tex

The arXiv source describes the paper as a product-release validation design: a randomized experiment and a simultaneous observational adoption sample for the same masked Windows feature.

What the two datasets are

The archive contains two parquet files with the same 44-column schema:

  1. FINAL_PUBLIC_experimental.parquet: randomized A/B experiment.
  2. FINAL_PUBLIC_observed.parquet: observational adoption sample in which users either had or had not manually opted into the feature.

Both measure the same treatment (New_Feature), two outcomes (Outcome_Binary, Outcome_Continuous), and 41 pre-treatment/device-use covariates: device specs, total/browser usage, engagement segment, anonymized/scrambled region, app cohort, manufacturer, and additional device specs.

The paper’s key design point is that this is not the classic LaLonde construction of identical experimental treated units paired with arbitrary observational controls. It is a paired product setting with both treated and untreated units in both samples: randomized assignment in the experimental sample, endogenous adoption in the observational sample.

   dataset            n  treated  controls  treated_share  Outcome_Continuous_diff_t_minus_c  Outcome_Binary_diff_t_minus_c
0  Experimental  435170    11475    423695         0.0264                             0.4326                        -0.3847
1  Observed      445286    20189    425097         0.0453                             1.2077                        -0.5320
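
As a minimal sketch of how a summary row like the ones above could be computed, the snippet below counts arms and takes a raw treated-minus-control difference in means. It assumes New_Feature is coded 0/1 and that the diff_t_minus_c columns are unadjusted mean differences; the helper name arm_summary and the toy rows are illustrative and are not taken from the archive.

```python
from statistics import fmean


def arm_summary(rows, treatment="New_Feature", outcome="Outcome_Continuous"):
    """Count arms and compute the raw treated-minus-control mean difference.

    Assumes the treatment column is coded 1 (treated) / 0 (control).
    """
    treated = [r[outcome] for r in rows if r[treatment] == 1]
    controls = [r[outcome] for r in rows if r[treatment] == 0]
    n = len(treated) + len(controls)
    return {
        "n": n,
        "treated": len(treated),
        "controls": len(controls),
        "treated_share": len(treated) / n,
        "diff_t_minus_c": fmean(treated) - fmean(controls),
    }


# Toy rows (made up for illustration, not from the released parquet files).
toy = [
    {"New_Feature": 1, "Outcome_Continuous": 3.0},
    {"New_Feature": 1, "Outcome_Continuous": 5.0},
    {"New_Feature": 0, "Outcome_Continuous": 2.0},
    {"New_Feature": 0, "Outcome_Continuous": 1.0},
    {"New_Feature": 0, "Outcome_Continuous": 3.0},
    {"New_Feature": 0, "Outcome_Continuous": 2.0},
]
summary = arm_summary(toy)

# The reported treated shares follow directly from the reported counts.
exp_share = round(11475 / 435170, 4)
obs_share = round(20189 / 445286, 4)
```

The two share values at the end reuse only the counts printed in the table, so they can be checked against the treated_share column without access to the data.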

LaLonde-structure check

Exact-row overlap between experimental and observational treated units is essentially zero, so the data do not mimic the LDW-CPS/PSID (LaLonde-Dehejia-Wahba) structure, in which the experimental treated units are reused and only the controls change.

   check                              exp_rows  obs_rows  unique_exp_hashes  unique_obs_hashes  intersection_hashes  intersection_over_exp_unique
0  full rows, treated only               11475     20189              11473              20189                    1                      0.000087
1  covariates+outcomes, treated only     11475     20189              11473              20189                    1                      0.000087
2  covariates only, treated only         11475     20189              11471              20189                    2                      0.000174
3  full rows, all units                 435170    445286             433980             444031                  505                      0.001164
4  covariates+outcomes, all units       435170    445286             433948             444027                  510                      0.001175
5  covariates only, all units           435170    445286             433717             443807                  518                      0.001194

Interpretation: the treated samples are different devices. Among treated units there is only 1 exact full-row hash match (out of 11,473 unique experimental hashes) and 2 covariates-only hash matches (out of 11,471 unique experimental hashes).
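
A check of this kind can be sketched as follows: hash each row over a chosen set of columns, then intersect the two sets of digests. This is an illustration under assumptions, not the code that produced the table; the function names, the SHA-256 choice, the separator, and the toy rows are all hypothetical.

```python
import hashlib


def row_hashes(rows, columns):
    """Return the set of SHA-256 digests of each row restricted to `columns`."""
    digests = set()
    for r in rows:
        # repr() keeps 1 and "1" distinct; \x1f is an unlikely field separator.
        key = "\x1f".join(repr(r[c]) for c in columns)
        digests.add(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return digests


def overlap(exp_rows, obs_rows, columns):
    """Summarize exact-row overlap between two samples on `columns`."""
    exp_h = row_hashes(exp_rows, columns)
    obs_h = row_hashes(obs_rows, columns)
    shared = exp_h & obs_h
    return {
        "exp_rows": len(exp_rows),
        "obs_rows": len(obs_rows),
        "unique_exp_hashes": len(exp_h),
        "unique_obs_hashes": len(obs_h),
        "intersection_hashes": len(shared),
        "intersection_over_exp_unique": len(shared) / len(exp_h),
    }


# Toy rows (made up): columns "a" and "b" stand in for covariates, "y" for an outcome.
exp = [
    {"a": 1, "b": "x", "y": 0.5},
    {"a": 1, "b": "x", "y": 0.5},  # duplicate row, collapses to one hash
    {"a": 2, "b": "y", "y": 1.0},
]
obs = [
    {"a": 2, "b": "y", "y": 2.0},  # same covariates as an exp row, different outcome
    {"a": 3, "b": "z", "y": 0.1},
]
full_rows = overlap(exp, obs, ["a", "b", "y"])
covariates_only = overlap(exp, obs, ["a", "b"])
```

Dropping columns from the hash can only merge rows, which is why the covariates-only checks in the table show fewer unique hashes and at least as many intersections as the full-row checks.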

Balance

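Covariates are close to balanced across arms in the experimental sample; in the observational sample, adoption is visibly selected. The table summarizes absolute standardized mean differences (SMDs) across covariates, and the last column counts covariates with an absolute SMD above 0.1.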

dataset         mean  median     max  n_abs_smd_gt_0.1
Experimental  0.0250  0.0218  0.1097                 1
Observed      0.0894  0.0391  0.5108                 9

The largest observational imbalances are exactly the kinds of variables one would expect to predict organic adoption: high engagement, low engagement, total device usage, browser usage, and device specs.
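
For reference, a balance summary of this shape can be sketched as below. The SMD definition used here (difference in means over the square root of the average of the two arm variances) is one common convention and is an assumption; the notebook's exact formula is not shown in this report. Function names, covariate names, and the toy rows are hypothetical.

```python
from math import sqrt
from statistics import fmean, median, variance


def abs_smd(treated, control):
    """Absolute standardized mean difference with a pooled-variance denominator."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    if pooled_sd == 0:
        return 0.0  # constant covariate in both arms: treat as balanced
    return abs(fmean(treated) - fmean(control)) / pooled_sd


def balance_summary(rows, covariates, treatment="New_Feature", threshold=0.1):
    """Per-covariate |SMD| plus mean, median, max, and count above `threshold`."""
    smds = {}
    for c in covariates:
        t = [r[c] for r in rows if r[treatment] == 1]
        k = [r[c] for r in rows if r[treatment] == 0]
        smds[c] = abs_smd(t, k)
    values = list(smds.values())
    stats = {
        "mean": fmean(values),
        "median": median(values),
        "max": max(values),
        "n_abs_smd_gt": sum(v > threshold for v in values),
    }
    return stats, smds


# Toy rows (made up): "usage" is imbalanced across arms, "ram" is not.
toy = [
    {"New_Feature": 1, "usage": 2.0, "ram": 1.0},
    {"New_Feature": 1, "usage": 4.0, "ram": 2.0},
    {"New_Feature": 1, "usage": 6.0, "ram": 3.0},
    {"New_Feature": 0, "usage": 1.0, "ram": 1.0},
    {"New_Feature": 0, "usage": 3.0, "ram": 2.0},
    {"New_Feature": 0, "usage": 5.0, "ram": 3.0},
]
stats, smds = balance_summary(toy, ["usage", "ram"])
```

The 0.1 threshold mirrors the n_abs_smd_gt_0.1 column; categorical covariates would need to be expanded into indicator columns before being passed in.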

End-to-end replication status

I created a clean Python 3.12 virtual environment in the repo and installed the pinned requirements.txt successfully. A Python 3.14 venv failed because pip tried to build heavy packages from source and the machine was low on disk; Python 3.12 used wheels and worked.

I then executed the provided notebook end-to-end with the repository’s default settings:

RERUN_ALL = False
RUN_PSM   = False
N_MODELS  = 10
TIME_BUDGET = 100

Execution command used:

cd /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
. .venv/bin/activate
jupyter nbconvert --to notebook --execute notebooks/01_main_results.ipynb \
  --output executed_main_results.ipynb \
  --ExecutePreprocessor.kernel_name=rcem-repl \
  --ExecutePreprocessor.timeout=1800

Result: the notebook completed successfully and wrote executed_main_results.ipynb.

Important caveats:

  • This is end-to-end for the archive’s default fast replication path, not a full from-scratch paper rerun.
  • RUN_PSM=False, so PSM estimates use the cached PSM artifacts in saved_outputs/; recomputing PSM requires R, rpy2, and the R Matching package.
  • RERUN_ALL=False, so the notebook loads cached hyperparameters/artifacts where provided. It still recomputed the experimental propensity model because prop_exp_FLAML_FINAL_LGBM.pkl is not shipped and load_or_compute(..., save=False) does not save it.
  • The notebook reports that R is unavailable and therefore uses cached PSM results, which is consistent with its default documented path.

So: yes, I can run the replication archive end-to-end in the intended default mode. A strict from-scratch rerun including PSM is not self-contained in Python alone; it needs R-side dependencies and considerably more runtime.
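
The caching behavior described in the caveats can be illustrated with a small load-or-compute helper. This is a hypothetical re-implementation written for this note, not the repository's load_or_compute; its signature, the rerun flag, the returned status string, and the pickle format are all assumptions.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute_sketch(path, compute, rerun=False, save=True):
    """Load a pickled artifact if present, otherwise compute it.

    Returns (result, status), where status is "loaded" or "computed".
    With save=False a computed result is never written, so every later
    call recomputes it.
    """
    path = Path(path)
    if path.exists() and not rerun:
        with path.open("rb") as fh:
            return pickle.load(fh), "loaded"
    result = compute()
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
    return result, "computed"


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "toy_artifact.pkl"  # hypothetical file name
    # Artifact not shipped and save=False: recomputed on every call.
    first = load_or_compute_sketch(artifact, lambda: {"fit": 1}, save=False)
    second = load_or_compute_sketch(artifact, lambda: {"fit": 1}, save=False)
    # With save=True the next call loads the cached value instead.
    third = load_or_compute_sketch(artifact, lambda: {"fit": 1}, save=True)
    fourth = load_or_compute_sketch(artifact, lambda: {"fit": 2})

statuses = [first[1], second[1], third[1], fourth[1]]
```

Under these assumptions, a missing artifact combined with save=False gives the pattern noted above for the experimental propensity model: it is refit on each run even though the rest of the notebook reads from cache.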

Main replicated continuous-outcome results

Experimental benchmark from cached artifact:

   sample           estimate  stderr   lower   upper
0  Exp Untrimmed      0.2046  0.0587  0.0896  0.3195
1  Exp Trimmed Off    0.1652  0.0826  0.0032  0.3272
2  Exp Trimmed        0.2099  0.0774  0.0583  0.3615

Trimmed observational estimates:

   Estimator  Estimate      SE  CI_Lower  CI_Upper
0  Reg          0.2823  0.0578    0.1689    0.3956
1  OM           0.2336  0.0569    0.1221    0.3452
2  IPW          0.2060  0.0712    0.0664    0.3456
3  PSM          0.3391  0.0571    0.2271    0.4510
4  DR           0.1989  0.0561    0.0890    0.3088

Untrimmed observational estimates:

   Estimator  Estimate      SE  CI_Lower  CI_Upper
0  Reg          0.3255  0.0545    0.2187    0.4323
1  OM           0.3311  0.0532    0.2268    0.4354
2  IPW          0.1917  0.0720    0.0505    0.3329
3  PSM          0.7433  0.0528    0.6399    0.8468
4  DR           0.2215  0.0502    0.1230    0.3199

The trimmed estimates line up with the paper’s stated message: after trimming and using the preferred nuisance setup, most methods are close to the experimental continuous-outcome benchmark (0.2099 for Exp Trimmed), with observational point estimates ranging from 0.1989 (DR) to 0.3391 (PSM). The untrimmed sample shows more dispersion, with PSM (0.7433) especially far from the experimental benchmark (0.2046 for Exp Untrimmed).
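
To make the comparison concrete, the sketch below recomputes each observational estimate's gap to the experimental benchmark from the numbers printed in the tables. Two assumptions are mine rather than the notebook's: trimmed estimates are compared against "Exp Trimmed" and untrimmed ones against "Exp Untrimmed", and intervals are taken as estimate ± 1.96 × SE. The coverage check treats the benchmark as a fixed point and ignores its own sampling uncertainty.

```python
# Point estimates and standard errors copied from the tables in this report.
BENCHMARK = {"Trimmed": 0.2099, "Untrimmed": 0.2046}
OBSERVATIONAL = {
    "Trimmed": {
        "Reg": (0.2823, 0.0578), "OM": (0.2336, 0.0569), "IPW": (0.2060, 0.0712),
        "PSM": (0.3391, 0.0571), "DR": (0.1989, 0.0561),
    },
    "Untrimmed": {
        "Reg": (0.3255, 0.0545), "OM": (0.3311, 0.0532), "IPW": (0.1917, 0.0720),
        "PSM": (0.7433, 0.0528), "DR": (0.2215, 0.0502),
    },
}


def normal_ci(estimate, se, z=1.96):
    """Two-sided normal-approximation interval (assumed 95% level)."""
    return estimate - z * se, estimate + z * se


def compare(sample):
    """Gap to the benchmark and whether the estimator's CI covers it."""
    bench = BENCHMARK[sample]
    out = {}
    for name, (est, se) in OBSERVATIONAL[sample].items():
        lower, upper = normal_ci(est, se)
        out[name] = {
            "gap": round(est - bench, 4),
            "covers_benchmark": lower <= bench <= upper,
        }
    return out


trimmed = compare("Trimmed")
untrimmed = compare("Untrimmed")
worst_untrimmed = max(untrimmed, key=lambda k: abs(untrimmed[k]["gap"]))
```

Under these assumptions the trimmed gaps are 0.0724 (Reg), 0.0237 (OM), -0.0039 (IPW), 0.1292 (PSM), and -0.0110 (DR), while the untrimmed PSM gap is 0.5387. The recomputed interval bounds can differ from the reported ones in the fourth decimal because the tables are rounded.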

Bottom line

  • The two released datasets are parallel product-release samples: one randomized, one observational/self-selected.
  • They are not a LaLonde-style reuse of the same treated units with a different observational control pool.
  • The observational sample has real selection into treatment, but the experimental and observational samples are designed to cover similar covariate regions.
  • The archive is runnable in its default mode from the released repo/data. Full from-scratch replication is only partially self-contained because PSM depends on external R tooling, and some expensive AutoML pieces are intentionally cached rather than rerun by default.