| | dataset | n | treated | controls | treated_share | Outcome_Continuous_diff_t_minus_c | Outcome_Binary_diff_t_minus_c |
|---|---|---|---|---|---|---|---|
| 0 | Experimental | 435170 | 11475 | 423695 | 0.0264 | 0.4326 | -0.3847 |
| 1 | Observed | 445286 | 20189 | 425097 | 0.0453 | 1.2077 | -0.5320 |
Reevaluating Causal Estimation Methods: data and replication sitrep
What I pulled
- GitHub replication archive: https://github.com/microsoft/Reevaluating-Causal-Estimation-Methods
- Local clone: /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
- arXiv source: https://arxiv.org/abs/2601.11845
- Local arXiv source TeX: /Users/alal/tmp/krabbs_the_koder/rcem_arxiv_2601_11845/source/paper_1225.tex
The arXiv source describes the paper as a product-release validation design: a randomized experiment and a simultaneous observational adoption sample for the same masked Windows feature.
What the two datasets are
The archive contains two parquet files with the same 44-column schema:
- FINAL_PUBLIC_experimental.parquet: randomized A/B experiment.
- FINAL_PUBLIC_observed.parquet: observational adoption sample where users had already manually opted into the feature or had not.
Both measure the same treatment (New_Feature), two outcomes (Outcome_Binary, Outcome_Continuous), and 41 pre-treatment/device-use covariates: device specs, total/browser usage, engagement segment, anonymized/scrambled region, app cohort, manufacturer, and additional device specs.
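The per-sample counts and raw treated-minus-control differences reported in the summary table can be sketched as follows. This is a minimal illustration on toy lists, not the code used for this report: the function name `summarize_sample` and the toy values are assumptions, and in practice the three columns (New_Feature, Outcome_Continuous, Outcome_Binary) would be read from the parquet files, for example with pandas.read_parquet.

```python
from statistics import fmean


def summarize_sample(treatment, outcome_continuous, outcome_binary):
    """Return counts, treated share, and raw treated-minus-control mean differences."""
    treated_idx = [i for i, t in enumerate(treatment) if t == 1]
    control_idx = [i for i, t in enumerate(treatment) if t == 0]

    def diff(values):
        # Unadjusted difference in means: treated minus control.
        treated_mean = fmean([values[i] for i in treated_idx])
        control_mean = fmean([values[i] for i in control_idx])
        return treated_mean - control_mean

    n = len(treatment)
    return {
        "n": n,
        "treated": len(treated_idx),
        "controls": len(control_idx),
        "treated_share": len(treated_idx) / n,
        "Outcome_Continuous_diff_t_minus_c": diff(outcome_continuous),
        "Outcome_Binary_diff_t_minus_c": diff(outcome_binary),
    }


# Toy data standing in for the three released columns (values are made up).
toy = summarize_sample(
    treatment=[1, 1, 0, 0, 0],
    outcome_continuous=[3.0, 5.0, 1.0, 2.0, 3.0],
    outcome_binary=[1, 0, 1, 1, 0],
)
print(toy)
```

These differences are raw contrasts with no covariate adjustment, so in the observational sample they mix any treatment effect with selection into adoption.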
The paper’s key design point is that this is not the classic LaLonde construction of identical experimental treated units paired with arbitrary observational controls. It is a paired product setting with both treated and untreated units in both samples: randomized assignment in the experimental sample, endogenous adoption in the observational sample.
LaLonde-structure check
Exact-row overlap between experimental and observational treated units is essentially zero, so the data do not mimic the LDW-CPS/PSID structure where experimental treated units are reused and only the controls change.
| | check | exp_rows | obs_rows | unique_exp_hashes | unique_obs_hashes | intersection_hashes | intersection_over_exp_unique |
|---|---|---|---|---|---|---|---|
| 0 | full rows, treated only | 11475 | 20189 | 11473 | 20189 | 1 | 0.000087 |
| 1 | covariates+outcomes, treated only | 11475 | 20189 | 11473 | 20189 | 1 | 0.000087 |
| 2 | covariates only, treated only | 11475 | 20189 | 11471 | 20189 | 2 | 0.000174 |
| 3 | full rows, all units | 435170 | 445286 | 433980 | 444031 | 505 | 0.001164 |
| 4 | covariates+outcomes, all units | 435170 | 445286 | 433948 | 444027 | 510 | 0.001175 |
| 5 | covariates only, all units | 435170 | 445286 | 433717 | 443807 | 518 | 0.001194 |
Interpretation: the treated samples are different devices. Only 1 full-row treated hash and 2 covariates-only treated hashes appear in both samples, out of 11,473 and 11,471 unique experimental treated hashes respectively.
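A row-hash overlap check of this kind can be sketched as below. This is an assumption-laden illustration rather than the exact procedure behind the table: the helper names `row_hashes` and `overlap_report`, the choice of SHA-256, and the toy rows are all hypothetical. The idea is to hash each row's values for a chosen column set, then compare the sets of unique hashes across the two samples.

```python
import hashlib


def row_hashes(rows, columns):
    """Hash each row's values for the chosen columns into a set of hex digests."""
    hashes = set()
    for row in rows:
        # repr() keeps 1 and "1" distinct; the unit separator avoids ambiguous joins.
        key = "\x1f".join(repr(row[c]) for c in columns)
        hashes.add(hashlib.sha256(key.encode("utf-8")).hexdigest())
    return hashes


def overlap_report(exp_rows, obs_rows, columns):
    """Count unique row hashes per sample and how many are shared."""
    exp_h = row_hashes(exp_rows, columns)
    obs_h = row_hashes(obs_rows, columns)
    shared = exp_h & obs_h
    return {
        "exp_rows": len(exp_rows),
        "obs_rows": len(obs_rows),
        "unique_exp_hashes": len(exp_h),
        "unique_obs_hashes": len(obs_h),
        "intersection_hashes": len(shared),
        "intersection_over_exp_unique": len(shared) / len(exp_h) if exp_h else float("nan"),
    }


# Toy rows: the experimental sample has one duplicate, and one row is shared.
exp_toy = [{"x": 1, "y": 2.0}, {"x": 1, "y": 2.0}, {"x": 3, "y": 4.0}]
obs_toy = [{"x": 3, "y": 4.0}, {"x": 5, "y": 6.0}]
report = overlap_report(exp_toy, obs_toy, columns=["x", "y"])
print(report)
```

Restricting `columns` to covariates only, or to covariates plus outcomes, gives the other rows of the table; filtering the rows to treated units first gives the treated-only checks.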
Balance
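The experimental sample is close to balanced on covariates; adoption in the observational sample is clearly selected. The table summarizes absolute standardized mean differences (SMDs) across covariates by dataset, with n_abs_smd_gt_0.1 counting covariates whose absolute SMD exceeds 0.1.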
| dataset | mean | median | max | n_abs_smd_gt_0.1 |
|---|---|---|---|---|
| Experimental | 0.0250 | 0.0218 | 0.1097 | 1 |
| Observed | 0.0894 | 0.0391 | 0.5108 | 9 |
The largest observational imbalances are exactly the kind of variables one would expect to predict organic adoption: high engagement, low engagement, total device usage, browser usage, and device specs.
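For reference, a balance summary of this shape can be sketched as follows. The SMD variant used here (difference in means divided by the square root of the average of the two group variances) is an assumption, since the report does not state which definition was used; the function names and toy covariates are also hypothetical.

```python
from math import sqrt
from statistics import fmean, median, variance


def abs_smd(treated_values, control_values):
    """Absolute standardized mean difference with an equally weighted pooled SD."""
    mean_diff = fmean(treated_values) - fmean(control_values)
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    if pooled_sd == 0:
        # Degenerate case: no spread in either group.
        return 0.0 if mean_diff == 0 else float("inf")
    return abs(mean_diff) / pooled_sd


def balance_summary(covariates, treatment, threshold=0.1):
    """covariates: dict of name -> list of values; treatment: list of 0/1 flags."""
    smds = {}
    for name, values in covariates.items():
        treated = [v for v, d in zip(values, treatment) if d == 1]
        control = [v for v, d in zip(values, treatment) if d == 0]
        smds[name] = abs_smd(treated, control)
    vals = list(smds.values())
    return {
        "mean": fmean(vals),
        "median": median(vals),
        "max": max(vals),
        "n_abs_smd_gt_0.1": sum(v > threshold for v in vals),
    }


# Toy example: "usage" differs by one pooled SD, "spec" is perfectly balanced.
toy_treatment = [1, 1, 1, 0, 0, 0]
toy_covariates = {"usage": [2, 3, 4, 1, 2, 3], "spec": [1, 2, 3, 1, 2, 3]}
summary = balance_summary(toy_covariates, toy_treatment)
print(summary)
```

The 0.1 threshold matches the n_abs_smd_gt_0.1 column name; it is a common rule of thumb rather than a formal test.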
End-to-end replication status
I created a clean Python 3.12 virtual environment in the repo and installed the pinned requirements.txt successfully. A Python 3.14 venv failed because pip tried to build heavy packages from source and the machine was low on disk; Python 3.12 used wheels and worked.
I then executed the provided notebook end-to-end with the repository’s default settings:
RERUN_ALL = False
RUN_PSM = False
N_MODELS = 10
TIME_BUDGET = 100
Execution command used:
cd /Users/alal/tmp/krabbs_the_koder/reevaluating_causal_methods
. .venv/bin/activate
jupyter nbconvert --to notebook --execute notebooks/01_main_results.ipynb \
--output executed_main_results.ipynb \
--ExecutePreprocessor.kernel_name=rcem-repl \
--ExecutePreprocessor.timeout=1800
Result: the notebook completed successfully and wrote executed_main_results.ipynb.
Important caveats:
- This is end-to-end for the archive’s default fast replication path, not a full from-scratch paper rerun.
- RUN_PSM=False, so PSM estimates use the cached PSM artifacts in saved_outputs/; recomputing PSM requires R, rpy2, and the R Matching package.
- RERUN_ALL=False, so the notebook loads cached hyperparameters/artifacts where provided. It still recomputed the experimental propensity model because prop_exp_FLAML_FINAL_LGBM.pkl is not shipped and load_or_compute(..., save=False) does not save it.
- The notebook reports that R is unavailable and therefore uses cached PSM results, which is consistent with its default documented path.
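The recompute-on-every-run behavior described in the caveats follows from the usual cache-or-compute pattern. The sketch below is a hypothetical stand-in, not the repository's load_or_compute: the name `load_or_compute_sketch`, its signature, the returned status string, and the file name are all assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_compute_sketch(path, compute, save=True):
    """Hypothetical helper (NOT the repo's function): load a pickle or compute it."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh), "loaded"
    result = compute()
    if save:
        # Only a saving call leaves an artifact behind for the next run.
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
    return result, "computed"


def fit():
    # Stand-in for an expensive model fit.
    return {"model": "fitted"}


with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "propensity_model.pkl"  # hypothetical file name
    # save=False: nothing is written, so every call recomputes.
    _, first = load_or_compute_sketch(artifact, fit, save=False)
    _, second = load_or_compute_sketch(artifact, fit, save=False)
    # save=True: the first call writes the artifact, the next one loads it.
    _, third = load_or_compute_sketch(artifact, fit, save=True)
    _, fourth = load_or_compute_sketch(artifact, fit, save=True)

statuses = [first, second, third, fourth]
print(statuses)
```

Under this pattern, an artifact that is neither shipped nor saved is refit on each execution, which matches what the notebook did for the experimental propensity model.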
So: yes, I can run the replication archive end-to-end in the intended default mode. A strict from-scratch rerun including PSM is not self-contained on Python alone; it needs R-side dependencies and considerably more runtime.
Main replicated continuous-outcome results
Experimental benchmark from cached artifact:
| | sample | estimate | stderr | lower | upper |
|---|---|---|---|---|---|
| 0 | Exp Untrimmed | 0.2046 | 0.0587 | 0.0896 | 0.3195 |
| 1 | Exp Trimmed Off | 0.1652 | 0.0826 | 0.0032 | 0.3272 |
| 2 | Exp Trimmed | 0.2099 | 0.0774 | 0.0583 | 0.3615 |
Trimmed observational estimates:
| | Estimator | Estimate | SE | CI_Lower | CI_Upper |
|---|---|---|---|---|---|
| 0 | Reg | 0.2823 | 0.0578 | 0.1689 | 0.3956 |
| 1 | OM | 0.2336 | 0.0569 | 0.1221 | 0.3452 |
| 2 | IPW | 0.2060 | 0.0712 | 0.0664 | 0.3456 |
| 3 | PSM | 0.3391 | 0.0571 | 0.2271 | 0.4510 |
| 4 | DR | 0.1989 | 0.0561 | 0.0890 | 0.3088 |
Untrimmed observational estimates:
| | Estimator | Estimate | SE | CI_Lower | CI_Upper |
|---|---|---|---|---|---|
| 0 | Reg | 0.3255 | 0.0545 | 0.2187 | 0.4323 |
| 1 | OM | 0.3311 | 0.0532 | 0.2268 | 0.4354 |
| 2 | IPW | 0.1917 | 0.0720 | 0.0505 | 0.3329 |
| 3 | PSM | 0.7433 | 0.0528 | 0.6399 | 0.8468 |
| 4 | DR | 0.2215 | 0.0502 | 0.1230 | 0.3199 |
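The trimmed estimates line up with the paper’s stated message: after trimming and using the preferred nuisance setup, most methods are close to the experimental continuous-outcome benchmark. All five trimmed point estimates (0.1989 to 0.3391) fall inside the Exp Trimmed interval [0.0583, 0.3615], and IPW and DR are within roughly 0.01 of its 0.2099 point estimate. The untrimmed sample shows more dispersion: Reg, OM, and PSM sit above the Exp Untrimmed upper bound of 0.3195, with PSM (0.7433) especially far from the experimental benchmark.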
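The comparison against the benchmark can be checked directly from the reported numbers. The sketch below makes two assumptions that the report does not state explicitly: that the intervals are normal-approximation 95% intervals (estimate ± 1.96 × SE, which reproduces the reported bounds to rounding), and that trimmed observational estimates are compared with the Exp Trimmed row and untrimmed ones with Exp Untrimmed. The variable and function names are illustrative.

```python
Z_95 = 1.96  # assumed normal-approximation critical value

# Experimental benchmark intervals (lower, upper) from the cached artifact table.
exp_benchmark = {"trimmed": (0.0583, 0.3615), "untrimmed": (0.0896, 0.3195)}

# Observational (estimate, SE) pairs copied from the two tables above.
obs = {
    "trimmed": {
        "Reg": (0.2823, 0.0578), "OM": (0.2336, 0.0569), "IPW": (0.2060, 0.0712),
        "PSM": (0.3391, 0.0571), "DR": (0.1989, 0.0561),
    },
    "untrimmed": {
        "Reg": (0.3255, 0.0545), "OM": (0.3311, 0.0532), "IPW": (0.1917, 0.0720),
        "PSM": (0.7433, 0.0528), "DR": (0.2215, 0.0502),
    },
}


def normal_ci(estimate, se, z=Z_95):
    """Normal-approximation confidence interval around a point estimate."""
    return estimate - z * se, estimate + z * se


def outside_benchmark(sample):
    """Estimators whose point estimate falls outside the experimental interval."""
    lower, upper = exp_benchmark[sample]
    return [name for name, (est, _) in obs[sample].items() if not lower <= est <= upper]


trimmed_outside = outside_benchmark("trimmed")
untrimmed_outside = outside_benchmark("untrimmed")
reg_ci = normal_ci(*obs["trimmed"]["Reg"])
print(trimmed_outside, untrimmed_outside, reg_ci)
```

Checking whether a point estimate lies inside the benchmark interval is a coarse descriptive comparison; it is not a formal test of equality between the experimental and observational estimates.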
Bottom line
- The two released datasets are parallel product-release samples: one randomized, one observational/self-selected.
- They are not a LaLonde-style reuse of the same treated units with a different observational control pool.
- The observational sample has real selection into treatment, but the experimental and observational samples are designed to cover similar covariate regions.
- The archive is runnable in its default mode from the released repo/data. Full from-scratch replication is only partially self-contained because PSM depends on external R tooling, and some expensive AutoML pieces are intentionally cached rather than rerun by default.