This page ports the ROS average predictive comparison example. We fit logistic regressions for switching wells and then average counterfactual changes in predicted probability over the observed households.
Setup and data
Code
from pathlib import Pathimport numpy as npimport pandas as pdfrom scipy.special import expitfrom python.bayes_glm import bayes_logitroot = Path("../../ROS-Examples")wells = pd.read_csv(root /"Arsenic/data/wells.csv")wells["dist100"] = wells["dist"] /100wells["educ4"] = wells["educ"] /4wells.head()
switch
arsenic
dist
dist100
assoc
educ
educ4
0
1
2.36
16.826000
0.16826
0
0
0.0
1
1
0.71
47.321999
0.47322
0
0
0.0
2
0
2.07
20.966999
0.20967
0
10
2.5
3
1
1.15
21.486000
0.21486
0
12
3.0
4
1
1.10
40.874001
0.40874
1
14
3.5
Model with distance, arsenic, and education
The R page uses stan_glm; the Python port uses a Bayesian logistic regression with weakly informative Normal priors, then averages over posterior draws.
distance: 0 to 100 meters -0.20
arsenic: 0.5 to 1.0 0.06
education: 0 to 12 years 0.12
dtype: float64
These are average changes in predicted probability, not regression coefficients. They are on the probability scale and therefore depend on the observed distribution of the other predictors.
Interaction model
The original also computes the distance APC after centering and adding interactions with education.
For a centered predictor, c_dist100=0 is an average-distance household. The original comparison sets the centered distance predictor to 0 and 1, again averaging over observed arsenic and education.
# Arsenic wells: average predictive comparisonsSource: `Arsenic/arsenic_logistic_apc.Rmd`This page ports the ROS average predictive comparison example. We fit logistic regressions for switching wells and then average counterfactual changes in predicted probability over the observed households.## Setup and data```{python}from pathlib import Pathimport numpy as npimport pandas as pdfrom scipy.special import expitfrom python.bayes_glm import bayes_logitroot = Path("../../ROS-Examples")wells = pd.read_csv(root /"Arsenic/data/wells.csv")wells["dist100"] = wells["dist"] /100wells["educ4"] = wells["educ"] /4wells.head()```## Model with distance, arsenic, and educationThe R page uses `stan_glm`; the Python port uses a Bayesian logistic regression with weakly informative Normal priors, then averages over posterior draws.```{python}fit_7 = bayes_logit("switch ~ dist100 + arsenic + educ4", data=wells, draws=2000, seed=1201)fit_7.summary().round(2)```## Average predictive comparisonsA helper for changing one covariate while holding each household's other covariates fixed:```{python}def average_predictive_difference(fit, data, variable, lo, hi): lo_data = data.copy() hi_data = data.copy() lo_data[variable] = lo hi_data[variable] = hireturnfloat((fit.epred(hi_data) - fit.epred(lo_data)).mean(axis=1).mean())apc_distance = average_predictive_difference(fit_7, wells, "dist100", 0, 1)apc_arsenic = average_predictive_difference(fit_7, wells, "arsenic", 0.5, 1.0)apc_education = average_predictive_difference(fit_7, wells, "educ4", 0, 3) # 0 vs 12 yearspd.Series({"distance: 0 to 100 meters": apc_distance,"arsenic: 0.5 to 1.0": apc_arsenic,"education: 0 to 12 years": apc_education,}).round(2)```These are average changes in predicted probability, not regression coefficients. They are on the probability scale and therefore depend on the observed distribution of the other predictors.## Interaction modelThe original also computes the distance APC after centering and adding interactions with education.```{python}wells["c_dist100"] = wells["dist100"] - wells["dist100"].mean()wells["c_arsenic"] = wells["arsenic"] - wells["arsenic"].mean()wells["c_educ4"] = wells["educ4"] - wells["educ4"].mean()fit_8 = bayes_logit("switch ~ c_dist100 + c_arsenic + c_educ4 + c_dist100:c_educ4 + c_arsenic:c_educ4", data=wells, draws=2000, seed=1202,)fit_8.summary().round(2)```For a centered predictor, `c_dist100=0` is an average-distance household. The original comparison sets the centered distance predictor to 0 and 1, again averaging over observed arsenic and education.```{python}apc_distance_interaction = average_predictive_difference(fit_8, wells, "c_dist100", 0, 1)round(apc_distance_interaction, 2)```## Direct formula checkThe same APC can be computed by evaluating the fitted linear predictor by hand:```{python}b = fit_8.beta_drawsnames = fit_8.coef_namesidx = {name: j for j, name inenumerate(names)}hi, lo =1, 0eta_hi = ( b[:, idx["Intercept"], None] + b[:, idx["c_dist100"], None] * hi + b[:, idx["c_arsenic"], None] * wells["c_arsenic"].to_numpy()+ b[:, idx["c_educ4"], None] * wells["c_educ4"].to_numpy()+ b[:, idx["c_dist100:c_educ4"], None] * hi * wells["c_educ4"].to_numpy()+ b[:, idx["c_arsenic:c_educ4"], None] * wells["c_arsenic"].to_numpy() * wells["c_educ4"].to_numpy())eta_lo = eta_hi - b[:, idx["c_dist100"], None] - b[:, idx["c_dist100:c_educ4"], None] * wells["c_educ4"].to_numpy()round(float((expit(eta_hi) - expit(eta_lo)).mean()), 2)```