Arsenic wells: average predictive comparisons

Source: Arsenic/arsenic_logistic_apc.Rmd

This page ports the ROS average predictive comparison example. We fit logistic regressions for switching wells and then average counterfactual changes in predicted probability over the observed households.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit

root = Path("../../ROS-Examples")
wells = pd.read_csv(root / "Arsenic/data/wells.csv")
wells["dist100"] = wells["dist"] / 100
wells["educ4"] = wells["educ"] / 4
wells.head()

	switch	arsenic	dist	dist100	assoc	educ	educ4
0	1	2.36	16.826000	0.16826	0	0	0.0
1	1	0.71	47.321999	0.47322	0	0	0.0
2	0	2.07	20.966999	0.20967	0	10	2.5
3	1	1.15	21.486000	0.21486	0	12	3.0
4	1	1.10	40.874001	0.40874	1	14	3.5

Model with distance, arsenic, and education

The R page uses stan_glm; the Python port uses a Bayesian logistic regression with weakly informative Normal priors, then averages over posterior draws.

Code

fit_7 = bayes_logit("switch ~ dist100 + arsenic + educ4", data=wells, draws=2000, seed=1201)
fit_7.summary().round(2)

	mean	sd	q5	q50	q95
Intercept	-0.21	0.09	-0.37	-0.22	-0.06
dist100	-0.90	0.11	-1.07	-0.89	-0.72
arsenic	0.47	0.04	0.40	0.47	0.54
educ4	0.17	0.04	0.11	0.17	0.23

Average predictive comparisons

A helper for changing one covariate while holding each household’s other covariates fixed:

Code

def average_predictive_difference(fit, data, variable, lo, hi):
    lo_data = data.copy()
    hi_data = data.copy()
    lo_data[variable] = lo
    hi_data[variable] = hi
    return float((fit.epred(hi_data) - fit.epred(lo_data)).mean(axis=1).mean())

apc_distance = average_predictive_difference(fit_7, wells, "dist100", 0, 1)
apc_arsenic = average_predictive_difference(fit_7, wells, "arsenic", 0.5, 1.0)
apc_education = average_predictive_difference(fit_7, wells, "educ4", 0, 3)  # 0 vs 12 years
pd.Series({
    "distance: 0 to 100 meters": apc_distance,
    "arsenic: 0.5 to 1.0": apc_arsenic,
    "education: 0 to 12 years": apc_education,
}).round(2)

distance: 0 to 100 meters   -0.20
arsenic: 0.5 to 1.0          0.06
education: 0 to 12 years     0.12
dtype: float64

These are average changes in predicted probability, not regression coefficients. They are on the probability scale and therefore depend on the observed distribution of the other predictors.

Interaction model

The original also computes the distance APC after centering and adding interactions with education.

Code

wells["c_dist100"] = wells["dist100"] - wells["dist100"].mean()
wells["c_arsenic"] = wells["arsenic"] - wells["arsenic"].mean()
wells["c_educ4"] = wells["educ4"] - wells["educ4"].mean()
fit_8 = bayes_logit(
    "switch ~ c_dist100 + c_arsenic + c_educ4 + c_dist100:c_educ4 + c_arsenic:c_educ4",
    data=wells,
    draws=2000,
    seed=1202,
)
fit_8.summary().round(2)

	mean	sd	q5	q50	q95
Intercept	0.34	0.04	0.28	0.34	0.41
c_dist100	-0.92	0.10	-1.09	-0.92	-0.74
c_arsenic	0.49	0.04	0.42	0.49	0.56
c_educ4	0.19	0.04	0.13	0.19	0.25
c_dist100:c_educ4	0.33	0.11	0.16	0.33	0.50
c_arsenic:c_educ4	0.08	0.04	0.01	0.08	0.15

For a centered predictor, c_dist100=0 is an average-distance household. The original comparison sets the centered distance predictor to 0 and 1, again averaging over observed arsenic and education.

Code

apc_distance_interaction = average_predictive_difference(fit_8, wells, "c_dist100", 0, 1)
round(apc_distance_interaction, 2)

-0.21

Direct formula check

The same APC can be computed by evaluating the fitted linear predictor by hand:

Code

b = fit_8.beta_draws
names = fit_8.coef_names
idx = {name: j for j, name in enumerate(names)}
hi, lo = 1, 0
eta_hi = (
    b[:, idx["Intercept"], None] + b[:, idx["c_dist100"], None] * hi + b[:, idx["c_arsenic"], None] * wells["c_arsenic"].to_numpy()
    + b[:, idx["c_educ4"], None] * wells["c_educ4"].to_numpy()
    + b[:, idx["c_dist100:c_educ4"], None] * hi * wells["c_educ4"].to_numpy()
    + b[:, idx["c_arsenic:c_educ4"], None] * wells["c_arsenic"].to_numpy() * wells["c_educ4"].to_numpy()
)
eta_lo = eta_hi - b[:, idx["c_dist100"], None] - b[:, idx["c_dist100:c_educ4"], None] * wells["c_educ4"].to_numpy()
round(float((expit(eta_hi) - expit(eta_lo)).mean()), 2)

-0.21

# Arsenic wells: average predictive comparisons Source: `Arsenic/arsenic_logistic_apc.Rmd` This page ports the ROS average predictive comparison example. We fit logistic regressions for switching wells and then average counterfactual changes in predicted probability over the observed households. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd from scipy.special import expit from python.bayes_glm import bayes_logit root = Path("../../ROS-Examples") wells = pd.read_csv(root / "Arsenic/data/wells.csv") wells["dist100"] = wells["dist"] / 100 wells["educ4"] = wells["educ"] / 4 wells.head() ``` ## Model with distance, arsenic, and education The R page uses `stan_glm`; the Python port uses a Bayesian logistic regression with weakly informative Normal priors, then averages over posterior draws. ```{python} fit_7 = bayes_logit("switch ~ dist100 + arsenic + educ4", data=wells, draws=2000, seed=1201) fit_7.summary().round(2) ``` ## Average predictive comparisons A helper for changing one covariate while holding each household's other covariates fixed: ```{python} def average_predictive_difference(fit, data, variable, lo, hi): lo_data = data.copy() hi_data = data.copy() lo_data[variable] = lo hi_data[variable] = hi return float((fit.epred(hi_data) - fit.epred(lo_data)).mean(axis=1).mean()) apc_distance = average_predictive_difference(fit_7, wells, "dist100", 0, 1) apc_arsenic = average_predictive_difference(fit_7, wells, "arsenic", 0.5, 1.0) apc_education = average_predictive_difference(fit_7, wells, "educ4", 0, 3) # 0 vs 12 years pd.Series({ "distance: 0 to 100 meters": apc_distance, "arsenic: 0.5 to 1.0": apc_arsenic, "education: 0 to 12 years": apc_education, }).round(2) ``` These are average changes in predicted probability, not regression coefficients. They are on the probability scale and therefore depend on the observed distribution of the other predictors. ## Interaction model The original also computes the distance APC after centering and adding interactions with education. ```{python} wells["c_dist100"] = wells["dist100"] - wells["dist100"].mean() wells["c_arsenic"] = wells["arsenic"] - wells["arsenic"].mean() wells["c_educ4"] = wells["educ4"] - wells["educ4"].mean() fit_8 = bayes_logit( "switch ~ c_dist100 + c_arsenic + c_educ4 + c_dist100:c_educ4 + c_arsenic:c_educ4", data=wells, draws=2000, seed=1202, ) fit_8.summary().round(2) ``` For a centered predictor, `c_dist100=0` is an average-distance household. The original comparison sets the centered distance predictor to 0 and 1, again averaging over observed arsenic and education. ```{python} apc_distance_interaction = average_predictive_difference(fit_8, wells, "c_dist100", 0, 1) round(apc_distance_interaction, 2) ``` ## Direct formula check The same APC can be computed by evaluating the fitted linear predictor by hand: ```{python} b = fit_8.beta_draws names = fit_8.coef_names idx = {name: j for j, name in enumerate(names)} hi, lo = 1, 0 eta_hi = ( b[:, idx["Intercept"], None] + b[:, idx["c_dist100"], None] * hi + b[:, idx["c_arsenic"], None] * wells["c_arsenic"].to_numpy() + b[:, idx["c_educ4"], None] * wells["c_educ4"].to_numpy() + b[:, idx["c_dist100:c_educ4"], None] * hi * wells["c_educ4"].to_numpy() + b[:, idx["c_arsenic:c_educ4"], None] * wells["c_arsenic"].to_numpy() * wells["c_educ4"].to_numpy() ) eta_lo = eta_hi - b[:, idx["c_dist100"], None] - b[:, idx["c_dist100:c_educ4"], None] * wells["c_educ4"].to_numpy() round(float((expit(eta_hi) - expit(eta_lo)).mean()), 2) ```