Arsenic wells: average predictive comparisons

Source: Arsenic/arsenic_logistic_apc.Rmd

This page ports the ROS average predictive comparison example. We fit logistic regressions for switching wells and then average counterfactual changes in predicted probability over the observed households.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit

root = Path("../../ROS-Examples")
wells = pd.read_csv(root / "Arsenic/data/wells.csv")
wells["dist100"] = wells["dist"] / 100
wells["educ4"] = wells["educ"] / 4
wells.head()
switch arsenic dist dist100 assoc educ educ4
0 1 2.36 16.826000 0.16826 0 0 0.0
1 1 0.71 47.321999 0.47322 0 0 0.0
2 0 2.07 20.966999 0.20967 0 10 2.5
3 1 1.15 21.486000 0.21486 0 12 3.0
4 1 1.10 40.874001 0.40874 1 14 3.5

Model with distance, arsenic, and education

The R page uses stan_glm; the Python port uses a Bayesian logistic regression with weakly informative Normal priors, then averages over posterior draws.

Code
fit_7 = bayes_logit("switch ~ dist100 + arsenic + educ4", data=wells, draws=2000, seed=1201)
fit_7.summary().round(2)
mean sd q5 q50 q95
Intercept -0.21 0.09 -0.37 -0.22 -0.06
dist100 -0.90 0.11 -1.07 -0.89 -0.72
arsenic 0.47 0.04 0.40 0.47 0.54
educ4 0.17 0.04 0.11 0.17 0.23

Average predictive comparisons

A helper for changing one covariate while holding each household’s other covariates fixed:

Code
def average_predictive_difference(fit, data, variable, lo, hi):
    lo_data = data.copy()
    hi_data = data.copy()
    lo_data[variable] = lo
    hi_data[variable] = hi
    return float((fit.epred(hi_data) - fit.epred(lo_data)).mean(axis=1).mean())

apc_distance = average_predictive_difference(fit_7, wells, "dist100", 0, 1)
apc_arsenic = average_predictive_difference(fit_7, wells, "arsenic", 0.5, 1.0)
apc_education = average_predictive_difference(fit_7, wells, "educ4", 0, 3)  # 0 vs 12 years
pd.Series({
    "distance: 0 to 100 meters": apc_distance,
    "arsenic: 0.5 to 1.0": apc_arsenic,
    "education: 0 to 12 years": apc_education,
}).round(2)
distance: 0 to 100 meters   -0.20
arsenic: 0.5 to 1.0          0.06
education: 0 to 12 years     0.12
dtype: float64

These are average changes in predicted probability, not regression coefficients. They are on the probability scale and therefore depend on the observed distribution of the other predictors.

Interaction model

The original also computes the distance APC after centering and adding interactions with education.

Code
wells["c_dist100"] = wells["dist100"] - wells["dist100"].mean()
wells["c_arsenic"] = wells["arsenic"] - wells["arsenic"].mean()
wells["c_educ4"] = wells["educ4"] - wells["educ4"].mean()
fit_8 = bayes_logit(
    "switch ~ c_dist100 + c_arsenic + c_educ4 + c_dist100:c_educ4 + c_arsenic:c_educ4",
    data=wells,
    draws=2000,
    seed=1202,
)
fit_8.summary().round(2)
mean sd q5 q50 q95
Intercept 0.34 0.04 0.28 0.34 0.41
c_dist100 -0.92 0.10 -1.09 -0.92 -0.74
c_arsenic 0.49 0.04 0.42 0.49 0.56
c_educ4 0.19 0.04 0.13 0.19 0.25
c_dist100:c_educ4 0.33 0.11 0.16 0.33 0.50
c_arsenic:c_educ4 0.08 0.04 0.01 0.08 0.15

For a centered predictor, c_dist100=0 is an average-distance household. The original comparison sets the centered distance predictor to 0 and 1, again averaging over observed arsenic and education.

Code
apc_distance_interaction = average_predictive_difference(fit_8, wells, "c_dist100", 0, 1)
round(apc_distance_interaction, 2)
-0.21

Direct formula check

The same APC can be computed by evaluating the fitted linear predictor by hand:

Code
b = fit_8.beta_draws
names = fit_8.coef_names
idx = {name: j for j, name in enumerate(names)}
hi, lo = 1, 0
eta_hi = (
    b[:, idx["Intercept"], None] + b[:, idx["c_dist100"], None] * hi + b[:, idx["c_arsenic"], None] * wells["c_arsenic"].to_numpy()
    + b[:, idx["c_educ4"], None] * wells["c_educ4"].to_numpy()
    + b[:, idx["c_dist100:c_educ4"], None] * hi * wells["c_educ4"].to_numpy()
    + b[:, idx["c_arsenic:c_educ4"], None] * wells["c_arsenic"].to_numpy() * wells["c_educ4"].to_numpy()
)
eta_lo = eta_hi - b[:, idx["c_dist100"], None] - b[:, idx["c_dist100:c_educ4"], None] * wells["c_educ4"].to_numpy()
round(float((expit(eta_hi) - expit(eta_lo)).mean()), 2)
-0.21