CmdStanPy diabetes logistic regression

Source notebook: diabetes.ipynb
Source Stan files: logistic_t.stan, logistic_hs.stan

The diabetes notebook is the BDA demo collection’s most direct bridge into the Applied Bayesian Computing spine. It takes a familiar binary-outcome regression, writes the model in Stan, samples with CmdStanPy, and uses ArviZ for posterior summaries and PSIS-LOO.

The data are the Pima Indians diabetes measurements. The notebook lowercases column names, removes rows where selected biomedical variables are coded as zero, standardizes predictors, and then models outcome as Bernoulli.

Baseline model

The first Stan model is logistic regression with Student-$t$ priors:

parameters {
  real alpha;
  vector[d] beta;
}
transformed parameters {
  vector[n] eta = alpha + X * beta;
}
model {
  alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale);
  beta ~ student_t(p_beta_df, p_beta_loc, p_beta_scale);
  y ~ bernoulli_logit(eta);
}
generated quantities {
  vector[n] log_lik;
  for (i in 1:n)
    log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);
}

The generated log_lik vector is not decorative. It is the interface to ArviZ model comparison:

Code

from pathlib import Path
import numpy as np
import pandas as pd
from cmdstanpy import CmdStanModel
import arviz as az

bda_root = Path("../../BDA_py_demos")
if not bda_root.exists():
    bda_root = Path("/Users/alal/tmp/ros-python-book/BDA_py_demos")

diabetes = pd.read_csv(bda_root / "utilities_and_data/diabetes.csv")
diabetes.columns = diabetes.columns.str.lower()
cols = ["glucose", "bloodpressure", "skinthickness", "insulin", "bmi", "diabetespedigreefunction", "age"]
diabetes = diabetes[(diabetes[cols] != 0).all(axis=1)].copy()
X = diabetes[cols]
X = (X - X.mean()) / X.std()
y = diabetes["outcome"].astype(int).to_numpy()
data1 = {
    "n": len(y), "d": X.shape[1], "X": X.to_numpy(), "y": y,
    "p_alpha_df": 7, "p_alpha_loc": 0, "p_alpha_scale": 2.5,
    "p_beta_df": 7, "p_beta_loc": 0, "p_beta_scale": 2.5,
}

model = CmdStanModel(stan_file=str(bda_root / "demos_cmdstanpy/logistic_t.stan"))
fit = model.sample(data=data1, seed=74749, chains=2, iter_warmup=300, iter_sampling=300, show_progress=False)
idata_t = az.from_cmdstanpy(fit)
az.summary(idata_t, var_names=["alpha", "beta"]).head()

	mean	sd	hdi_3%	hdi_97%	mcse_mean	mcse_sd	ess_bulk	ess_tail	r_hat
alpha	-1.005	0.154	-1.292	-0.742	0.005	0.007	926.0	367.0	1.01
beta[0]	1.209	0.173	0.867	1.528	0.007	0.006	536.0	402.0	1.00
beta[1]	-0.010	0.145	-0.267	0.277	0.005	0.006	909.0	422.0	1.00
beta[2]	0.130	0.181	-0.208	0.462	0.008	0.006	576.0	488.0	1.00
beta[3]	-0.111	0.164	-0.426	0.176	0.006	0.005	748.0	547.0	1.00

The prior choice, $t_7(0,2.5)$ on standardized coefficients, is a weakly informative default: most effects are expected near zero, but the prior still allows large log-odds shifts.

Diagnostics before interpretation

The notebook calls utility functions for divergences, energy, and treedepth, then checks effective sample sizes and $\hat R$. This order is the right habit:

Did HMC explore the posterior without geometry warnings?
Are chains mixed well enough for the summaries we want?
Only then, what do the coefficients and predictions imply?

A clean az.summary(..., kind="diagnostics") is not proof the model is true; it is permission to start interpreting the fitted posterior.

Horseshoe variant

The second Stan model replaces independent Student-$t$ coefficient priors with a hierarchical shrinkage prior. The implementation uses a non-centered auxiliary representation:

vector[d] z;
vector<lower=0>[d] lambda_r1;
vector<lower=0>[d] lambda_r2;
real<lower=0> tau_r1;
real<lower=0> tau_r2;

lambda = lambda_r1 .* sqrt(lambda_r2);
tau = tau_r1 * sqrt(tau_r2);
beta = z .* (lambda * tau);

The global scale is set from a prior guess about sparsity:

Code

p0 = 2  # prior guess for the number of relevant predictors
n, p = data1["n"], data1["d"]
tau0 = p0 / (p - p0) * 1 / np.sqrt(n)
tau0

np.float64(0.020203050891044218)

That line is the substantive modeling move. It says: before seeing the outcome, we expect only a small number of predictors to carry signal. The horseshoe then shrinks weak coefficients aggressively while allowing strong ones to escape.

What to preserve in the book translation

Keep the preprocessing visible. Standardization is part of the prior specification because coefficient scale determines the meaning of student_t(7, 0, 2.5).
Keep Stan code in separate .stan files for full examples; inline Stan snippets are fine for commentary, but real workflows should compile model files.
Always include log_lik in generated quantities for models that will be compared with PSIS-LOO.
Use the same ArviZ InferenceData path for all backends. CmdStanPy, BlackJAX, and future samplers should converge to comparable idata objects.

Relation to ROS examples

This is the Bayesian version of the logistic-regression examples in the ROS part of the book. The conceptual progression is:

fit a logistic regression and understand coefficients on the logit scale;
standardize predictors so weakly informative priors mean what they say;
run HMC and diagnose computation;
compare prior structures with PSIS-LOO;
use posterior predictive quantities rather than only coefficient tables.

The notebook is marked work-in-progress upstream, but it is already strong enough to serve as the house template for CmdStanPy binary regressions.

# CmdStanPy diabetes logistic regression Source notebook: [`diabetes.ipynb`](https://github.com/avehtari/BDA_py_demos/blob/master/demos_cmdstanpy/diabetes.ipynb) Source Stan files: `logistic_t.stan`, `logistic_hs.stan` The diabetes notebook is the BDA demo collection's most direct bridge into the Applied Bayesian Computing spine. It takes a familiar binary-outcome regression, writes the model in Stan, samples with CmdStanPy, and uses ArviZ for posterior summaries and PSIS-LOO. The data are the Pima Indians diabetes measurements. The notebook lowercases column names, removes rows where selected biomedical variables are coded as zero, standardizes predictors, and then models `outcome` as Bernoulli. ## Baseline model The first Stan model is logistic regression with Student-$t$ priors: ```stan parameters { real alpha; vector[d] beta; } transformed parameters { vector[n] eta = alpha + X * beta; } model { alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale); beta ~ student_t(p_beta_df, p_beta_loc, p_beta_scale); y ~ bernoulli_logit(eta); } generated quantities { vector[n] log_lik; for (i in 1:n) log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]); } ``` The generated `log_lik` vector is not decorative. It is the interface to ArviZ model comparison: ```{python} from pathlib import Path import numpy as np import pandas as pd from cmdstanpy import CmdStanModel import arviz as az bda_root = Path("../../BDA_py_demos") if not bda_root.exists(): bda_root = Path("/Users/alal/tmp/ros-python-book/BDA_py_demos") diabetes = pd.read_csv(bda_root / "utilities_and_data/diabetes.csv") diabetes.columns = diabetes.columns.str.lower() cols = ["glucose", "bloodpressure", "skinthickness", "insulin", "bmi", "diabetespedigreefunction", "age"] diabetes = diabetes[(diabetes[cols] != 0).all(axis=1)].copy() X = diabetes[cols] X = (X - X.mean()) / X.std() y = diabetes["outcome"].astype(int).to_numpy() data1 = { "n": len(y), "d": X.shape[1], "X": X.to_numpy(), "y": y, "p_alpha_df": 7, "p_alpha_loc": 0, "p_alpha_scale": 2.5, "p_beta_df": 7, "p_beta_loc": 0, "p_beta_scale": 2.5, } model = CmdStanModel(stan_file=str(bda_root / "demos_cmdstanpy/logistic_t.stan")) fit = model.sample(data=data1, seed=74749, chains=2, iter_warmup=300, iter_sampling=300, show_progress=False) idata_t = az.from_cmdstanpy(fit) az.summary(idata_t, var_names=["alpha", "beta"]).head() ``` The prior choice, $t_7(0,2.5)$ on standardized coefficients, is a weakly informative default: most effects are expected near zero, but the prior still allows large log-odds shifts. ## Diagnostics before interpretation The notebook calls utility functions for divergences, energy, and treedepth, then checks effective sample sizes and $\hat R$. This order is the right habit: 1. Did HMC explore the posterior without geometry warnings? 2. Are chains mixed well enough for the summaries we want? 3. Only then, what do the coefficients and predictions imply? A clean `az.summary(..., kind="diagnostics")` is not proof the model is true; it is permission to start interpreting the fitted posterior. ## Horseshoe variant The second Stan model replaces independent Student-$t$ coefficient priors with a hierarchical shrinkage prior. The implementation uses a non-centered auxiliary representation: ```stan vector[d] z; vector<lower=0>[d] lambda_r1; vector<lower=0>[d] lambda_r2; real<lower=0> tau_r1; real<lower=0> tau_r2; lambda = lambda_r1 .* sqrt(lambda_r2); tau = tau_r1 * sqrt(tau_r2); beta = z .* (lambda * tau); ``` The global scale is set from a prior guess about sparsity: ```{python} p0 = 2 # prior guess for the number of relevant predictors n, p = data1["n"], data1["d"] tau0 = p0 / (p - p0) * 1 / np.sqrt(n) tau0 ``` That line is the substantive modeling move. It says: before seeing the outcome, we expect only a small number of predictors to carry signal. The horseshoe then shrinks weak coefficients aggressively while allowing strong ones to escape. ## What to preserve in the book translation - Keep the preprocessing visible. Standardization is part of the prior specification because coefficient scale determines the meaning of `student_t(7, 0, 2.5)`. - Keep Stan code in separate `.stan` files for full examples; inline Stan snippets are fine for commentary, but real workflows should compile model files. - Always include `log_lik` in generated quantities for models that will be compared with PSIS-LOO. - Use the same ArviZ `InferenceData` path for all backends. CmdStanPy, BlackJAX, and future samplers should converge to comparable `idata` objects. ## Relation to ROS examples This is the Bayesian version of the logistic-regression examples in the ROS part of the book. The conceptual progression is: 1. fit a logistic regression and understand coefficients on the logit scale; 2. standardize predictors so weakly informative priors mean what they say; 3. run HMC and diagnose computation; 4. compare prior structures with PSIS-LOO; 5. use posterior predictive quantities rather than only coefficient tables. The notebook is marked work-in-progress upstream, but it is already strong enough to serve as the house template for CmdStanPy binary regressions.