Source notebook: diabetes.ipynb
Source Stan files: logistic_t.stan, logistic_hs.stan
The diabetes notebook is the BDA demo collection’s most direct bridge into the Applied Bayesian Computing spine. It takes a familiar binary-outcome regression, writes the model in Stan, samples with CmdStanPy, and uses ArviZ for posterior summaries and PSIS-LOO.
The data are the Pima Indians diabetes measurements. The notebook lowercases column names, removes rows where selected biomedical variables are coded as zero, standardizes predictors, and then models outcome as Bernoulli.
Baseline model
The first Stan model is logistic regression with Student-\(t\) priors:
parameters {real alpha;vector[d] beta;}transformed parameters {vector[n] eta = alpha + X * beta;}model { alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale); beta ~ student_t(p_beta_df, p_beta_loc, p_beta_scale); y ~ bernoulli_logit(eta);}generated quantities {vector[n] log_lik;for (i in1:n) log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);}
The generated log_lik vector is not decorative. It is the interface to ArviZ model comparison:
The prior choice, \(t_7(0,2.5)\) on standardized coefficients, is a weakly informative default: most effects are expected near zero, but the prior still allows large log-odds shifts.
Diagnostics before interpretation
The notebook calls utility functions for divergences, energy, and treedepth, then checks effective sample sizes and \(\hat R\). This order is the right habit:
Did HMC explore the posterior without geometry warnings?
Are chains mixed well enough for the summaries we want?
Only then, what do the coefficients and predictions imply?
A clean az.summary(..., kind="diagnostics") is not proof the model is true; it is permission to start interpreting the fitted posterior.
Horseshoe variant
The second Stan model replaces independent Student-\(t\) coefficient priors with a hierarchical shrinkage prior. The implementation uses a non-centered auxiliary representation:
The global scale is set from a prior guess about sparsity:
Code
p0 =2# prior guess for the number of relevant predictorsn, p = data1["n"], data1["d"]tau0 = p0 / (p - p0) *1/ np.sqrt(n)tau0
np.float64(0.020203050891044218)
That line is the substantive modeling move. It says: before seeing the outcome, we expect only a small number of predictors to carry signal. The horseshoe then shrinks weak coefficients aggressively while allowing strong ones to escape.
What to preserve in the book translation
Keep the preprocessing visible. Standardization is part of the prior specification because coefficient scale determines the meaning of student_t(7, 0, 2.5).
Keep Stan code in separate .stan files for full examples; inline Stan snippets are fine for commentary, but real workflows should compile model files.
Always include log_lik in generated quantities for models that will be compared with PSIS-LOO.
Use the same ArviZ InferenceData path for all backends. CmdStanPy, BlackJAX, and future samplers should converge to comparable idata objects.
Relation to ROS examples
This is the Bayesian version of the logistic-regression examples in the ROS part of the book. The conceptual progression is:
fit a logistic regression and understand coefficients on the logit scale;
standardize predictors so weakly informative priors mean what they say;
run HMC and diagnose computation;
compare prior structures with PSIS-LOO;
use posterior predictive quantities rather than only coefficient tables.
The notebook is marked work-in-progress upstream, but it is already strong enough to serve as the house template for CmdStanPy binary regressions.
# CmdStanPy diabetes logistic regressionSource notebook: [`diabetes.ipynb`](https://github.com/avehtari/BDA_py_demos/blob/master/demos_cmdstanpy/diabetes.ipynb)Source Stan files: `logistic_t.stan`, `logistic_hs.stan`The diabetes notebook is the BDA demo collection's most direct bridge into the Applied Bayesian Computing spine. It takes a familiar binary-outcome regression, writes the model in Stan, samples with CmdStanPy, and uses ArviZ for posterior summaries and PSIS-LOO.The data are the Pima Indians diabetes measurements. The notebook lowercases column names, removes rows where selected biomedical variables are coded as zero, standardizes predictors, and then models `outcome` as Bernoulli.## Baseline modelThe first Stan model is logistic regression with Student-$t$ priors:```stanparameters { real alpha; vector[d] beta;}transformed parameters { vector[n] eta = alpha + X * beta;}model { alpha ~ student_t(p_alpha_df, p_alpha_loc, p_alpha_scale); beta ~ student_t(p_beta_df, p_beta_loc, p_beta_scale); y ~ bernoulli_logit(eta);}generated quantities { vector[n] log_lik; for (i in 1:n) log_lik[i] = bernoulli_logit_lpmf(y[i] | eta[i]);}```The generated `log_lik` vector is not decorative. It is the interface to ArviZ model comparison:```{python}from pathlib import Pathimport numpy as npimport pandas as pdfrom cmdstanpy import CmdStanModelimport arviz as azbda_root = Path("../../BDA_py_demos")ifnot bda_root.exists(): bda_root = Path("/Users/alal/tmp/ros-python-book/BDA_py_demos")diabetes = pd.read_csv(bda_root /"utilities_and_data/diabetes.csv")diabetes.columns = diabetes.columns.str.lower()cols = ["glucose", "bloodpressure", "skinthickness", "insulin", "bmi", "diabetespedigreefunction", "age"]diabetes = diabetes[(diabetes[cols] !=0).all(axis=1)].copy()X = diabetes[cols]X = (X - X.mean()) / X.std()y = diabetes["outcome"].astype(int).to_numpy()data1 = {"n": len(y), "d": X.shape[1], "X": X.to_numpy(), "y": y,"p_alpha_df": 7, "p_alpha_loc": 0, "p_alpha_scale": 2.5,"p_beta_df": 7, "p_beta_loc": 0, "p_beta_scale": 2.5,}model = CmdStanModel(stan_file=str(bda_root /"demos_cmdstanpy/logistic_t.stan"))fit = model.sample(data=data1, seed=74749, chains=2, iter_warmup=300, iter_sampling=300, show_progress=False)idata_t = az.from_cmdstanpy(fit)az.summary(idata_t, var_names=["alpha", "beta"]).head()```The prior choice, $t_7(0,2.5)$ on standardized coefficients, is a weakly informative default: most effects are expected near zero, but the prior still allows large log-odds shifts.## Diagnostics before interpretationThe notebook calls utility functions for divergences, energy, and treedepth, then checks effective sample sizes and $\hat R$. This order is the right habit:1. Did HMC explore the posterior without geometry warnings?2. Are chains mixed well enough for the summaries we want?3. Only then, what do the coefficients and predictions imply?A clean `az.summary(..., kind="diagnostics")` is not proof the model is true; it is permission to start interpreting the fitted posterior.## Horseshoe variantThe second Stan model replaces independent Student-$t$ coefficient priors with a hierarchical shrinkage prior. The implementation uses a non-centered auxiliary representation:```stanvector[d] z;vector<lower=0>[d] lambda_r1;vector<lower=0>[d] lambda_r2;real<lower=0> tau_r1;real<lower=0> tau_r2;lambda = lambda_r1 .* sqrt(lambda_r2);tau = tau_r1 * sqrt(tau_r2);beta = z .* (lambda * tau);```The global scale is set from a prior guess about sparsity:```{python}p0 =2# prior guess for the number of relevant predictorsn, p = data1["n"], data1["d"]tau0 = p0 / (p - p0) *1/ np.sqrt(n)tau0```That line is the substantive modeling move. It says: before seeing the outcome, we expect only a small number of predictors to carry signal. The horseshoe then shrinks weak coefficients aggressively while allowing strong ones to escape.## What to preserve in the book translation- Keep the preprocessing visible. Standardization is part of the prior specification because coefficient scale determines the meaning of `student_t(7, 0, 2.5)`.- Keep Stan code in separate `.stan` files for full examples; inline Stan snippets are fine for commentary, but real workflows should compile model files.- Always include `log_lik` in generated quantities for models that will be compared with PSIS-LOO.- Use the same ArviZ `InferenceData` path for all backends. CmdStanPy, BlackJAX, and future samplers should converge to comparable `idata` objects.## Relation to ROS examplesThis is the Bayesian version of the logistic-regression examples in the ROS part of the book. The conceptual progression is:1. fit a logistic regression and understand coefficients on the logit scale;2. standardize predictors so weakly informative priors mean what they say;3. run HMC and diagnose computation;4. compare prior structures with PSIS-LOO;5. use posterior predictive quantities rather than only coefficient tables.The notebook is marked work-in-progress upstream, but it is already strong enough to serve as the house template for CmdStanPy binary regressions.