KidIQ: multiple linear regression

Source: KidIQ/kidiq.Rmd

This page ports the core rstanarm::stan_glm examples to Python. The classical fit uses statsmodels; the Bayesian version uses CmdStanPy with weakly informative priors.

Load data

Code

from pathlib import Path
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv")
kidiq.head()

	kid_score	mom_hs	mom_iq	mom_work	mom_age
0	65	1	121.117529	4	27
1	98	1	89.361882	4	25
2	85	1	115.443165	4	27
3	83	1	99.449639	3	25
4	115	1	92.745710	4	27

Single binary predictor

R original:

stan_glm(kid_score ~ mom_hs, data=kidiq)

Python:

Code

fit_1 = smf.ols("kid_score ~ mom_hs", data=kidiq).fit()
fit_1.summary()

OLS Regression Results
Dep. Variable:	kid_score	R-squared:	0.056
Model:	OLS	Adj. R-squared:	0.054
Method:	Least Squares	F-statistic:	25.69
Date:	Sun, 31 May 2026	Prob (F-statistic):	5.96e-07
Time:	23:02:54	Log-Likelihood:	-1911.8
No. Observations:	434	AIC:	3828.
Df Residuals:	432	BIC:	3836.
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	77.5484	2.059	37.670	0.000	73.502	81.595
mom_hs	11.7713	2.322	5.069	0.000	7.207	16.336

Omnibus:	11.077	Durbin-Watson:	1.464
Prob(Omnibus):	0.004	Jarque-Bera (JB):	11.316
Skew:	-0.373	Prob(JB):	0.00349
Kurtosis:	2.738	Cond. No.	4.11

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Single continuous predictor

Code

fit_2 = smf.ols("kid_score ~ mom_iq", data=kidiq).fit()
fit_2.params

Intercept    25.799778
mom_iq        0.609975
dtype: float64

Code

ax = kidiq.plot.scatter("mom_iq", "kid_score", alpha=0.7)
xs = np.linspace(kidiq.mom_iq.min(), kidiq.mom_iq.max(), 100)
ax.plot(xs, fit_2.params["Intercept"] + fit_2.params["mom_iq"] * xs, color="black")
ax.set_xlabel("Mother IQ score")
ax.set_ylabel("Child test score")

Text(0, 0.5, 'Child test score')

Two predictors

Code

fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()
fit_3.params

Intercept    25.731538
mom_hs        5.950117
mom_iq        0.563906
dtype: float64

Two fitted lines, no interaction:

Code

fig, ax = plt.subplots()
colors = np.where(kidiq.mom_hs == 1, "black", "gray")
ax.scatter(kidiq.mom_iq, kidiq.kid_score, c=colors, s=18)
for hs, color in [(0, "gray"), (1, "black")]:
    ax.plot(xs, fit_3.params["Intercept"] + fit_3.params["mom_hs"]*hs + fit_3.params["mom_iq"]*xs, color=color)
ax.set_xlabel("Mother IQ score")
ax.set_ylabel("Child test score")

Text(0, 0.5, 'Child test score')

Interaction model

Code

fit_4 = smf.ols("kid_score ~ mom_hs * mom_iq", data=kidiq).fit()
fit_4.params

Intercept       -11.482021
mom_hs           51.268223
mom_iq            0.968889
mom_hs:mom_iq    -0.484275
dtype: float64

CmdStanPy equivalent

For a direct Bayesian analog to stan_glm, use a Gaussian regression with weak priors.

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(X * beta, sigma);
}

Code

# Build X with intercept, mom_hs, mom_iq and sample using CmdStanPy.
# This is the CmdStanPy replacement for rstanarm::stan_glm(kid_score ~ mom_hs + mom_iq).

BlackJAX relevance

This example is a good teaching case for writing the Gaussian regression log density by hand, but CmdStanPy or statsmodels is preferable for the main exposition. BlackJAX becomes useful when we want explicit NUTS mechanics or JAX-vectorized repeated simulations.

# KidIQ: multiple linear regression Source: `KidIQ/kidiq.Rmd` This page ports the core `rstanarm::stan_glm` examples to Python. The classical fit uses `statsmodels`; the Bayesian version uses CmdStanPy with weakly informative priors. ## Load data ```{python} from pathlib import Path import pandas as pd import numpy as np import statsmodels.formula.api as smf import matplotlib.pyplot as plt root = Path("../../ROS-Examples") kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv") kidiq.head() ``` ## Single binary predictor R original: ```r stan_glm(kid_score ~ mom_hs, data=kidiq) ``` Python: ```{python} fit_1 = smf.ols("kid_score ~ mom_hs", data=kidiq).fit() fit_1.summary() ``` ## Single continuous predictor ```{python} fit_2 = smf.ols("kid_score ~ mom_iq", data=kidiq).fit() fit_2.params ``` ```{python} ax = kidiq.plot.scatter("mom_iq", "kid_score", alpha=0.7) xs = np.linspace(kidiq.mom_iq.min(), kidiq.mom_iq.max(), 100) ax.plot(xs, fit_2.params["Intercept"] + fit_2.params["mom_iq"] * xs, color="black") ax.set_xlabel("Mother IQ score") ax.set_ylabel("Child test score") ``` ## Two predictors ```{python} fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit() fit_3.params ``` Two fitted lines, no interaction: ```{python} fig, ax = plt.subplots() colors = np.where(kidiq.mom_hs == 1, "black", "gray") ax.scatter(kidiq.mom_iq, kidiq.kid_score, c=colors, s=18) for hs, color in [(0, "gray"), (1, "black")]: ax.plot(xs, fit_3.params["Intercept"] + fit_3.params["mom_hs"]*hs + fit_3.params["mom_iq"]*xs, color=color) ax.set_xlabel("Mother IQ score") ax.set_ylabel("Child test score") ``` ## Interaction model ```{python} fit_4 = smf.ols("kid_score ~ mom_hs * mom_iq", data=kidiq).fit() fit_4.params ``` ## CmdStanPy equivalent For a direct Bayesian analog to `stan_glm`, use a Gaussian regression with weak priors. ```stan data { int<lower=1> N; int<lower=1> K; matrix[N, K] X; vector[N] y; } parameters { vector[K] beta; real<lower=0> sigma; } model { beta ~ normal(0, 10); sigma ~ exponential(1); y ~ normal(X * beta, sigma); } ``` ```{python} # Build X with intercept, mom_hs, mom_iq and sample using CmdStanPy. # This is the CmdStanPy replacement for rstanarm::stan_glm(kid_score ~ mom_hs + mom_iq). ``` ## BlackJAX relevance This example is a good teaching case for writing the Gaussian regression log density by hand, but CmdStanPy or statsmodels is preferable for the main exposition. BlackJAX becomes useful when we want explicit NUTS mechanics or JAX-vectorized repeated simulations.