KidIQ: multiple linear regression

Source: KidIQ/kidiq.Rmd

This page ports the core rstanarm::stan_glm examples to Python. The classical fit uses statsmodels; the Bayesian version uses CmdStanPy with weakly informative priors.

Load data

Code
from pathlib import Path
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv")
kidiq.head()
kid_score mom_hs mom_iq mom_work mom_age
0 65 1 121.117529 4 27
1 98 1 89.361882 4 25
2 85 1 115.443165 4 27
3 83 1 99.449639 3 25
4 115 1 92.745710 4 27

Single binary predictor

R original:

stan_glm(kid_score ~ mom_hs, data=kidiq)

Python:

Code
fit_1 = smf.ols("kid_score ~ mom_hs", data=kidiq).fit()
fit_1.summary()
OLS Regression Results
Dep. Variable: kid_score R-squared: 0.056
Model: OLS Adj. R-squared: 0.054
Method: Least Squares F-statistic: 25.69
Date: Sun, 31 May 2026 Prob (F-statistic): 5.96e-07
Time: 23:02:54 Log-Likelihood: -1911.8
No. Observations: 434 AIC: 3828.
Df Residuals: 432 BIC: 3836.
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 77.5484 2.059 37.670 0.000 73.502 81.595
mom_hs 11.7713 2.322 5.069 0.000 7.207 16.336
Omnibus: 11.077 Durbin-Watson: 1.464
Prob(Omnibus): 0.004 Jarque-Bera (JB): 11.316
Skew: -0.373 Prob(JB): 0.00349
Kurtosis: 2.738 Cond. No. 4.11


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Single continuous predictor

Code
fit_2 = smf.ols("kid_score ~ mom_iq", data=kidiq).fit()
fit_2.params
Intercept    25.799778
mom_iq        0.609975
dtype: float64
Code
ax = kidiq.plot.scatter("mom_iq", "kid_score", alpha=0.7)
xs = np.linspace(kidiq.mom_iq.min(), kidiq.mom_iq.max(), 100)
ax.plot(xs, fit_2.params["Intercept"] + fit_2.params["mom_iq"] * xs, color="black")
ax.set_xlabel("Mother IQ score")
ax.set_ylabel("Child test score")
Text(0, 0.5, 'Child test score')

Two predictors

Code
fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()
fit_3.params
Intercept    25.731538
mom_hs        5.950117
mom_iq        0.563906
dtype: float64

Two fitted lines, no interaction:

Code
fig, ax = plt.subplots()
colors = np.where(kidiq.mom_hs == 1, "black", "gray")
ax.scatter(kidiq.mom_iq, kidiq.kid_score, c=colors, s=18)
for hs, color in [(0, "gray"), (1, "black")]:
    ax.plot(xs, fit_3.params["Intercept"] + fit_3.params["mom_hs"]*hs + fit_3.params["mom_iq"]*xs, color=color)
ax.set_xlabel("Mother IQ score")
ax.set_ylabel("Child test score")
Text(0, 0.5, 'Child test score')

Interaction model

Code
fit_4 = smf.ols("kid_score ~ mom_hs * mom_iq", data=kidiq).fit()
fit_4.params
Intercept       -11.482021
mom_hs           51.268223
mom_iq            0.968889
mom_hs:mom_iq    -0.484275
dtype: float64

CmdStanPy equivalent

For a direct Bayesian analog to stan_glm, use a Gaussian regression with weak priors.

data {
  int<lower=1> N;
  int<lower=1> K;
  matrix[N, K] X;
  vector[N] y;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(X * beta, sigma);
}
Code
# Build X with intercept, mom_hs, mom_iq and sample using CmdStanPy.
# This is the CmdStanPy replacement for rstanarm::stan_glm(kid_score ~ mom_hs + mom_iq).

BlackJAX relevance

This example is a good teaching case for writing the Gaussian regression log density by hand, but CmdStanPy or statsmodels is preferable for the main exposition. BlackJAX becomes useful when we want explicit NUTS mechanics or JAX-vectorized repeated simulations.