KidIQ: k-fold cross-validation

Source: KidIQ/kidiq_kcv.Rmd

The R example fits kid_score ~ mom_hs + mom_iq with stan_glm, then compares leave-one-out and k-fold cross-validation. This Python port keeps the same predictive target and demonstrates k-fold CV with sklearn and statsmodels. A Bayesian CmdStanPy/ArviZ version would use the same folds but average predictive densities over posterior draws.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

root = Path("../../ROS-Examples")
kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv")
SEED = 1507
kidiq.head()
kid_score mom_hs mom_iq mom_work mom_age
0 65 1 121.117529 4 27
1 98 1 89.361882 4 25
2 85 1 115.443165 4 27
3 83 1 99.449639 3 25
4 115 1 92.745710 4 27

Fit the original regression

Code
fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()
fit_3.summary()
OLS Regression Results
Dep. Variable: kid_score R-squared: 0.214
Model: OLS Adj. R-squared: 0.210
Method: Least Squares F-statistic: 58.72
Date: Sun, 31 May 2026 Prob (F-statistic): 2.79e-23
Time: 23:03:12 Log-Likelihood: -1872.0
No. Observations: 434 AIC: 3750.
Df Residuals: 431 BIC: 3762.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 25.7315 5.875 4.380 0.000 14.184 37.279
mom_hs 5.9501 2.212 2.690 0.007 1.603 10.297
mom_iq 0.5639 0.061 9.309 0.000 0.445 0.683
Omnibus: 7.327 Durbin-Watson: 1.625
Prob(Omnibus): 0.026 Jarque-Bera (JB): 7.530
Skew: -0.313 Prob(JB): 0.0232
Kurtosis: 2.845 Cond. No. 683.


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
def gaussian_log_score(y, mu, sigma):
    return float(norm.logpdf(y, loc=mu, scale=sigma).sum())

plugin_log_score = gaussian_log_score(
    kidiq["kid_score"],
    fit_3.fittedvalues,
    np.sqrt(fit_3.scale),
)
round(plugin_log_score, 1)
-1872.0

K-fold cross-validation by hand

The function below refits the model on each training fold and evaluates the held-out Gaussian log predictive density. The residual standard deviation is estimated on the training fold.

Code
def kfold_log_score(data, formula, k=10, seed=SEED):
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    pointwise = np.empty(len(data))
    for train_idx, test_idx in kf.split(data):
        train = data.iloc[train_idx]
        test = data.iloc[test_idx]
        fit = smf.ols(formula, data=train).fit()
        mu = fit.predict(test)
        sigma = np.sqrt(fit.scale)
        pointwise[test_idx] = norm.logpdf(test["kid_score"], loc=mu, scale=sigma)
    return pointwise

kcv_i = kfold_log_score(kidiq, "kid_score ~ mom_hs + mom_iq", k=10)
kcv_i.sum().round(1), kcv_i.mean().round(3)
(np.float64(-1875.3), np.float64(-4.321))

Leave-one-out cross-validation

For ordinary least squares, exact LOO predictions can be computed from the hat values, avoiding 434 separate refits.

Code
infl = fit_3.get_influence()
h = infl.hat_matrix_diag
loo_resid = fit_3.resid / (1 - h)
loo_mu = kidiq["kid_score"] - loo_resid
loo_sigma = np.sqrt(((loo_resid ** 2).sum()) / (len(kidiq) - fit_3.df_model - 1))
loo_i = norm.logpdf(kidiq["kid_score"], loc=loo_mu, scale=loo_sigma)
loo_i.sum().round(1), loo_i.mean().round(3)
(np.float64(-1875.1), np.float64(-4.321))

Same calculation with scikit-learn scoring

cross_val_score reports a mean squared error by default for regression. This checks that the fold structure tells the same story on the more familiar RMSE scale.

Code
X = kidiq[["mom_hs", "mom_iq"]]
y = kidiq["kid_score"]
model = make_pipeline(StandardScaler(with_mean=True, with_std=True), LinearRegression())
neg_mse = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=10, shuffle=True, random_state=SEED),
    scoring="neg_mean_squared_error",
)
rmse_by_fold = np.sqrt(-neg_mse)
pd.Series(rmse_by_fold).describe().round(2)
count    10.00
mean     18.12
std       1.52
min      15.27
25%      17.49
50%      18.03
75%      19.49
max      20.03
dtype: float64

Compare a few formulas

Code
formulas = {
    "mom_hs": "kid_score ~ mom_hs",
    "mom_hs + mom_iq": "kid_score ~ mom_hs + mom_iq",
    "+ interaction": "kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq",
    "log(mom_iq) interaction": "kid_score ~ mom_hs + np.log(mom_iq) + mom_hs:np.log(mom_iq)",
}
comparison = pd.DataFrame({
    name: {
        "kfold_elpd": kfold_log_score(kidiq, formula, k=10).sum(),
        "in_sample_log_score": gaussian_log_score(
            kidiq["kid_score"],
            smf.ols(formula, data=kidiq).fit().fittedvalues,
            np.sqrt(smf.ols(formula, data=kidiq).fit().scale),
        ),
    }
    for name, formula in formulas.items()
}).T
comparison.sort_values("kfold_elpd", ascending=False).round(1)
kfold_elpd in_sample_log_score
log(mom_iq) interaction -1870.9 -1866.1
+ interaction -1872.4 -1867.6
mom_hs + mom_iq -1875.3 -1872.0
mom_hs -1914.4 -1911.8

Bayesian extension

To mirror stan_glm exactly, fit a Gaussian linear regression in CmdStanPy with generated quantities

generated quantities {
  vector[N] log_lik;
  for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | X[n] * beta, sigma);
}

Then call az.loo(idata) for PSIS-LOO or repeat the KFold loop above, fitting on each training fold and evaluating posterior predictive log densities on the held-out fold. The mechanics of the folds are the same as in the Python code above; the only difference is replacing plug-in normal densities with posterior-averaged predictive densities.