KidIQ: k-fold cross-validation

Source: KidIQ/kidiq_kcv.Rmd

The R example fits kid_score ~ mom_hs + mom_iq with stan_glm, then compares leave-one-out and k-fold cross-validation. This Python port keeps the same predictive target and demonstrates k-fold CV with sklearn and statsmodels. A Bayesian CmdStanPy/ArviZ version would use the same folds but average predictive densities over posterior draws.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import norm
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

root = Path("../../ROS-Examples")
kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv")
SEED = 1507
kidiq.head()

	kid_score	mom_hs	mom_iq	mom_work	mom_age
0	65	1	121.117529	4	27
1	98	1	89.361882	4	25
2	85	1	115.443165	4	27
3	83	1	99.449639	3	25
4	115	1	92.745710	4	27

Fit the original regression

Code

fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()
fit_3.summary()

OLS Regression Results
Dep. Variable:	kid_score	R-squared:	0.214
Model:	OLS	Adj. R-squared:	0.210
Method:	Least Squares	F-statistic:	58.72
Date:	Sun, 31 May 2026	Prob (F-statistic):	2.79e-23
Time:	23:03:12	Log-Likelihood:	-1872.0
No. Observations:	434	AIC:	3750.
Df Residuals:	431	BIC:	3762.
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	25.7315	5.875	4.380	0.000	14.184	37.279
mom_hs	5.9501	2.212	2.690	0.007	1.603	10.297
mom_iq	0.5639	0.061	9.309	0.000	0.445	0.683

Omnibus:	7.327	Durbin-Watson:	1.625
Prob(Omnibus):	0.026	Jarque-Bera (JB):	7.530
Skew:	-0.313	Prob(JB):	0.0232
Kurtosis:	2.845	Cond. No.	683.

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Code

def gaussian_log_score(y, mu, sigma):
    return float(norm.logpdf(y, loc=mu, scale=sigma).sum())

plugin_log_score = gaussian_log_score(
    kidiq["kid_score"],
    fit_3.fittedvalues,
    np.sqrt(fit_3.scale),
)
round(plugin_log_score, 1)

-1872.0

K-fold cross-validation by hand

The function below refits the model on each training fold and evaluates the held-out Gaussian log predictive density. The residual standard deviation is estimated on the training fold.

Code

def kfold_log_score(data, formula, k=10, seed=SEED):
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    pointwise = np.empty(len(data))
    for train_idx, test_idx in kf.split(data):
        train = data.iloc[train_idx]
        test = data.iloc[test_idx]
        fit = smf.ols(formula, data=train).fit()
        mu = fit.predict(test)
        sigma = np.sqrt(fit.scale)
        pointwise[test_idx] = norm.logpdf(test["kid_score"], loc=mu, scale=sigma)
    return pointwise

kcv_i = kfold_log_score(kidiq, "kid_score ~ mom_hs + mom_iq", k=10)
kcv_i.sum().round(1), kcv_i.mean().round(3)

(np.float64(-1875.3), np.float64(-4.321))

Leave-one-out cross-validation

For ordinary least squares, exact LOO predictions can be computed from the hat values, avoiding 434 separate refits.

Code

infl = fit_3.get_influence()
h = infl.hat_matrix_diag
loo_resid = fit_3.resid / (1 - h)
loo_mu = kidiq["kid_score"] - loo_resid
loo_sigma = np.sqrt(((loo_resid ** 2).sum()) / (len(kidiq) - fit_3.df_model - 1))
loo_i = norm.logpdf(kidiq["kid_score"], loc=loo_mu, scale=loo_sigma)
loo_i.sum().round(1), loo_i.mean().round(3)

(np.float64(-1875.1), np.float64(-4.321))

Same calculation with scikit-learn scoring

cross_val_score reports a mean squared error by default for regression. This checks that the fold structure tells the same story on the more familiar RMSE scale.

Code

X = kidiq[["mom_hs", "mom_iq"]]
y = kidiq["kid_score"]
model = make_pipeline(StandardScaler(with_mean=True, with_std=True), LinearRegression())
neg_mse = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=10, shuffle=True, random_state=SEED),
    scoring="neg_mean_squared_error",
)
rmse_by_fold = np.sqrt(-neg_mse)
pd.Series(rmse_by_fold).describe().round(2)

count    10.00
mean     18.12
std       1.52
min      15.27
25%      17.49
50%      18.03
75%      19.49
max      20.03
dtype: float64

Compare a few formulas

Code

formulas = {
    "mom_hs": "kid_score ~ mom_hs",
    "mom_hs + mom_iq": "kid_score ~ mom_hs + mom_iq",
    "+ interaction": "kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq",
    "log(mom_iq) interaction": "kid_score ~ mom_hs + np.log(mom_iq) + mom_hs:np.log(mom_iq)",
}
comparison = pd.DataFrame({
    name: {
        "kfold_elpd": kfold_log_score(kidiq, formula, k=10).sum(),
        "in_sample_log_score": gaussian_log_score(
            kidiq["kid_score"],
            smf.ols(formula, data=kidiq).fit().fittedvalues,
            np.sqrt(smf.ols(formula, data=kidiq).fit().scale),
        ),
    }
    for name, formula in formulas.items()
}).T
comparison.sort_values("kfold_elpd", ascending=False).round(1)

	kfold_elpd	in_sample_log_score
log(mom_iq) interaction	-1870.9	-1866.1
+ interaction	-1872.4	-1867.6
mom_hs + mom_iq	-1875.3	-1872.0
mom_hs	-1914.4	-1911.8

Bayesian extension

To mirror stan_glm exactly, fit a Gaussian linear regression in CmdStanPy with generated quantities

generated quantities {
  vector[N] log_lik;
  for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | X[n] * beta, sigma);
}

Then call az.loo(idata) for PSIS-LOO or repeat the KFold loop above, fitting on each training fold and evaluating posterior predictive log densities on the held-out fold. The mechanics of the folds are the same as in the Python code above; the only difference is replacing plug-in normal densities with posterior-averaged predictive densities.

# KidIQ: k-fold cross-validation Source: `KidIQ/kidiq_kcv.Rmd` The R example fits `kid_score ~ mom_hs + mom_iq` with `stan_glm`, then compares leave-one-out and k-fold cross-validation. This Python port keeps the same predictive target and demonstrates k-fold CV with `sklearn` and `statsmodels`. A Bayesian CmdStanPy/ArviZ version would use the same folds but average predictive densities over posterior draws. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd import statsmodels.formula.api as smf from scipy.stats import norm from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score from sklearn.linear_model import LinearRegression from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler root = Path("../../ROS-Examples") kidiq = pd.read_csv(root / "KidIQ/data/kidiq.csv") SEED = 1507 kidiq.head() ``` ## Fit the original regression ```{python} fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit() fit_3.summary() ``` ```{python} def gaussian_log_score(y, mu, sigma): return float(norm.logpdf(y, loc=mu, scale=sigma).sum()) plugin_log_score = gaussian_log_score( kidiq["kid_score"], fit_3.fittedvalues, np.sqrt(fit_3.scale), ) round(plugin_log_score, 1) ``` ## K-fold cross-validation by hand The function below refits the model on each training fold and evaluates the held-out Gaussian log predictive density. The residual standard deviation is estimated on the training fold. ```{python} def kfold_log_score(data, formula, k=10, seed=SEED): kf = KFold(n_splits=k, shuffle=True, random_state=seed) pointwise = np.empty(len(data)) for train_idx, test_idx in kf.split(data): train = data.iloc[train_idx] test = data.iloc[test_idx] fit = smf.ols(formula, data=train).fit() mu = fit.predict(test) sigma = np.sqrt(fit.scale) pointwise[test_idx] = norm.logpdf(test["kid_score"], loc=mu, scale=sigma) return pointwise kcv_i = kfold_log_score(kidiq, "kid_score ~ mom_hs + mom_iq", k=10) kcv_i.sum().round(1), kcv_i.mean().round(3) ``` ## Leave-one-out cross-validation For ordinary least squares, exact LOO predictions can be computed from the hat values, avoiding 434 separate refits. ```{python} infl = fit_3.get_influence() h = infl.hat_matrix_diag loo_resid = fit_3.resid / (1 - h) loo_mu = kidiq["kid_score"] - loo_resid loo_sigma = np.sqrt(((loo_resid ** 2).sum()) / (len(kidiq) - fit_3.df_model - 1)) loo_i = norm.logpdf(kidiq["kid_score"], loc=loo_mu, scale=loo_sigma) loo_i.sum().round(1), loo_i.mean().round(3) ``` ## Same calculation with scikit-learn scoring `cross_val_score` reports a mean squared error by default for regression. This checks that the fold structure tells the same story on the more familiar RMSE scale. ```{python} X = kidiq[["mom_hs", "mom_iq"]] y = kidiq["kid_score"] model = make_pipeline(StandardScaler(with_mean=True, with_std=True), LinearRegression()) neg_mse = cross_val_score( model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=SEED), scoring="neg_mean_squared_error", ) rmse_by_fold = np.sqrt(-neg_mse) pd.Series(rmse_by_fold).describe().round(2) ``` ## Compare a few formulas ```{python} formulas = { "mom_hs": "kid_score ~ mom_hs", "mom_hs + mom_iq": "kid_score ~ mom_hs + mom_iq", "+ interaction": "kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq", "log(mom_iq) interaction": "kid_score ~ mom_hs + np.log(mom_iq) + mom_hs:np.log(mom_iq)", } comparison = pd.DataFrame({ name: { "kfold_elpd": kfold_log_score(kidiq, formula, k=10).sum(), "in_sample_log_score": gaussian_log_score( kidiq["kid_score"], smf.ols(formula, data=kidiq).fit().fittedvalues, np.sqrt(smf.ols(formula, data=kidiq).fit().scale), ), } for name, formula in formulas.items() }).T comparison.sort_values("kfold_elpd", ascending=False).round(1) ``` ## Bayesian extension To mirror `stan_glm` exactly, fit a Gaussian linear regression in CmdStanPy with generated quantities ```stan generated quantities { vector[N] log_lik; for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | X[n] * beta, sigma); } ``` Then call `az.loo(idata)` for PSIS-LOO or repeat the `KFold` loop above, fitting on each training fold and evaluating posterior predictive log densities on the held-out fold. The mechanics of the folds are the same as in the Python code above; the only difference is replacing plug-in normal densities with posterior-averaged predictive densities.