The R example fits kid_score ~ mom_hs + mom_iq with stan_glm, then compares leave-one-out and k-fold cross-validation. This Python port keeps the same predictive target and demonstrates k-fold CV with sklearn and statsmodels. A Bayesian CmdStanPy/ArviZ version would use the same folds but average predictive densities over posterior draws.
Setup and data
Code
from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smffrom scipy.stats import normfrom sklearn.model_selection import KFold, LeaveOneOut, cross_val_scorefrom sklearn.linear_model import LinearRegressionfrom sklearn.pipeline import make_pipelinefrom sklearn.preprocessing import StandardScalerroot = Path("../../ROS-Examples")kidiq = pd.read_csv(root /"KidIQ/data/kidiq.csv")SEED =1507kidiq.head()
The function below refits the model on each training fold and evaluates the held-out Gaussian log predictive density. The residual standard deviation is estimated on the training fold.
cross_val_score reports a mean squared error by default for regression. This checks that the fold structure tells the same story on the more familiar RMSE scale.
Then call az.loo(idata) for PSIS-LOO or repeat the KFold loop above, fitting on each training fold and evaluating posterior predictive log densities on the held-out fold. The mechanics of the folds are the same as in the Python code above; the only difference is replacing plug-in normal densities with posterior-averaged predictive densities.
# KidIQ: k-fold cross-validationSource: `KidIQ/kidiq_kcv.Rmd`The R example fits `kid_score ~ mom_hs + mom_iq` with `stan_glm`, then compares leave-one-out and k-fold cross-validation. This Python port keeps the same predictive target and demonstrates k-fold CV with `sklearn` and `statsmodels`. A Bayesian CmdStanPy/ArviZ version would use the same folds but average predictive densities over posterior draws.## Setup and data```{python}from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smffrom scipy.stats import normfrom sklearn.model_selection import KFold, LeaveOneOut, cross_val_scorefrom sklearn.linear_model import LinearRegressionfrom sklearn.pipeline import make_pipelinefrom sklearn.preprocessing import StandardScalerroot = Path("../../ROS-Examples")kidiq = pd.read_csv(root /"KidIQ/data/kidiq.csv")SEED =1507kidiq.head()```## Fit the original regression```{python}fit_3 = smf.ols("kid_score ~ mom_hs + mom_iq", data=kidiq).fit()fit_3.summary()``````{python}def gaussian_log_score(y, mu, sigma):returnfloat(norm.logpdf(y, loc=mu, scale=sigma).sum())plugin_log_score = gaussian_log_score( kidiq["kid_score"], fit_3.fittedvalues, np.sqrt(fit_3.scale),)round(plugin_log_score, 1)```## K-fold cross-validation by handThe function below refits the model on each training fold and evaluates the held-out Gaussian log predictive density. The residual standard deviation is estimated on the training fold.```{python}def kfold_log_score(data, formula, k=10, seed=SEED): kf = KFold(n_splits=k, shuffle=True, random_state=seed) pointwise = np.empty(len(data))for train_idx, test_idx in kf.split(data): train = data.iloc[train_idx] test = data.iloc[test_idx] fit = smf.ols(formula, data=train).fit() mu = fit.predict(test) sigma = np.sqrt(fit.scale) pointwise[test_idx] = norm.logpdf(test["kid_score"], loc=mu, scale=sigma)return pointwisekcv_i = kfold_log_score(kidiq, "kid_score ~ mom_hs + mom_iq", k=10)kcv_i.sum().round(1), kcv_i.mean().round(3)```## Leave-one-out cross-validationFor ordinary least squares, exact LOO predictions can be computed from the hat values, avoiding 434 separate refits.```{python}infl = fit_3.get_influence()h = infl.hat_matrix_diagloo_resid = fit_3.resid / (1- h)loo_mu = kidiq["kid_score"] - loo_residloo_sigma = np.sqrt(((loo_resid **2).sum()) / (len(kidiq) - fit_3.df_model -1))loo_i = norm.logpdf(kidiq["kid_score"], loc=loo_mu, scale=loo_sigma)loo_i.sum().round(1), loo_i.mean().round(3)```## Same calculation with scikit-learn scoring`cross_val_score` reports a mean squared error by default for regression. This checks that the fold structure tells the same story on the more familiar RMSE scale.```{python}X = kidiq[["mom_hs", "mom_iq"]]y = kidiq["kid_score"]model = make_pipeline(StandardScaler(with_mean=True, with_std=True), LinearRegression())neg_mse = cross_val_score( model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=SEED), scoring="neg_mean_squared_error",)rmse_by_fold = np.sqrt(-neg_mse)pd.Series(rmse_by_fold).describe().round(2)```## Compare a few formulas```{python}formulas = {"mom_hs": "kid_score ~ mom_hs","mom_hs + mom_iq": "kid_score ~ mom_hs + mom_iq","+ interaction": "kid_score ~ mom_hs + mom_iq + mom_hs:mom_iq","log(mom_iq) interaction": "kid_score ~ mom_hs + np.log(mom_iq) + mom_hs:np.log(mom_iq)",}comparison = pd.DataFrame({ name: {"kfold_elpd": kfold_log_score(kidiq, formula, k=10).sum(),"in_sample_log_score": gaussian_log_score( kidiq["kid_score"], smf.ols(formula, data=kidiq).fit().fittedvalues, np.sqrt(smf.ols(formula, data=kidiq).fit().scale), ), }for name, formula in formulas.items()}).Tcomparison.sort_values("kfold_elpd", ascending=False).round(1)```## Bayesian extensionTo mirror `stan_glm` exactly, fit a Gaussian linear regression in CmdStanPy with generated quantities```stangenerated quantities { vector[N] log_lik; for (n in 1:N) log_lik[n] = normal_lpdf(y[n] | X[n] * beta, sigma);}```Then call `az.loo(idata)` for PSIS-LOO or repeat the `KFold` loop above, fitting on each training fold and evaluating posterior predictive log densities on the held-out fold. The mechanics of the folds are the same as in the Python code above; the only difference is replacing plug-in normal densities with posterior-averaged predictive densities.