Student grades: standardized predictors and coefficient priors

Source: Student/student.Rmd

This example uses many predictors for Portuguese student math grades. The central lesson is that coefficient priors are hard to interpret before standardizing predictors.

Code
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

root = Path("../../ROS-Examples")
data = pd.read_csv(root / "Student/data/student-merged.csv")
grades = ["G1mat", "G2mat", "G3mat", "G1por", "G2por", "G3por"]
predictors = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","traveltime","studytime","failures","schoolsup","famsup","paid","activities", "nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc","Walc","health","absences"]
data_g3 = data.loc[data.G3mat > 0, ["G3mat"] + predictors].copy()
data_g3.head()
G3mat school sex age address famsize Pstatus Medu Fedu traveltime ... higher internet romantic famrel freetime goout Dalc Walc health absences
0 10 0 0 15 0 0 1 1 1 2 ... 1 1 0 3 1 2 1 1 1 2
1 5 0 0 15 0 0 1 1 1 1 ... 1 1 1 3 3 4 2 4 5 2
2 13 0 0 15 0 0 1 2 2 1 ... 1 0 0 4 3 1 1 1 2 8
3 8 0 0 15 0 0 1 2 4 1 ... 1 1 0 4 3 2 1 1 5 2
4 10 0 0 15 0 0 1 3 3 2 ... 1 1 1 4 2 1 2 3 3 8

5 rows × 27 columns

Ordinary least squares with many predictors

Code
formula = "G3mat ~ " + " + ".join([f"C({v})" if data_g3[v].dtype == 'object' else v for v in predictors])
fit0 = smf.ols(formula, data=data_g3).fit()
fit0.rsquared, len(fit0.params)
(np.float64(0.28838153223153684), 27)
Code
coef = fit0.params.drop("Intercept", errors="ignore")
se = fit0.bse.reindex(coef.index)
summary = pd.DataFrame({"estimate": coef, "std_error": se}).sort_values("estimate")
summary.tail(10)
estimate std_error
higher 0.054212 1.011594
Dalc 0.127227 0.235346
activities 0.132871 0.337517
Medu 0.226696 0.196565
Fedu 0.245189 0.194686
famsize 0.275087 0.359530
studytime 0.366581 0.210866
address 0.507731 0.437060
sex 0.629514 0.367337
internet 0.749947 0.469349

Standardize numeric predictors

Code
datastd = data_g3.copy()
num_cols = [c for c in predictors if pd.api.types.is_numeric_dtype(datastd[c])]
datastd[num_cols] = StandardScaler().fit_transform(datastd[num_cols])
fit1 = smf.ols(formula, data=datastd).fit()
pd.DataFrame({"estimate": fit1.params, "std_error": fit1.bse}).drop(index="Intercept", errors="ignore").sort_values("estimate").tail(12)
estimate std_error
freetime 0.010428 0.174522
traveltime 0.012737 0.175900
famrel 0.013946 0.164525
activities 0.066333 0.168500
Dalc 0.116524 0.215546
famsize 0.124649 0.162912
address 0.206771 0.177991
Medu 0.245455 0.212831
Fedu 0.265063 0.210466
internet 0.266817 0.166985
studytime 0.307551 0.176911
sex 0.314692 0.183630

Standardization puts numeric predictors on comparable scales. This makes a common prior like \(\beta_j \sim N(0, 2.5)\) meaningful across coefficients.

Regularized regression as a Python analogue

Code
X = data_g3[predictors]
y = data_g3["G3mat"].to_numpy()
cat_cols = [c for c in predictors if not pd.api.types.is_numeric_dtype(X[c])]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols),
])
ridge = make_pipeline(pre, RidgeCV(alphas=np.logspace(-3, 3, 30)))
ridge.fit(X, y)
print("R^2:", ridge.score(X, y))
R^2: 0.2698053803177328

The ridge fit is not a replacement for the full Bayesian model, but it is a useful computational analogue: after standardization, shrinkage priors and penalization act on comparable coefficient scales.

CmdStanPy sketch

data { int<lower=1> N; int<lower=1> K; matrix[N,K] X; vector[N] y; }
parameters { real alpha; vector[K] beta; real<lower=0> sigma; }
model {
  alpha ~ normal(mean(y), 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(alpha + X * beta, sigma);
}