Student grades: standardized predictors and coefficient priors

Source: Student/student.Rmd

This example uses many predictors for Portuguese student math grades. The central lesson is that coefficient priors are hard to interpret before standardizing predictors.

Code

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import RidgeCV
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

root = Path("../../ROS-Examples")
data = pd.read_csv(root / "Student/data/student-merged.csv")
grades = ["G1mat", "G2mat", "G3mat", "G1por", "G2por", "G3por"]
predictors = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","traveltime","studytime","failures","schoolsup","famsup","paid","activities", "nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc","Walc","health","absences"]
data_g3 = data.loc[data.G3mat > 0, ["G3mat"] + predictors].copy()
data_g3.head()

	G3mat	age	Pstatus	Medu	Fedu	traveltime	...	higher	internet	romantic	famrel	freetime	goout	Dalc	Walc	health	absences
0	10	15	1	1	1	2	...	1	1	0	3	1	2	1	1	1	2
1	5	15	1	1	1	1	...	1	1	1	3	3	4	2	4	5	2
2	13	15	1	2	2	1	...	1	0	0	4	3	1	1	1	2	8
3	8	15	1	2	4	1	...	1	1	0	4	3	2	1	1	5	2
4	10	15	1	3	3	2	...	1	1	1	4	2	1	2	3	3	8

5 rows × 27 columns

Ordinary least squares with many predictors

Code

formula = "G3mat ~ " + " + ".join([f"C({v})" if data_g3[v].dtype == 'object' else v for v in predictors])
fit0 = smf.ols(formula, data=data_g3).fit()
fit0.rsquared, len(fit0.params)

(np.float64(0.28838153223153684), 27)

Code

coef = fit0.params.drop("Intercept", errors="ignore")
se = fit0.bse.reindex(coef.index)
summary = pd.DataFrame({"estimate": coef, "std_error": se}).sort_values("estimate")
summary.tail(10)

	estimate	std_error
higher	0.054212	1.011594
Dalc	0.127227	0.235346
activities	0.132871	0.337517
Medu	0.226696	0.196565
Fedu	0.245189	0.194686
famsize	0.275087	0.359530
studytime	0.366581	0.210866
address	0.507731	0.437060
sex	0.629514	0.367337
internet	0.749947	0.469349

Standardize numeric predictors

Code

datastd = data_g3.copy()
num_cols = [c for c in predictors if pd.api.types.is_numeric_dtype(datastd[c])]
datastd[num_cols] = StandardScaler().fit_transform(datastd[num_cols])
fit1 = smf.ols(formula, data=datastd).fit()
pd.DataFrame({"estimate": fit1.params, "std_error": fit1.bse}).drop(index="Intercept", errors="ignore").sort_values("estimate").tail(12)

	estimate	std_error
freetime	0.010428	0.174522
traveltime	0.012737	0.175900
famrel	0.013946	0.164525
activities	0.066333	0.168500
Dalc	0.116524	0.215546
famsize	0.124649	0.162912
address	0.206771	0.177991
Medu	0.245455	0.212831
Fedu	0.265063	0.210466
internet	0.266817	0.166985
studytime	0.307551	0.176911
sex	0.314692	0.183630

Standardization puts numeric predictors on comparable scales. This makes a common prior like $\beta_j \sim N(0, 2.5)$ meaningful across coefficients.

Regularized regression as a Python analogue

Code

X = data_g3[predictors]
y = data_g3["G3mat"].to_numpy()
cat_cols = [c for c in predictors if not pd.api.types.is_numeric_dtype(X[c])]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols),
])
ridge = make_pipeline(pre, RidgeCV(alphas=np.logspace(-3, 3, 30)))
ridge.fit(X, y)
print("R^2:", ridge.score(X, y))

R^2: 0.2698053803177328

The ridge fit is not a replacement for the full Bayesian model, but it is a useful computational analogue: after standardization, shrinkage priors and penalization act on comparable coefficient scales.

CmdStanPy sketch

data { int<lower=1> N; int<lower=1> K; matrix[N,K] X; vector[N] y; }
parameters { real alpha; vector[K] beta; real<lower=0> sigma; }
model {
  alpha ~ normal(mean(y), 5);
  beta ~ normal(0, 1);
  sigma ~ exponential(1);
  y ~ normal(alpha + X * beta, sigma);
}

# Student grades: standardized predictors and coefficient priors Source: `Student/student.Rmd` This example uses many predictors for Portuguese student math grades. The central lesson is that coefficient priors are hard to interpret before standardizing predictors. ```{python} from pathlib import Path import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.formula.api as smf from sklearn.preprocessing import StandardScaler from sklearn.linear_model import RidgeCV from sklearn.compose import ColumnTransformer from sklearn.pipeline import make_pipeline from sklearn.preprocessing import OneHotEncoder root = Path("../../ROS-Examples") data = pd.read_csv(root / "Student/data/student-merged.csv") grades = ["G1mat", "G2mat", "G3mat", "G1por", "G2por", "G3por"] predictors = ["school","sex","age","address","famsize","Pstatus","Medu","Fedu","traveltime","studytime","failures","schoolsup","famsup","paid","activities", "nursery", "higher", "internet", "romantic","famrel","freetime","goout","Dalc","Walc","health","absences"] data_g3 = data.loc[data.G3mat > 0, ["G3mat"] + predictors].copy() data_g3.head() ``` ## Ordinary least squares with many predictors ```{python} formula = "G3mat ~ " + " + ".join([f"C({v})" if data_g3[v].dtype == 'object' else v for v in predictors]) fit0 = smf.ols(formula, data=data_g3).fit() fit0.rsquared, len(fit0.params) ``` ```{python} coef = fit0.params.drop("Intercept", errors="ignore") se = fit0.bse.reindex(coef.index) summary = pd.DataFrame({"estimate": coef, "std_error": se}).sort_values("estimate") summary.tail(10) ``` ## Standardize numeric predictors ```{python} datastd = data_g3.copy() num_cols = [c for c in predictors if pd.api.types.is_numeric_dtype(datastd[c])] datastd[num_cols] = StandardScaler().fit_transform(datastd[num_cols]) fit1 = smf.ols(formula, data=datastd).fit() pd.DataFrame({"estimate": fit1.params, "std_error": fit1.bse}).drop(index="Intercept", errors="ignore").sort_values("estimate").tail(12) ``` Standardization puts numeric predictors on comparable scales. This makes a common prior like $\beta_j \sim N(0, 2.5)$ meaningful across coefficients. ## Regularized regression as a Python analogue ```{python} X = data_g3[predictors] y = data_g3["G3mat"].to_numpy() cat_cols = [c for c in predictors if not pd.api.types.is_numeric_dtype(X[c])] pre = ColumnTransformer([ ("num", StandardScaler(), num_cols), ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), cat_cols), ]) ridge = make_pipeline(pre, RidgeCV(alphas=np.logspace(-3, 3, 30))) ridge.fit(X, y) print("R^2:", ridge.score(X, y)) ``` The ridge fit is not a replacement for the full Bayesian model, but it is a useful computational analogue: after standardization, shrinkage priors and penalization act on comparable coefficient scales. ## CmdStanPy sketch ```stan data { int<lower=1> N; int<lower=1> K; matrix[N,K] X; vector[N] y; } parameters { real alpha; vector[K] beta; real<lower=0> sigma; } model { alpha ~ normal(mean(y), 5); beta ~ normal(0, 1); sigma ~ exponential(1); y ~ normal(alpha + X * beta, sigma); } ```