Earnings regression

Source: Earnings/earnings_regression.Rmd

The original page studies several regressions of annual earnings on height and sex: first on the dollar scale, then on log-earnings, with posterior predictive checks. This Python port keeps statsmodels for familiar least-squares summaries and uses the lightweight Bayesian linear-regression helper for posterior and posterior-predictive draws.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy.stats import gaussian_kde
from python.bayes_glm import bayes_lm

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
rng = np.random.default_rng(7783)
earnings["height_jitter"] = earnings["height"] + rng.uniform(-0.2, 0.2, len(earnings))
earnings.head()

	height	weight	male	earn	earnk	ethnicity	education	mother_education	father_education	walk	exercise	smokenow	tense	angry	age	height_jitter
0	74	210.0	1	50000.0	50.0	White	16.0	16.0	16.0	3	3	2.0	0.0	0.0	45	73.973591
1	66	125.0	0	60000.0	60.0	White	16.0	16.0	16.0	6	5	1.0	0.0	0.0	58	65.806431
2	64	126.0	0	30000.0	30.0	White	16.0	16.0	16.0	8	1	2.0	1.0	1.0	29	63.891963
3	65	200.0	0	25000.0	25.0	White	17.0	17.0	NaN	8	1	2.0	0.0	0.0	57	64.927581
4	63	110.0	0	50000.0	50.0	Other	16.0	16.0	16.0	5	6	2.0	0.0	0.0	91	63.064507

Code

earnings[["height", "male", "earn", "earnk", "age", "education"]].describe()

	height	male	earn	earnk	age	education
count	1816.000000	1816.000000	1816.000000	1816.000000	1816.000000	1814.000000
mean	66.568833	0.371696	21147.296256	21.147296	42.934471	13.235391
std	3.831822	0.483391	22531.765123	22.531765	17.161672	2.556638
min	57.000000	0.000000	0.000000	0.000000	18.000000	2.000000
25%	64.000000	0.000000	6000.000000	6.000000	29.000000	12.000000
50%	66.000000	0.000000	16000.000000	16.000000	39.000000	12.000000
75%	69.250000	1.000000	27000.000000	27.000000	56.000000	15.000000
max	82.000000	1.000000	400000.000000	400.000000	91.000000	18.000000

Normal linear regression on dollars

Code

fit_0 = smf.ols("earn ~ height", data=earnings).fit()
fit_0_bayes = bayes_lm("earn ~ height", data=earnings, draws=1000, prior_scale=10.0, seed=7783)
fit_0.summary()

OLS Regression Results
Dep. Variable:	earn	R-squared:	0.074
Model:	OLS	Adj. R-squared:	0.073
Method:	Least Squares	F-statistic:	144.1
Date:	Sun, 31 May 2026	Prob (F-statistic):	5.42e-32
Time:	23:01:48	Log-Likelihood:	-20708.
No. Observations:	1816	AIC:	4.142e+04
Df Residuals:	1814	BIC:	4.143e+04
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-8.503e+04	8860.650	-9.596	0.000	-1.02e+05	-6.76e+04
height	1594.9598	132.885	12.003	0.000	1334.336	1855.584

Omnibus:	1886.197	Durbin-Watson:	1.904
Prob(Omnibus):	0.000	Jarque-Bera (JB):	234212.585
Skew:	4.768	Prob(JB):	0.00
Kurtosis:	57.812	Cond. No.	1.16e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Code

sims_0 = fit_0_bayes.beta_draws
keep = earnings["earn"] <= 200_000
xs = np.linspace(earnings.height.min(), earnings.height.max(), 200)

fig, ax = plt.subplots()
ax.scatter(earnings.loc[keep, "height_jitter"], earnings.loc[keep, "earn"], s=12, alpha=0.55)
for b in sims_0[:10]:
    ax.plot(xs, b[0] + b[1] * xs, color="gray", alpha=0.45, linewidth=0.8)
ax.plot(xs, fit_0.params["Intercept"] + fit_0.params["height"] * xs, color="black")
ax.set_xlabel("height")
ax.set_ylabel("earnings")
ax.set_title("Fitted linear model")

Text(0.5, 1.0, 'Fitted linear model')

The intercept is not meaningful: the model is being extrapolated to height zero. That is the point of the next display.

Code

fig, ax = plt.subplots()
ax.scatter(earnings.loc[keep, "height_jitter"], earnings.loc[keep, "earn"], s=12, alpha=0.55)
for b in sims_0[:10]:
    ax.plot(np.linspace(0, earnings.height.max(), 200), b[0] + b[1] * np.linspace(0, earnings.height.max(), 200), color="gray", alpha=0.45, linewidth=0.8)
ax.plot(np.linspace(0, earnings.height.max(), 200), fit_0.params["Intercept"] + fit_0.params["height"] * np.linspace(0, earnings.height.max(), 200), color="black")
ax.axhline(0, color="gray", linewidth=0.8)
ax.set_xlim(0, earnings.height.max())
ax.set_ylim(-100_000, 220_000)
ax.set_xlabel("height")
ax.set_ylabel("earnings")
ax.set_title("x-axis extended to 0")

Text(0.5, 1.0, 'x-axis extended to 0')

Earnings in thousands of dollars

Scaling the outcome rescales the coefficients but does not otherwise change the fitted values.

Code

fit_1 = smf.ols("earnk ~ height", data=earnings).fit()
pd.DataFrame({
    "dollar_model": fit_0.params,
    "thousand_model_times_1000": fit_1.params * 1000,
})

	dollar_model	thousand_model_times_1000
Intercept	-85027.312609	-85027.312609
height	1594.959754	1594.959754

Add sex indicator and interaction

Code

fit_2 = smf.ols("earnk ~ height + male", data=earnings).fit()
fit_3 = smf.ols("earnk ~ height * male", data=earnings).fit()
fit_2.summary()

OLS Regression Results
Dep. Variable:	earnk	R-squared:	0.100
Model:	OLS	Adj. R-squared:	0.099
Method:	Least Squares	F-statistic:	100.3
Date:	Sun, 31 May 2026	Prob (F-statistic):	4.88e-42
Time:	23:01:48	Log-Likelihood:	-8137.7
No. Observations:	1816	AIC:	1.628e+04
Df Residuals:	1813	BIC:	1.630e+04
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-25.8722	11.962	-2.163	0.031	-49.332	-2.412
height	0.6470	0.185	3.493	0.000	0.284	1.010
male	10.6327	1.468	7.241	0.000	7.753	13.512

Omnibus:	1902.421	Durbin-Watson:	1.895
Prob(Omnibus):	0.000	Jarque-Bera (JB):	249218.549
Skew:	4.821	Prob(JB):	0.00
Kurtosis:	59.575	Cond. No.	1.59e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Code

fit_3.summary()

OLS Regression Results
Dep. Variable:	earnk	R-squared:	0.101
Model:	OLS	Adj. R-squared:	0.100
Method:	Least Squares	F-statistic:	67.85
Date:	Sun, 31 May 2026	Prob (F-statistic):	1.39e-41
Time:	23:01:48	Log-Likelihood:	-8136.3
No. Observations:	1816	AIC:	1.628e+04
Df Residuals:	1812	BIC:	1.630e+04
Df Model:	3
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	-8.4872	15.865	-0.535	0.593	-39.603	22.629
height	0.3774	0.246	1.535	0.125	-0.105	0.860
male	-31.5146	25.326	-1.244	0.214	-81.186	18.157
height:male	0.6229	0.374	1.667	0.096	-0.110	1.356

Omnibus:	1899.817	Durbin-Watson:	1.894
Prob(Omnibus):	0.000	Jarque-Bera (JB):	247552.972
Skew:	4.811	Prob(JB):	0.00
Kurtosis:	59.383	Cond. No.	4.05e+03

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.05e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Code

coef2 = fit_2.params * 1000
coef3 = fit_3.params * 1000
xgrid = np.linspace(earnings.height.min(), earnings.height.max(), 200)

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)
axes[0].plot(xgrid, coef2["Intercept"] + coef2["height"] * xgrid, color="tab:red", label="women")
axes[0].plot(xgrid, coef2["Intercept"] + coef2["male"] + coef2["height"] * xgrid, color="tab:blue", label="men")
axes[0].set_title("Parallel lines")

axes[1].plot(xgrid, coef3["Intercept"] + coef3["height"] * xgrid, color="tab:red", label="women")
axes[1].plot(xgrid, coef3["Intercept"] + coef3["male"] + (coef3["height"] + coef3["height:male"]) * xgrid, color="tab:blue", label="men")
axes[1].set_title("Interaction")

for ax in axes:
    ax.set_xlabel("height")
    ax.set_ylabel("predicted earnings")
    ax.legend(frameon=False)

Log earnings models

The dollar-scale normal model is a poor match for a strongly skewed outcome with many low values. The R page therefore compares models for log(earn) among respondents with positive earnings.

Code

pos = earnings[earnings.earn > 0].copy()
pos["log_earn"] = np.log(pos["earn"])
pos["log10_earn"] = np.log10(pos["earn"])
pos["log_height"] = np.log(pos["height"])
pos["z_height"] = (pos["height"] - pos["height"].mean()) / pos["height"].std()

logmodel_1 = smf.ols("log_earn ~ height", data=pos).fit()
logmodel_1_bayes = bayes_lm("log_earn ~ height", data=pos, draws=1000, prior_scale=10.0, seed=7784)
log10model_1 = smf.ols("log10_earn ~ height", data=pos).fit()
logmodel_2 = smf.ols("log_earn ~ height + male", data=pos).fit()
loglogmodel_2 = smf.ols("log_earn ~ log_height + male", data=pos).fit()
logmodel_3 = smf.ols("log_earn ~ height * male", data=pos).fit()
logmodel_3a = smf.ols("log_earn ~ z_height * male", data=pos).fit()

pd.DataFrame({
    "log height": logmodel_1.params,
    "log height+male": logmodel_2.params,
}).round(3)

	log height	log height+male
Intercept	5.909	7.976
height	0.057	0.024
male	NaN	0.372

Code

sims_log = logmodel_1_bayes.beta_draws
xs = np.linspace(pos.height.min(), pos.height.max(), 200)
fig, axes = plt.subplots(1, 2, figsize=(11, 4))

axes[0].scatter(pos.height_jitter, pos.log_earn, s=12, alpha=0.55)
for b in sims_log[:10]:
    axes[0].plot(xs, b[0] + b[1] * xs, color="gray", alpha=0.45, linewidth=0.8)
axes[0].plot(xs, logmodel_1.params["Intercept"] + logmodel_1.params["height"] * xs, color="black")
axes[0].set_xlabel("height")
axes[0].set_ylabel("log(earnings)")
axes[0].set_title("Log regression on log scale")

small = pos[pos.earn <= 200_000]
axes[1].scatter(small.height_jitter, small.earn, s=12, alpha=0.55)
for b in sims_log[:10]:
    axes[1].plot(xs, np.exp(b[0] + b[1] * xs), color="gray", alpha=0.45, linewidth=0.8)
axes[1].plot(xs, np.exp(logmodel_1.params["Intercept"] + logmodel_1.params["height"] * xs), color="black")
axes[1].set_xlabel("height")
axes[1].set_ylabel("earnings")
axes[1].set_title("Same fit on dollar scale")

Text(0.5, 1.0, 'Same fit on dollar scale')

Lightweight posterior predictive checks

The Bayesian helper supplies a direct analogue of posterior_predict(): draw coefficients and residual scale from the Gaussian linear-model posterior, then draw replicated outcomes.

Code

yrep_0 = fit_0_bayes.predict(seed=1)[:200]
yrep_log_1 = logmodel_1_bayes.predict(seed=2)[:200]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for draw in yrep_0[:80]:
    kde = gaussian_kde(draw)
    grid = np.linspace(np.percentile(earnings.earn, 1), np.percentile(earnings.earn, 99), 200)
    axes[0].plot(grid, kde(grid), color="gray", alpha=0.08)
axes[0].plot(grid, gaussian_kde(earnings.earn)(grid), color="black")
axes[0].set_title("earn")

log_grid = np.linspace(pos.log_earn.min(), pos.log_earn.max(), 200)
for draw in yrep_log_1[:80]:
    kde = gaussian_kde(draw)
    axes[1].plot(log_grid, kde(log_grid), color="gray", alpha=0.08)
axes[1].plot(log_grid, gaussian_kde(pos.log_earn)(log_grid), color="black")
axes[1].set_title("log(earn)")

Text(0.5, 1.0, 'log(earn)')

CmdStanPy analogue

Code

from cmdstanpy import CmdStanModel

stan_code = """
data {
  int<lower=1> N;
  matrix[N, 3] X;
  vector[N] y;
}
parameters {
  vector[3] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(X * beta, sigma);
}
generated quantities {
  vector[N] y_rep;
  for (n in 1:N) y_rep[n] = normal_rng(dot_product(row(X, n), beta), sigma);
}
"""
Path("earnings_height_male.stan").write_text(stan_code)
X = np.column_stack([np.ones(len(pos)), pos.height.to_numpy(), pos.male.to_numpy()])
model = CmdStanModel(stan_file="earnings_height_male.stan")
fit = model.sample(data={"N": len(pos), "X": X, "y": pos.log_earn.to_numpy()}, seed=7783)
fit.summary().loc[["beta[1]", "beta[2]", "beta[3]", "sigma"]]