Source: Earnings/earnings_regression.Rmd

The original page studies several regressions of annual earnings on height and sex: first on the dollar scale, then on log-earnings, with posterior predictive checks. This Python port keeps statsmodels for familiar least-squares summaries and uses the lightweight Bayesian linear-regression helper for posterior and posterior-predictive draws.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from scipy.stats import gaussian_kde
from python.bayes_glm import bayes_lm

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
rng = np.random.default_rng(7783)
earnings["height_jitter"] = earnings["height"] + rng.uniform(-0.2, 0.2, len(earnings))
earnings.head()
height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age height_jitter
0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45 73.973591
1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58 65.806431
2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29 63.891963
3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57 64.927581
4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91 63.064507
Code
earnings[["height", "male", "earn", "earnk", "age", "education"]].describe()
height male earn earnk age education
count 1816.000000 1816.000000 1816.000000 1816.000000 1816.000000 1814.000000
mean 66.568833 0.371696 21147.296256 21.147296 42.934471 13.235391
std 3.831822 0.483391 22531.765123 22.531765 17.161672 2.556638
min 57.000000 0.000000 0.000000 0.000000 18.000000 2.000000
25% 64.000000 0.000000 6000.000000 6.000000 29.000000 12.000000
50% 66.000000 0.000000 16000.000000 16.000000 39.000000 12.000000
75% 69.250000 1.000000 27000.000000 27.000000 56.000000 15.000000
max 82.000000 1.000000 400000.000000 400.000000 91.000000 18.000000

Normal linear regression on dollars

Code
fit_0 = smf.ols("earn ~ height", data=earnings).fit()
fit_0_bayes = bayes_lm("earn ~ height", data=earnings, draws=1000, prior_scale=10.0, seed=7783)
fit_0.summary()
OLS Regression Results
Dep. Variable: earn R-squared: 0.074
Model: OLS Adj. R-squared: 0.073
Method: Least Squares F-statistic: 144.1
Date: Sun, 31 May 2026 Prob (F-statistic): 5.42e-32
Time: 23:01:48 Log-Likelihood: -20708.
No. Observations: 1816 AIC: 4.142e+04
Df Residuals: 1814 BIC: 4.143e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -8.503e+04 8860.650 -9.596 0.000 -1.02e+05 -6.76e+04
height 1594.9598 132.885 12.003 0.000 1334.336 1855.584
Omnibus: 1886.197 Durbin-Watson: 1.904
Prob(Omnibus): 0.000 Jarque-Bera (JB): 234212.585
Skew: 4.768 Prob(JB): 0.00
Kurtosis: 57.812 Cond. No. 1.16e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
sims_0 = fit_0_bayes.beta_draws
keep = earnings["earn"] <= 200_000
xs = np.linspace(earnings.height.min(), earnings.height.max(), 200)

fig, ax = plt.subplots()
ax.scatter(earnings.loc[keep, "height_jitter"], earnings.loc[keep, "earn"], s=12, alpha=0.55)
for b in sims_0[:10]:
    ax.plot(xs, b[0] + b[1] * xs, color="gray", alpha=0.45, linewidth=0.8)
ax.plot(xs, fit_0.params["Intercept"] + fit_0.params["height"] * xs, color="black")
ax.set_xlabel("height")
ax.set_ylabel("earnings")
ax.set_title("Fitted linear model")
Text(0.5, 1.0, 'Fitted linear model')

The intercept is not meaningful: the model is being extrapolated to height zero. That is the point of the next display.

Code
fig, ax = plt.subplots()
ax.scatter(earnings.loc[keep, "height_jitter"], earnings.loc[keep, "earn"], s=12, alpha=0.55)
for b in sims_0[:10]:
    ax.plot(np.linspace(0, earnings.height.max(), 200), b[0] + b[1] * np.linspace(0, earnings.height.max(), 200), color="gray", alpha=0.45, linewidth=0.8)
ax.plot(np.linspace(0, earnings.height.max(), 200), fit_0.params["Intercept"] + fit_0.params["height"] * np.linspace(0, earnings.height.max(), 200), color="black")
ax.axhline(0, color="gray", linewidth=0.8)
ax.set_xlim(0, earnings.height.max())
ax.set_ylim(-100_000, 220_000)
ax.set_xlabel("height")
ax.set_ylabel("earnings")
ax.set_title("x-axis extended to 0")
Text(0.5, 1.0, 'x-axis extended to 0')

Earnings in thousands of dollars

Scaling the outcome rescales the coefficients but does not otherwise change the fitted values.

Code
fit_1 = smf.ols("earnk ~ height", data=earnings).fit()
pd.DataFrame({
    "dollar_model": fit_0.params,
    "thousand_model_times_1000": fit_1.params * 1000,
})
dollar_model thousand_model_times_1000
Intercept -85027.312609 -85027.312609
height 1594.959754 1594.959754

Add sex indicator and interaction

Code
fit_2 = smf.ols("earnk ~ height + male", data=earnings).fit()
fit_3 = smf.ols("earnk ~ height * male", data=earnings).fit()
fit_2.summary()
OLS Regression Results
Dep. Variable: earnk R-squared: 0.100
Model: OLS Adj. R-squared: 0.099
Method: Least Squares F-statistic: 100.3
Date: Sun, 31 May 2026 Prob (F-statistic): 4.88e-42
Time: 23:01:48 Log-Likelihood: -8137.7
No. Observations: 1816 AIC: 1.628e+04
Df Residuals: 1813 BIC: 1.630e+04
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -25.8722 11.962 -2.163 0.031 -49.332 -2.412
height 0.6470 0.185 3.493 0.000 0.284 1.010
male 10.6327 1.468 7.241 0.000 7.753 13.512
Omnibus: 1902.421 Durbin-Watson: 1.895
Prob(Omnibus): 0.000 Jarque-Bera (JB): 249218.549
Skew: 4.821 Prob(JB): 0.00
Kurtosis: 59.575 Cond. No. 1.59e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.59e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
fit_3.summary()
OLS Regression Results
Dep. Variable: earnk R-squared: 0.101
Model: OLS Adj. R-squared: 0.100
Method: Least Squares F-statistic: 67.85
Date: Sun, 31 May 2026 Prob (F-statistic): 1.39e-41
Time: 23:01:48 Log-Likelihood: -8136.3
No. Observations: 1816 AIC: 1.628e+04
Df Residuals: 1812 BIC: 1.630e+04
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -8.4872 15.865 -0.535 0.593 -39.603 22.629
height 0.3774 0.246 1.535 0.125 -0.105 0.860
male -31.5146 25.326 -1.244 0.214 -81.186 18.157
height:male 0.6229 0.374 1.667 0.096 -0.110 1.356
Omnibus: 1899.817 Durbin-Watson: 1.894
Prob(Omnibus): 0.000 Jarque-Bera (JB): 247552.972
Skew: 4.811 Prob(JB): 0.00
Kurtosis: 59.383 Cond. No. 4.05e+03


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.05e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
coef2 = fit_2.params * 1000
coef3 = fit_3.params * 1000
xgrid = np.linspace(earnings.height.min(), earnings.height.max(), 200)

fig, axes = plt.subplots(1, 2, figsize=(11, 4), sharey=True)
axes[0].plot(xgrid, coef2["Intercept"] + coef2["height"] * xgrid, color="tab:red", label="women")
axes[0].plot(xgrid, coef2["Intercept"] + coef2["male"] + coef2["height"] * xgrid, color="tab:blue", label="men")
axes[0].set_title("Parallel lines")

axes[1].plot(xgrid, coef3["Intercept"] + coef3["height"] * xgrid, color="tab:red", label="women")
axes[1].plot(xgrid, coef3["Intercept"] + coef3["male"] + (coef3["height"] + coef3["height:male"]) * xgrid, color="tab:blue", label="men")
axes[1].set_title("Interaction")

for ax in axes:
    ax.set_xlabel("height")
    ax.set_ylabel("predicted earnings")
    ax.legend(frameon=False)

Log earnings models

The dollar-scale normal model is a poor match for a strongly skewed outcome with many low values. The R page therefore compares models for log(earn) among respondents with positive earnings.

Code
pos = earnings[earnings.earn > 0].copy()
pos["log_earn"] = np.log(pos["earn"])
pos["log10_earn"] = np.log10(pos["earn"])
pos["log_height"] = np.log(pos["height"])
pos["z_height"] = (pos["height"] - pos["height"].mean()) / pos["height"].std()

logmodel_1 = smf.ols("log_earn ~ height", data=pos).fit()
logmodel_1_bayes = bayes_lm("log_earn ~ height", data=pos, draws=1000, prior_scale=10.0, seed=7784)
log10model_1 = smf.ols("log10_earn ~ height", data=pos).fit()
logmodel_2 = smf.ols("log_earn ~ height + male", data=pos).fit()
loglogmodel_2 = smf.ols("log_earn ~ log_height + male", data=pos).fit()
logmodel_3 = smf.ols("log_earn ~ height * male", data=pos).fit()
logmodel_3a = smf.ols("log_earn ~ z_height * male", data=pos).fit()

pd.DataFrame({
    "log height": logmodel_1.params,
    "log height+male": logmodel_2.params,
}).round(3)
log height log height+male
Intercept 5.909 7.976
height 0.057 0.024
male NaN 0.372
Code
sims_log = logmodel_1_bayes.beta_draws
xs = np.linspace(pos.height.min(), pos.height.max(), 200)
fig, axes = plt.subplots(1, 2, figsize=(11, 4))

axes[0].scatter(pos.height_jitter, pos.log_earn, s=12, alpha=0.55)
for b in sims_log[:10]:
    axes[0].plot(xs, b[0] + b[1] * xs, color="gray", alpha=0.45, linewidth=0.8)
axes[0].plot(xs, logmodel_1.params["Intercept"] + logmodel_1.params["height"] * xs, color="black")
axes[0].set_xlabel("height")
axes[0].set_ylabel("log(earnings)")
axes[0].set_title("Log regression on log scale")

small = pos[pos.earn <= 200_000]
axes[1].scatter(small.height_jitter, small.earn, s=12, alpha=0.55)
for b in sims_log[:10]:
    axes[1].plot(xs, np.exp(b[0] + b[1] * xs), color="gray", alpha=0.45, linewidth=0.8)
axes[1].plot(xs, np.exp(logmodel_1.params["Intercept"] + logmodel_1.params["height"] * xs), color="black")
axes[1].set_xlabel("height")
axes[1].set_ylabel("earnings")
axes[1].set_title("Same fit on dollar scale")
Text(0.5, 1.0, 'Same fit on dollar scale')

Lightweight posterior predictive checks

The Bayesian helper supplies a direct analogue of posterior_predict(): draw coefficients and residual scale from the Gaussian linear-model posterior, then draw replicated outcomes.

Code
yrep_0 = fit_0_bayes.predict(seed=1)[:200]
yrep_log_1 = logmodel_1_bayes.predict(seed=2)[:200]

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
for draw in yrep_0[:80]:
    kde = gaussian_kde(draw)
    grid = np.linspace(np.percentile(earnings.earn, 1), np.percentile(earnings.earn, 99), 200)
    axes[0].plot(grid, kde(grid), color="gray", alpha=0.08)
axes[0].plot(grid, gaussian_kde(earnings.earn)(grid), color="black")
axes[0].set_title("earn")

log_grid = np.linspace(pos.log_earn.min(), pos.log_earn.max(), 200)
for draw in yrep_log_1[:80]:
    kde = gaussian_kde(draw)
    axes[1].plot(log_grid, kde(log_grid), color="gray", alpha=0.08)
axes[1].plot(log_grid, gaussian_kde(pos.log_earn)(log_grid), color="black")
axes[1].set_title("log(earn)")
Text(0.5, 1.0, 'log(earn)')

CmdStanPy analogue

Code
from cmdstanpy import CmdStanModel

stan_code = """
data {
  int<lower=1> N;
  matrix[N, 3] X;
  vector[N] y;
}
parameters {
  vector[3] beta;
  real<lower=0> sigma;
}
model {
  beta ~ normal(0, 10);
  sigma ~ exponential(1);
  y ~ normal(X * beta, sigma);
}
generated quantities {
  vector[N] y_rep;
  for (n in 1:N) y_rep[n] = normal_rng(dot_product(row(X, n), beta), sigma);
}
"""
Path("earnings_height_male.stan").write_text(stan_code)
X = np.column_stack([np.ones(len(pos)), pos.height.to_numpy(), pos.male.to_numpy()])
model = CmdStanModel(stan_file="earnings_height_male.stan")
fit = model.sample(data={"N": len(pos), "X": X, "y": pos.log_earn.to_numpy()}, seed=7783)
fit.summary().loc[["beta[1]", "beta[2]", "beta[3]", "sigma"]]