Earnings compound model

Source: Earnings/earnings_compound.Rmd

Annual earnings have a point mass at zero and a long right tail among people with positive earnings. The R example fits this as a two-part model: a logistic regression for whether earnings are positive, followed by a regression for log earnings conditional on positive earnings. This Python port uses lightweight Bayesian GLM helpers for the two fitted models and posterior predictive simulation.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from python.bayes_glm import bayes_lm, bayes_logit

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
earnings["positive_earn"] = (earnings["earn"] > 0).astype(int)
pos = earnings.loc[earnings["positive_earn"] == 1].copy()
pos["log_earn"] = np.log(pos["earn"])
rng = np.random.default_rng(15015)
earnings.head()
height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age positive_earn
0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45 1
1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58 1
2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29 1
3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57 1
4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91 1
Code
earnings[["earn", "positive_earn", "height", "male"]].describe()
earn positive_earn height male
count 1816.000000 1816.000000 1816.000000 1816.000000
mean 21147.296256 0.897026 66.568833 0.371696
std 22531.765123 0.304008 3.831822 0.483391
min 0.000000 0.000000 57.000000 0.000000
25% 6000.000000 1.000000 64.000000 0.000000
50% 16000.000000 1.000000 66.000000 0.000000
75% 27000.000000 1.000000 69.250000 1.000000
max 400000.000000 1.000000 82.000000 1.000000

Part 1: probability of nonzero earnings

Code
fit_positive = bayes_logit("positive_earn ~ height + male", data=earnings, draws=4000, prior_scale=2.5, seed=15015)
fit_positive.summary().round(3)
mean sd q5 q50 q95
Intercept -1.832 1.553 -4.470 -1.823 0.686
height 0.055 0.024 0.016 0.055 0.097
male 1.731 0.294 1.251 1.727 2.220
Code
positive_summary = fit_positive.summary().round(3)
positive_summary["odds_ratio_mean"] = np.exp(fit_positive.beta_draws).mean(axis=0)
positive_summary.round(3)
mean sd q5 q50 q95 odds_ratio_mean
Intercept -1.832 1.553 -4.470 -1.823 0.686 0.528
height 0.055 0.024 0.016 0.055 0.097 1.057
male 1.731 0.294 1.251 1.727 2.220 5.896

Part 2: log earnings among earners

Code
fit_logearn = bayes_lm("log_earn ~ height + male", data=pos, draws=4000, prior_scale=10.0, seed=15016)
fit_logearn.summary().round(3)
mean sd q5 q50 q95
Intercept 7.960 0.505 7.139 7.963 8.786
height 0.024 0.008 0.011 0.024 0.037
male 0.369 0.060 0.272 0.370 0.468
sigma 0.868 0.015 0.842 0.867 0.894

The first model predicts the chance of being above zero. The second predicts the conditional distribution on the log scale for respondents whose earnings are positive.

Predictions for a new person

The R page predicts earnings for a new person with height = 68 and male = 0. We reproduce the compound prediction by drawing coefficients for both models, drawing a Bernoulli indicator from the logistic part, and drawing log earnings from the normal linear model.

Code
new_person = pd.DataFrame({"height": [68.0], "male": [0]})

p_positive = fit_positive.epred(new_person)[:, 0]
positive_draw = rng.binomial(1, p_positive)
log_earn_draw = fit_logearn.predict(new_person, seed=15017)[:, 0]
earn_draw = np.where(positive_draw == 1, np.exp(log_earn_draw), 0.0)

pd.Series(earn_draw).describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])
count      4000.000000
mean      18727.113870
std       21485.059911
min           0.000000
5%            0.000000
25%        6063.584433
50%       12957.720822
75%       24378.923823
95%       56944.308405
max      330269.465321
dtype: float64
Code
fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(p_positive, bins=40, color="0.75", edgecolor="white")
axes[0].set_xlabel("Pr(earn > 0 | height=68, male=0)")
axes[0].set_ylabel("draws")

positive_pred = earn_draw[earn_draw > 0]
axes[1].hist(positive_pred, bins=50, color="0.75", edgecolor="white")
axes[1].set_xlabel("predicted annual earnings, conditional draws above zero")
axes[1].set_ylabel("draws")
Text(0, 0.5, 'draws')

Code
compound_summary = pd.Series({
    "Pr(predicted earn = 0)": np.mean(earn_draw == 0),
    "median predicted earn": np.median(earn_draw),
    "mean predicted earn": np.mean(earn_draw),
    "90% interval lower": np.quantile(earn_draw, 0.05),
    "90% interval upper": np.quantile(earn_draw, 0.95),
})
compound_summary
Pr(predicted earn = 0)        0.118500
median predicted earn     12957.720822
mean predicted earn       18727.113870
90% interval lower            0.000000
90% interval upper        56944.308405
dtype: float64

The compound model is more appropriate than a single normal regression on dollars because it separates the structural spike at zero from the skewed positive part of the distribution.