Earnings compound model

Source: Earnings/earnings_compound.Rmd

Annual earnings have a point mass at zero and a long right tail among people with positive earnings. The R example fits this as a two-part model: a logistic regression for whether earnings are positive, followed by a regression for log earnings conditional on positive earnings. This Python port uses lightweight Bayesian GLM helpers for the two fitted models and posterior predictive simulation.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from python.bayes_glm import bayes_lm, bayes_logit

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
earnings["positive_earn"] = (earnings["earn"] > 0).astype(int)
pos = earnings.loc[earnings["positive_earn"] == 1].copy()
pos["log_earn"] = np.log(pos["earn"])
rng = np.random.default_rng(15015)
earnings.head()

	height	weight	male	earn	earnk	ethnicity	education	mother_education	father_education	walk	exercise	smokenow	tense	angry	age	positive_earn
0	74	210.0	1	50000.0	50.0	White	16.0	16.0	16.0	3	3	2.0	0.0	0.0	45	1
1	66	125.0	0	60000.0	60.0	White	16.0	16.0	16.0	6	5	1.0	0.0	0.0	58	1
2	64	126.0	0	30000.0	30.0	White	16.0	16.0	16.0	8	1	2.0	1.0	1.0	29	1
3	65	200.0	0	25000.0	25.0	White	17.0	17.0	NaN	8	1	2.0	0.0	0.0	57	1
4	63	110.0	0	50000.0	50.0	Other	16.0	16.0	16.0	5	6	2.0	0.0	0.0	91	1

Code

earnings[["earn", "positive_earn", "height", "male"]].describe()

	earn	positive_earn	height	male
count	1816.000000	1816.000000	1816.000000	1816.000000
mean	21147.296256	0.897026	66.568833	0.371696
std	22531.765123	0.304008	3.831822	0.483391
min	0.000000	0.000000	57.000000	0.000000
25%	6000.000000	1.000000	64.000000	0.000000
50%	16000.000000	1.000000	66.000000	0.000000
75%	27000.000000	1.000000	69.250000	1.000000
max	400000.000000	1.000000	82.000000	1.000000

Part 1: probability of nonzero earnings

Code

fit_positive = bayes_logit("positive_earn ~ height + male", data=earnings, draws=4000, prior_scale=2.5, seed=15015)
fit_positive.summary().round(3)

	mean	sd	q5	q50	q95
Intercept	-1.832	1.553	-4.470	-1.823	0.686
height	0.055	0.024	0.016	0.055	0.097
male	1.731	0.294	1.251	1.727	2.220

Code

positive_summary = fit_positive.summary().round(3)
positive_summary["odds_ratio_mean"] = np.exp(fit_positive.beta_draws).mean(axis=0)
positive_summary.round(3)

	mean	sd	q5	q50	q95	odds_ratio_mean
Intercept	-1.832	1.553	-4.470	-1.823	0.686	0.528
height	0.055	0.024	0.016	0.055	0.097	1.057
male	1.731	0.294	1.251	1.727	2.220	5.896

Part 2: log earnings among earners

Code

fit_logearn = bayes_lm("log_earn ~ height + male", data=pos, draws=4000, prior_scale=10.0, seed=15016)
fit_logearn.summary().round(3)

	mean	sd	q5	q50	q95
Intercept	7.960	0.505	7.139	7.963	8.786
height	0.024	0.008	0.011	0.024	0.037
male	0.369	0.060	0.272	0.370	0.468
sigma	0.868	0.015	0.842	0.867	0.894

The first model predicts the chance of being above zero. The second predicts the conditional distribution on the log scale for respondents whose earnings are positive.

Predictions for a new person

The R page predicts earnings for a new person with height = 68 and male = 0. We reproduce the compound prediction by drawing coefficients for both models, drawing a Bernoulli indicator from the logistic part, and drawing log earnings from the normal linear model.

Code

new_person = pd.DataFrame({"height": [68.0], "male": [0]})

p_positive = fit_positive.epred(new_person)[:, 0]
positive_draw = rng.binomial(1, p_positive)
log_earn_draw = fit_logearn.predict(new_person, seed=15017)[:, 0]
earn_draw = np.where(positive_draw == 1, np.exp(log_earn_draw), 0.0)

pd.Series(earn_draw).describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95])

count      4000.000000
mean      18727.113870
std       21485.059911
min           0.000000
5%            0.000000
25%        6063.584433
50%       12957.720822
75%       24378.923823
95%       56944.308405
max      330269.465321
dtype: float64

Code

fig, axes = plt.subplots(1, 2, figsize=(11, 4))
axes[0].hist(p_positive, bins=40, color="0.75", edgecolor="white")
axes[0].set_xlabel("Pr(earn > 0 | height=68, male=0)")
axes[0].set_ylabel("draws")

positive_pred = earn_draw[earn_draw > 0]
axes[1].hist(positive_pred, bins=50, color="0.75", edgecolor="white")
axes[1].set_xlabel("predicted annual earnings, conditional draws above zero")
axes[1].set_ylabel("draws")

Text(0, 0.5, 'draws')

Code

compound_summary = pd.Series({
    "Pr(predicted earn = 0)": np.mean(earn_draw == 0),
    "median predicted earn": np.median(earn_draw),
    "mean predicted earn": np.mean(earn_draw),
    "90% interval lower": np.quantile(earn_draw, 0.05),
    "90% interval upper": np.quantile(earn_draw, 0.95),
})
compound_summary

Pr(predicted earn = 0)        0.118500
median predicted earn     12957.720822
mean predicted earn       18727.113870
90% interval lower            0.000000
90% interval upper        56944.308405
dtype: float64

The compound model is more appropriate than a single normal regression on dollars because it separates the structural spike at zero from the skewed positive part of the distribution.

# Earnings compound model Source: `Earnings/earnings_compound.Rmd` Annual earnings have a point mass at zero and a long right tail among people with positive earnings. The R example fits this as a two-part model: a logistic regression for whether earnings are positive, followed by a regression for log earnings conditional on positive earnings. This Python port uses lightweight Bayesian GLM helpers for the two fitted models and posterior predictive simulation. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd import matplotlib.pyplot as plt from python.bayes_glm import bayes_lm, bayes_logit root = Path("../../ROS-Examples") earnings = pd.read_csv(root / "Earnings/data/earnings.csv") earnings["positive_earn"] = (earnings["earn"] > 0).astype(int) pos = earnings.loc[earnings["positive_earn"] == 1].copy() pos["log_earn"] = np.log(pos["earn"]) rng = np.random.default_rng(15015) earnings.head() ``` ```{python} earnings[["earn", "positive_earn", "height", "male"]].describe() ``` ## Part 1: probability of nonzero earnings ```{python} fit_positive = bayes_logit("positive_earn ~ height + male", data=earnings, draws=4000, prior_scale=2.5, seed=15015) fit_positive.summary().round(3) ``` ```{python} positive_summary = fit_positive.summary().round(3) positive_summary["odds_ratio_mean"] = np.exp(fit_positive.beta_draws).mean(axis=0) positive_summary.round(3) ``` ## Part 2: log earnings among earners ```{python} fit_logearn = bayes_lm("log_earn ~ height + male", data=pos, draws=4000, prior_scale=10.0, seed=15016) fit_logearn.summary().round(3) ``` The first model predicts the chance of being above zero. The second predicts the conditional distribution on the log scale for respondents whose earnings are positive. ## Predictions for a new person The R page predicts earnings for a new person with `height = 68` and `male = 0`. We reproduce the compound prediction by drawing coefficients for both models, drawing a Bernoulli indicator from the logistic part, and drawing log earnings from the normal linear model. ```{python} new_person = pd.DataFrame({"height": [68.0], "male": [0]}) p_positive = fit_positive.epred(new_person)[:, 0] positive_draw = rng.binomial(1, p_positive) log_earn_draw = fit_logearn.predict(new_person, seed=15017)[:, 0] earn_draw = np.where(positive_draw == 1, np.exp(log_earn_draw), 0.0) pd.Series(earn_draw).describe(percentiles=[0.05, 0.25, 0.5, 0.75, 0.95]) ``` ```{python} fig, axes = plt.subplots(1, 2, figsize=(11, 4)) axes[0].hist(p_positive, bins=40, color="0.75", edgecolor="white") axes[0].set_xlabel("Pr(earn > 0 | height=68, male=0)") axes[0].set_ylabel("draws") positive_pred = earn_draw[earn_draw > 0] axes[1].hist(positive_pred, bins=50, color="0.75", edgecolor="white") axes[1].set_xlabel("predicted annual earnings, conditional draws above zero") axes[1].set_ylabel("draws") ``` ```{python} compound_summary = pd.Series({ "Pr(predicted earn = 0)": np.mean(earn_draw == 0), "median predicted earn": np.median(earn_draw), "mean predicted earn": np.mean(earn_draw), "90% interval lower": np.quantile(earn_draw, 0.05), "90% interval upper": np.quantile(earn_draw, 0.95), }) compound_summary ``` The compound model is more appropriate than a single normal regression on dollars because it separates the structural spike at zero from the skewed positive part of the distribution.