Earnings bootstrap

Source: Earnings/earnings_bootstrap.Rmd

This example uses the nonparametric bootstrap to approximate the sampling distribution of a ratio: median earnings among women divided by median earnings among men. The bootstrap resamples rows of the survey data, preserving the joint distribution of earnings and the sex indicator.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
rng = np.random.default_rng(20260531)
earnings.head()

	height	weight	male	earn	earnk	ethnicity	education	mother_education	father_education	walk	exercise	smokenow	tense	angry	age
0	74	210.0	1	50000.0	50.0	White	16.0	16.0	16.0	3	3	2.0	0.0	0.0	45
1	66	125.0	0	60000.0	60.0	White	16.0	16.0	16.0	6	5	1.0	0.0	0.0	58
2	64	126.0	0	30000.0	30.0	White	16.0	16.0	16.0	8	1	2.0	1.0	1.0	29
3	65	200.0	0	25000.0	25.0	White	17.0	17.0	NaN	8	1	2.0	0.0	0.0	57
4	63	110.0	0	50000.0	50.0	Other	16.0	16.0	16.0	5	6	2.0	0.0	0.0	91

Code

earnings[["earn", "male", "height", "education", "age"]].describe()

	earn	male	height	education	age
count	1816.000000	1816.000000	1816.000000	1814.000000	1816.000000
mean	21147.296256	0.371696	66.568833	13.235391	42.934471
std	22531.765123	0.483391	3.831822	2.556638	17.161672
min	0.000000	0.000000	57.000000	2.000000	18.000000
25%	6000.000000	0.000000	64.000000	12.000000	29.000000
50%	16000.000000	0.000000	66.000000	12.000000	39.000000
75%	27000.000000	1.000000	69.250000	15.000000	56.000000
max	400000.000000	1.000000	82.000000	18.000000	91.000000

Ratio in the observed sample

Code

def female_to_male_median_ratio(data):
    female_median = data.loc[data["male"] == 0, "earn"].median()
    male_median = data.loc[data["male"] == 1, "earn"].median()
    return female_median / male_median

observed_ratio = female_to_male_median_ratio(earnings)
observed_ratio

np.float64(0.6)

The statistic is deliberately simple. The bootstrap question is: if this dataset were a stand-in for the population, how much would this ratio move under repeated samples of the same size?

A single bootstrap resample

Code

n = len(earnings)
boot_index = rng.integers(0, n, size=n)
earnings_boot = earnings.iloc[boot_index]
ratio_boot = female_to_male_median_ratio(earnings_boot)
ratio_boot

np.float64(0.6)

Bootstrap distribution

Code

def bootstrap_ratio(data, rng):
    boot_index = rng.integers(0, len(data), size=len(data))
    return female_to_male_median_ratio(data.iloc[boot_index])

n_sims = 10_000
boot_ratios = np.array([bootstrap_ratio(earnings, rng) for _ in range(n_sims)])

pd.Series(boot_ratios).describe(percentiles=[0.025, 0.25, 0.5, 0.75, 0.975])

count    10000.000000
mean         0.587055
std          0.027158
min          0.444444
2.5%         0.520000
25%          0.580000
50%          0.600000
75%          0.600000
97.5%        0.600000
max          0.714286
dtype: float64

Code

fig, ax = plt.subplots()
ax.hist(boot_ratios, bins=45, color="0.75", edgecolor="white")
ax.axvline(observed_ratio, color="black", linewidth=2, label="observed ratio")
ax.set_xlabel("median(earn | female) / median(earn | male)")
ax.set_ylabel("bootstrap simulations")
ax.legend(frameon=False)

Code

bootstrap_se = boot_ratios.std(ddof=1)
bootstrap_interval = np.quantile(boot_ratios, [0.025, 0.975])
bootstrap_se, bootstrap_interval

(np.float64(0.02715803087636638), array([0.52, 0.6 ]))

The standard deviation of the simulated ratios is the bootstrap standard error. The percentile interval is not a model-based confidence interval; it is the direct middle 95% of the resampled statistics.

# Earnings bootstrap Source: `Earnings/earnings_bootstrap.Rmd` This example uses the nonparametric bootstrap to approximate the sampling distribution of a ratio: median earnings among women divided by median earnings among men. The bootstrap resamples rows of the survey data, preserving the joint distribution of earnings and the sex indicator. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd import matplotlib.pyplot as plt root = Path("../../ROS-Examples") earnings = pd.read_csv(root / "Earnings/data/earnings.csv") rng = np.random.default_rng(20260531) earnings.head() ``` ```{python} earnings[["earn", "male", "height", "education", "age"]].describe() ``` ## Ratio in the observed sample ```{python} def female_to_male_median_ratio(data): female_median = data.loc[data["male"] == 0, "earn"].median() male_median = data.loc[data["male"] == 1, "earn"].median() return female_median / male_median observed_ratio = female_to_male_median_ratio(earnings) observed_ratio ``` The statistic is deliberately simple. The bootstrap question is: if this dataset were a stand-in for the population, how much would this ratio move under repeated samples of the same size? ## A single bootstrap resample ```{python} n = len(earnings) boot_index = rng.integers(0, n, size=n) earnings_boot = earnings.iloc[boot_index] ratio_boot = female_to_male_median_ratio(earnings_boot) ratio_boot ``` ## Bootstrap distribution ```{python} def bootstrap_ratio(data, rng): boot_index = rng.integers(0, len(data), size=len(data)) return female_to_male_median_ratio(data.iloc[boot_index]) n_sims = 10_000 boot_ratios = np.array([bootstrap_ratio(earnings, rng) for _ in range(n_sims)]) pd.Series(boot_ratios).describe(percentiles=[0.025, 0.25, 0.5, 0.75, 0.975]) ``` ```{python} fig, ax = plt.subplots() ax.hist(boot_ratios, bins=45, color="0.75", edgecolor="white") ax.axvline(observed_ratio, color="black", linewidth=2, label="observed ratio") ax.set_xlabel("median(earn | female) / median(earn | male)") ax.set_ylabel("bootstrap simulations") ax.legend(frameon=False) ``` ```{python} bootstrap_se = boot_ratios.std(ddof=1) bootstrap_interval = np.quantile(boot_ratios, [0.025, 0.975]) bootstrap_se, bootstrap_interval ``` The standard deviation of the simulated ratios is the bootstrap standard error. The percentile interval is not a model-based confidence interval; it is the direct middle 95% of the resampled statistics.