Source: Earnings/earnings_bootstrap.Rmd

This example uses the nonparametric bootstrap to approximate the sampling distribution of a ratio: median earnings among women divided by median earnings among men. The bootstrap resamples rows of the survey data, preserving the joint distribution of earnings and the sex indicator.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
rng = np.random.default_rng(20260531)
earnings.head()
height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age
0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45
1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58
2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29
3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57
4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91
Code
earnings[["earn", "male", "height", "education", "age"]].describe()
earn male height education age
count 1816.000000 1816.000000 1816.000000 1814.000000 1816.000000
mean 21147.296256 0.371696 66.568833 13.235391 42.934471
std 22531.765123 0.483391 3.831822 2.556638 17.161672
min 0.000000 0.000000 57.000000 2.000000 18.000000
25% 6000.000000 0.000000 64.000000 12.000000 29.000000
50% 16000.000000 0.000000 66.000000 12.000000 39.000000
75% 27000.000000 1.000000 69.250000 15.000000 56.000000
max 400000.000000 1.000000 82.000000 18.000000 91.000000

Ratio in the observed sample

Code
def female_to_male_median_ratio(data):
    female_median = data.loc[data["male"] == 0, "earn"].median()
    male_median = data.loc[data["male"] == 1, "earn"].median()
    return female_median / male_median

observed_ratio = female_to_male_median_ratio(earnings)
observed_ratio
np.float64(0.6)

The statistic is deliberately simple. The bootstrap question is: if this dataset were a stand-in for the population, how much would this ratio move under repeated samples of the same size?

A single bootstrap resample

Code
n = len(earnings)
boot_index = rng.integers(0, n, size=n)
earnings_boot = earnings.iloc[boot_index]
ratio_boot = female_to_male_median_ratio(earnings_boot)
ratio_boot
np.float64(0.6)

Bootstrap distribution

Code
def bootstrap_ratio(data, rng):
    boot_index = rng.integers(0, len(data), size=len(data))
    return female_to_male_median_ratio(data.iloc[boot_index])

n_sims = 10_000
boot_ratios = np.array([bootstrap_ratio(earnings, rng) for _ in range(n_sims)])

pd.Series(boot_ratios).describe(percentiles=[0.025, 0.25, 0.5, 0.75, 0.975])
count    10000.000000
mean         0.587055
std          0.027158
min          0.444444
2.5%         0.520000
25%          0.580000
50%          0.600000
75%          0.600000
97.5%        0.600000
max          0.714286
dtype: float64
Code
fig, ax = plt.subplots()
ax.hist(boot_ratios, bins=45, color="0.75", edgecolor="white")
ax.axvline(observed_ratio, color="black", linewidth=2, label="observed ratio")
ax.set_xlabel("median(earn | female) / median(earn | male)")
ax.set_ylabel("bootstrap simulations")
ax.legend(frameon=False)

Code
bootstrap_se = boot_ratios.std(ddof=1)
bootstrap_interval = np.quantile(boot_ratios, [0.025, 0.975])
bootstrap_se, bootstrap_interval
(np.float64(0.02715803087636638), array([0.52, 0.6 ]))

The standard deviation of the simulated ratios is the bootstrap standard error. The percentile interval is not a model-based confidence interval; it is the direct middle 95% of the resampled statistics.