Elections Economy: coverage simulation

Source: ElectionsEconomy/hibbs_coverage.Rmd

This example checks a model-fitting procedure by fake-data simulation. We create data from a known regression line, fit the model back to each fake dataset, and record how often nominal intervals for the slope contain the true value.

The original page uses rstanarm::stan_glm. Here the model is the same Gaussian linear regression, fit with statsmodels OLS. For this simple problem, the usual t intervals and the flat-prior Bayesian intervals are nearly the same.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

root = Path("../../ROS-Examples")
hibbs = pd.read_csv(root / "ElectionsEconomy/data/hibbs.dat", sep=r"\s+")
rng = np.random.default_rng(1)
hibbs.head()

	year	growth	vote	inc_party_candidate	other_candidate
0	1952	2.40	44.60	Stevenson	Eisenhower
1	1956	2.89	57.76	Eisenhower	Stevenson
2	1960	0.85	49.91	Nixon	Kennedy
3	1964	4.21	61.34	Johnson	Goldwater
4	1968	3.02	49.60	Humphrey	Nixon

Step 1: create the pretend world

Code

a = 46.2
b = 3.1
sigma = 3.8
x = hibbs["growth"].to_numpy()
n = len(x)
pretend = pd.Series({"intercept": a, "slope": b, "sigma": sigma, "n": n})
pretend

intercept    46.2
slope         3.1
sigma         3.8
n            16.0
dtype: float64

Step 2: simulate one fake dataset

Code

y = a + b * x + rng.normal(0, sigma, size=n)
fake = pd.DataFrame({"x": x, "y": y})
fake.head()

	x	y
0	2.40	54.953220
1	2.89	58.281149
2	0.85	50.090661
3	4.21	54.299003
4	3.02	59.002352

Step 3: fit the model and check intervals

Code

fit = smf.ols("y ~ x", data=fake).fit()
fit.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.758
Model:	OLS	Adj. R-squared:	0.740
Method:	Least Squares	F-statistic:	43.79
Date:	Sun, 31 May 2026	Prob (F-statistic):	1.16e-05
Time:	23:02:00	Log-Likelihood:	-35.690
No. Observations:	16	AIC:	75.38
Df Residuals:	14	BIC:	76.93
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	46.9749	1.037	45.278	0.000	44.750	49.200
x	2.9471	0.445	6.617	0.000	1.992	3.902

Omnibus:	2.138	Durbin-Watson:	2.423
Prob(Omnibus):	0.343	Jarque-Bera (JB):	1.558
Skew:	-0.731	Prob(JB):	0.459
Kurtosis:	2.552	Cond. No.	4.54

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Code

def slope_interval(model, level):
    alpha = 1 - level
    lo, hi = model.conf_int(alpha=alpha).loc["x"]
    return float(lo), float(hi)

for level in [0.50, 0.90]:
    lo, hi = slope_interval(fit, level)
    print(f"{int(level * 100)}% interval: ({lo:.2f}, {hi:.2f}); covers true slope? {lo < b < hi}")

50% interval: (2.64, 3.26); covers true slope? True
90% interval: (2.16, 3.73); covers true slope? True

Step 4: repeat the fake-data experiment

Code

def one_coverage_sim(seed):
    local_rng = np.random.default_rng(seed)
    y_sim = a + b * x + local_rng.normal(0, sigma, size=n)
    model = smf.ols("y ~ x", data=pd.DataFrame({"x": x, "y": y_sim})).fit()

    covers = {}
    for level in [0.50, 0.68, 0.90, 0.95]:
        lo, hi = slope_interval(model, level)
        covers[f"ci_{int(level * 100)}"] = lo < b < hi

    slope_hat = model.params["x"]
    slope_se = model.bse["x"]
    covers["se_68"] = abs(b - slope_hat) < slope_se
    covers["se_95"] = abs(b - slope_hat) < 2 * slope_se
    return covers

n_fake = 1000
coverage = pd.DataFrame([one_coverage_sim(s) for s in range(1, n_fake + 1)])
coverage.mean()

ci_50    0.516
ci_68    0.703
ci_90    0.906
ci_95    0.947
se_68    0.691
se_95    0.939
dtype: float64

Code

coverage_summary = coverage.mean().rename({
    "ci_50": "50% interval",
    "ci_68": "68% interval",
    "ci_90": "90% interval",
    "ci_95": "95% interval",
    "se_68": "estimate +/- 1 se",
    "se_95": "estimate +/- 2 se",
})
coverage_summary.to_frame("empirical coverage")

	empirical coverage
50% interval	0.516
68% interval	0.703
90% interval	0.906
95% interval	0.947
estimate +/- 1 se	0.691
estimate +/- 2 se	0.939

Because the fake data were generated from the same linear-normal model that is fit back to the simulations, empirical coverage should be close to the nominal interval levels, up to Monte Carlo error from using only 1000 fake datasets.

# Elections Economy: coverage simulation Source: `ElectionsEconomy/hibbs_coverage.Rmd` This example checks a model-fitting procedure by fake-data simulation. We create data from a known regression line, fit the model back to each fake dataset, and record how often nominal intervals for the slope contain the true value. The original page uses `rstanarm::stan_glm`. Here the model is the same Gaussian linear regression, fit with `statsmodels` OLS. For this simple problem, the usual t intervals and the flat-prior Bayesian intervals are nearly the same. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd import statsmodels.formula.api as smf root = Path("../../ROS-Examples") hibbs = pd.read_csv(root / "ElectionsEconomy/data/hibbs.dat", sep=r"\s+") rng = np.random.default_rng(1) hibbs.head() ``` ## Step 1: create the pretend world ```{python} a = 46.2 b = 3.1 sigma = 3.8 x = hibbs["growth"].to_numpy() n = len(x) pretend = pd.Series({"intercept": a, "slope": b, "sigma": sigma, "n": n}) pretend ``` ## Step 2: simulate one fake dataset ```{python} y = a + b * x + rng.normal(0, sigma, size=n) fake = pd.DataFrame({"x": x, "y": y}) fake.head() ``` ## Step 3: fit the model and check intervals ```{python} fit = smf.ols("y ~ x", data=fake).fit() fit.summary() ``` ```{python} def slope_interval(model, level): alpha = 1 - level lo, hi = model.conf_int(alpha=alpha).loc["x"] return float(lo), float(hi) for level in [0.50, 0.90]: lo, hi = slope_interval(fit, level) print(f"{int(level * 100)}% interval: ({lo:.2f}, {hi:.2f}); covers true slope? {lo < b < hi}") ``` ## Step 4: repeat the fake-data experiment ```{python} def one_coverage_sim(seed): local_rng = np.random.default_rng(seed) y_sim = a + b * x + local_rng.normal(0, sigma, size=n) model = smf.ols("y ~ x", data=pd.DataFrame({"x": x, "y": y_sim})).fit() covers = {} for level in [0.50, 0.68, 0.90, 0.95]: lo, hi = slope_interval(model, level) covers[f"ci_{int(level * 100)}"] = lo < b < hi slope_hat = model.params["x"] slope_se = model.bse["x"] covers["se_68"] = abs(b - slope_hat) < slope_se covers["se_95"] = abs(b - slope_hat) < 2 * slope_se return covers n_fake = 1000 coverage = pd.DataFrame([one_coverage_sim(s) for s in range(1, n_fake + 1)]) coverage.mean() ``` ```{python} coverage_summary = coverage.mean().rename({ "ci_50": "50% interval", "ci_68": "68% interval", "ci_90": "90% interval", "ci_95": "95% interval", "se_68": "estimate +/- 1 se", "se_95": "estimate +/- 2 se", }) coverage_summary.to_frame("empirical coverage") ``` Because the fake data were generated from the same linear-normal model that is fit back to the simulations, empirical coverage should be close to the nominal interval levels, up to Monte Carlo error from using only 1000 fake datasets.