This example checks a model-fitting procedure by fake-data simulation. We create data from a known regression line, fit the model back to each fake dataset, and record how often nominal intervals for the slope contain the true value.
The original page uses rstanarm::stan_glm. Here the model is the same Gaussian linear regression, fit with statsmodels OLS. For this simple problem, the usual t intervals and the flat-prior Bayesian intervals are nearly the same.
Setup and data
Code
from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smfroot = Path("../../ROS-Examples")hibbs = pd.read_csv(root /"ElectionsEconomy/data/hibbs.dat", sep=r"\s+")rng = np.random.default_rng(1)hibbs.head()
year
growth
vote
inc_party_candidate
other_candidate
0
1952
2.40
44.60
Stevenson
Eisenhower
1
1956
2.89
57.76
Eisenhower
Stevenson
2
1960
0.85
49.91
Nixon
Kennedy
3
1964
4.21
61.34
Johnson
Goldwater
4
1968
3.02
49.60
Humphrey
Nixon
Step 1: create the pretend world
Code
a =46.2b =3.1sigma =3.8x = hibbs["growth"].to_numpy()n =len(x)pretend = pd.Series({"intercept": a, "slope": b, "sigma": sigma, "n": n})pretend
intercept 46.2
slope 3.1
sigma 3.8
n 16.0
dtype: float64
Step 2: simulate one fake dataset
Code
y = a + b * x + rng.normal(0, sigma, size=n)fake = pd.DataFrame({"x": x, "y": y})fake.head()
x
y
0
2.40
54.953220
1
2.89
58.281149
2
0.85
50.090661
3
4.21
54.299003
4
3.02
59.002352
Step 3: fit the model and check intervals
Code
fit = smf.ols("y ~ x", data=fake).fit()fit.summary()
OLS Regression Results
Dep. Variable:
y
R-squared:
0.758
Model:
OLS
Adj. R-squared:
0.740
Method:
Least Squares
F-statistic:
43.79
Date:
Sun, 31 May 2026
Prob (F-statistic):
1.16e-05
Time:
23:02:00
Log-Likelihood:
-35.690
No. Observations:
16
AIC:
75.38
Df Residuals:
14
BIC:
76.93
Df Model:
1
Covariance Type:
nonrobust
coef
std err
t
P>|t|
[0.025
0.975]
Intercept
46.9749
1.037
45.278
0.000
44.750
49.200
x
2.9471
0.445
6.617
0.000
1.992
3.902
Omnibus:
2.138
Durbin-Watson:
2.423
Prob(Omnibus):
0.343
Jarque-Bera (JB):
1.558
Skew:
-0.731
Prob(JB):
0.459
Kurtosis:
2.552
Cond. No.
4.54
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
def slope_interval(model, level): alpha =1- level lo, hi = model.conf_int(alpha=alpha).loc["x"]returnfloat(lo), float(hi)for level in [0.50, 0.90]: lo, hi = slope_interval(fit, level)print(f"{int(level *100)}% interval: ({lo:.2f}, {hi:.2f}); covers true slope? {lo < b < hi}")
Because the fake data were generated from the same linear-normal model that is fit back to the simulations, empirical coverage should be close to the nominal interval levels, up to Monte Carlo error from using only 1000 fake datasets.
# Elections Economy: coverage simulationSource: `ElectionsEconomy/hibbs_coverage.Rmd`This example checks a model-fitting procedure by fake-data simulation. We create data from a known regression line, fit the model back to each fake dataset, and record how often nominal intervals for the slope contain the true value.The original page uses `rstanarm::stan_glm`. Here the model is the same Gaussian linear regression, fit with `statsmodels` OLS. For this simple problem, the usual t intervals and the flat-prior Bayesian intervals are nearly the same.## Setup and data```{python}from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smfroot = Path("../../ROS-Examples")hibbs = pd.read_csv(root /"ElectionsEconomy/data/hibbs.dat", sep=r"\s+")rng = np.random.default_rng(1)hibbs.head()```## Step 1: create the pretend world```{python}a =46.2b =3.1sigma =3.8x = hibbs["growth"].to_numpy()n =len(x)pretend = pd.Series({"intercept": a, "slope": b, "sigma": sigma, "n": n})pretend```## Step 2: simulate one fake dataset```{python}y = a + b * x + rng.normal(0, sigma, size=n)fake = pd.DataFrame({"x": x, "y": y})fake.head()```## Step 3: fit the model and check intervals```{python}fit = smf.ols("y ~ x", data=fake).fit()fit.summary()``````{python}def slope_interval(model, level): alpha =1- level lo, hi = model.conf_int(alpha=alpha).loc["x"]returnfloat(lo), float(hi)for level in [0.50, 0.90]: lo, hi = slope_interval(fit, level)print(f"{int(level *100)}% interval: ({lo:.2f}, {hi:.2f}); covers true slope? {lo < b < hi}")```## Step 4: repeat the fake-data experiment```{python}def one_coverage_sim(seed): local_rng = np.random.default_rng(seed) y_sim = a + b * x + local_rng.normal(0, sigma, size=n) model = smf.ols("y ~ x", data=pd.DataFrame({"x": x, "y": y_sim})).fit() covers = {}for level in [0.50, 0.68, 0.90, 0.95]: lo, hi = slope_interval(model, level) covers[f"ci_{int(level *100)}"] = lo < b < hi slope_hat = model.params["x"] slope_se = model.bse["x"] covers["se_68"] =abs(b - slope_hat) < slope_se covers["se_95"] =abs(b - slope_hat) <2* slope_sereturn coversn_fake =1000coverage = pd.DataFrame([one_coverage_sim(s) for s inrange(1, n_fake +1)])coverage.mean()``````{python}coverage_summary = coverage.mean().rename({"ci_50": "50% interval","ci_68": "68% interval","ci_90": "90% interval","ci_95": "95% interval","se_68": "estimate +/- 1 se","se_95": "estimate +/- 2 se",})coverage_summary.to_frame("empirical coverage")```Because the fake data were generated from the same linear-normal model that is fit back to the simulations, empirical coverage should be close to the nominal interval levels, up to Monte Carlo error from using only 1000 fake datasets.