Congress / predictive uncertainty

Source: Congress/congress.Rmd

This page ports the congressional-election example from Regression and Other Stories. The original uses rstanarm::stan_glm to fit a Gaussian regression for Democratic two-party vote share in 1988, then propagates posterior predictive uncertainty to the 1990 House elections. The Python version keeps statsmodels for least-squares summaries and uses the lightweight Bayesian linear-regression helper for posterior predictive simulation.

Setup and data

Code
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from python.bayes_glm import bayes_lm


def ros_root():
    candidates = [
        Path("../../ROS-Examples"),
        Path("../ROS-Examples"),
        Path("/Users/alal/tmp/ros-python-book/ROS-Examples"),
    ]
    for candidate in candidates:
        if candidate.exists():
            return candidate
    return candidates[0]

root = ros_root()
congress = pd.read_csv(root / "Congress/data/congress.csv")
congress.head()
inc86 inc88 inc90 v86 v88 v90 v86_adj v88_adj v90_adj
0 1 1 1 0.745036 0.772443 0.714029 0.745036 0.772443 0.714029
1 1 1 1 0.673845 0.636182 0.597050 0.673845 0.636182 0.597050
2 1 1 0 0.696457 0.664928 0.521043 0.696457 0.664928 0.521043
3 -1 -1 -1 0.464590 0.273834 0.234377 0.464590 0.273834 0.234377
4 -1 -1 0 0.391095 0.263613 0.477439 0.391095 0.263613 0.477439
Code
congress[["v86", "v88", "v90", "v86_adj", "v88_adj", "v90_adj", "inc88", "inc90"]].describe()
v86 v88 v90 v86_adj v88_adj v90_adj inc88 inc90
count 435.000000 435.000000 435.000000 435.000000 435.000000 435.000000 435.000000 435.000000
mean 0.573743 0.570189 0.554143 0.550386 0.543554 0.545543 0.195402 0.204598
std 0.263335 0.268501 0.263192 0.198182 0.199307 0.178325 0.945997 0.944046
min 0.000000 0.000000 0.000000 0.119715 0.172718 0.224384 -1.000000 -1.000000
25% 0.357020 0.334242 0.381630 0.357020 0.334242 0.381630 -1.000000 -1.000000
50% 0.591864 0.614920 0.582861 0.591864 0.614920 0.582861 1.000000 1.000000
75% 0.740939 0.735249 0.701648 0.740939 0.735249 0.701648 1.000000 1.000000
max 1.000000 1.000000 1.000000 0.866843 0.888181 0.849922 1.000000 1.000000

The raw vote shares contain uncontested races coded near 0 or 1. The adjusted variables keep the same substantive scale but are better suited for the linear regression used in the book.

Predict 1988 vote from 1986 vote and incumbency

Code
data88 = congress.rename(columns={"v88_adj": "vote", "v86_adj": "past_vote", "inc88": "inc"})
fit88 = smf.ols("vote ~ past_vote + inc", data=data88).fit()
fit88_bayes = bayes_lm("vote ~ past_vote + inc", data=data88, draws=4000, prior_scale=10.0, seed=84735)
fit88.summary()
OLS Regression Results
Dep. Variable: vote R-squared: 0.887
Model: OLS Adj. R-squared: 0.886
Method: Least Squares F-statistic: 1696.
Date: Sun, 31 May 2026 Prob (F-statistic): 2.82e-205
Time: 23:01:20 Log-Likelihood: 559.14
No. Observations: 435 AIC: -1112.
Df Residuals: 432 BIC: -1100.
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2380 0.017 14.068 0.000 0.205 0.271
past_vote 0.5210 0.032 16.169 0.000 0.458 0.584
inc 0.0964 0.007 14.284 0.000 0.083 0.110
Omnibus: 16.200 Durbin-Watson: 1.933
Prob(Omnibus): 0.000 Jarque-Bera (JB): 36.720
Skew: -0.074 Prob(JB): 1.06e-08
Kurtosis: 4.416 Cond. No. 13.9


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
coef_table = pd.DataFrame({
    "estimate": fit88.params,
    "std_error": fit88.bse,
    "t": fit88.tvalues,
})
coef_table
estimate std_error t
Intercept 0.237990 0.016917 14.068245
past_vote 0.520951 0.032219 16.169101
inc 0.096413 0.006750 14.283999

Predict the 1990 election district by district

The R code calls posterior_predict(fit88, newdata=data90). For this Gaussian linear model the helper draws coefficients and residual scale from the conjugate posterior and then generates replicated 1990 vote shares.

Code
rng = np.random.default_rng(84735)
data90 = pd.DataFrame({
    "past_vote": congress["v88_adj"],
    "inc": congress["inc90"],
})
pred90 = fit88_bayes.predict(data90, seed=84736)
dems_pred = (pred90 > 0.5).sum(axis=1)

pd.Series(dems_pred).describe()[["mean", "std", "min", "25%", "50%", "75%", "max"]]
mean    260.003500
std       2.508896
min     251.000000
25%     258.000000
50%     260.000000
75%     262.000000
max     269.000000
dtype: float64
Code
fig, ax = plt.subplots(figsize=(6, 3.5))
ax.hist(dems_pred, bins=np.arange(dems_pred.min() - 0.5, dems_pred.max() + 1.5), color="0.75", edgecolor="white")
ax.axvline(dems_pred.mean(), color="black", linewidth=2, label=f"mean = {dems_pred.mean():.1f}")
ax.set_xlabel("Predicted Democratic seats in 1990")
ax.set_ylabel("Posterior predictive simulations")
ax.legend(frameon=False)

Vote-share graphics

Code
def jitter_extremes(vote, rng, low=0.1, high=0.9):
    vote = np.asarray(vote)
    out = vote.astype(float).copy()
    out[vote < low] = rng.uniform(0.01, 0.04, size=(vote < low).sum())
    out[vote > high] = rng.uniform(0.96, 0.99, size=(vote > high).sum())
    return out

fig, axes = plt.subplots(1, 3, figsize=(13, 3.8))

v88_hist = congress["v88"].clip(0.0001, 0.9999)
axes[0].hist(v88_hist, bins=np.arange(0, 1.05, 0.05), color="0.75", edgecolor="white")
axes[0].set_xlabel("Democratic share of two-party vote")
axes[0].set_yticks([])
axes[0].set_title("Congressional elections in 1988")

j_v86 = jitter_extremes(congress["v86"], rng)
j_v88 = jitter_extremes(congress["v88"], rng)
for inc, marker, label in [(0, "o", "open seat"), (1, "o", "Dem. incumbent"), (-1, "x", "Rep. incumbent")]:
    ok = congress["inc88"] == inc
    kwargs = {"s": 25, "label": label}
    if inc == 0:
        axes[1].scatter(j_v86[ok], j_v88[ok], facecolors="none", edgecolors="black", **kwargs)
    elif inc == 1:
        axes[1].scatter(j_v86[ok], j_v88[ok], color="black", marker=marker, **kwargs)
    else:
        axes[1].scatter(j_v86[ok], j_v88[ok], color="black", marker=marker, **kwargs)
axes[1].plot([0, 1], [0, 1], color="0.4", linewidth=0.8)
axes[1].set_xlabel("Democratic vote share in 1986")
axes[1].set_ylabel("Democratic vote share in 1988")
axes[1].set_title("Raw data")

for inc, marker, label in [(0, "o", "open seat"), (1, "o", "Dem. incumbent"), (-1, "x", "Rep. incumbent")]:
    ok = congress["inc88"] == inc
    if inc == 0:
        axes[2].scatter(congress.loc[ok, "v86_adj"], congress.loc[ok, "v88_adj"], facecolors="none", edgecolors="black", s=25, label=label)
    elif inc == 1:
        axes[2].scatter(congress.loc[ok, "v86_adj"], congress.loc[ok, "v88_adj"], color="black", s=25, label=label)
    else:
        axes[2].scatter(congress.loc[ok, "v86_adj"], congress.loc[ok, "v88_adj"], color="black", marker=marker, s=25, label=label)
axes[2].plot([0, 1], [0, 1], color="0.4", linewidth=0.8)
axes[2].set_xlabel("Adjusted Dem. vote share in 1986")
axes[2].set_ylabel("Adjusted Dem. vote share in 1988")
axes[2].set_title("Adjusted data")
axes[2].legend(frameon=False, fontsize=8)

for ax in axes[1:]:
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect("equal", adjustable="box")
fig.tight_layout()

CmdStanPy analogue

rstanarm::stan_glm supplies weakly informative priors automatically. A direct CmdStanPy version is useful if the prior needs to be explicit; for the book’s predictive calculation, the helper fit above reproduces the core posterior-predictive workflow without requiring Stan at render time.

Code
from cmdstanpy import CmdStanModel

# A minimal Stan model would put normal priors on alpha/beta and a half-normal
# prior on sigma, then generate y_rep for the 1990 design matrix in
# generated quantities.  Compile and sample it only when exact prior matching is
# important for the exercise.