Poststratification with party identification

Source: Poststrat/poststrat.Rmd

This example reproduces the chapter’s small poststratification calculation for a 2016 CBS poll. The raw poll has unequal partisan composition; poststratification replaces the sample party-ID mix with an assumed population mix and averages predicted Trump support over those cells.

Setup and simulated poll

Code
from pathlib import Path
import numpy as np
import pandas as pd
from python.bayes_glm import bayes_logit

rng = np.random.default_rng(1701)

The R source builds a poll from the reported effective sample sizes and within-party candidate preferences. vote = 1 denotes Trump support, vote = 0 denotes Clinton support, and NaN collects respondents who preferred neither major-party candidate.

Code
pid_names = ["Republican", "Democrat", "Independent"]
n_pid = np.array([254, 282, 242])
major_party_rates = {
    "Republican": (0.77, 0.08),
    "Democrat": (0.05, 0.89),
    "Independent": (0.36, 0.38),
}

rows = []
cell_means = []
for name, n in zip(pid_names, n_pid):
    n_trump, n_clinton = np.round(np.array(major_party_rates[name]) * n).astype(int)
    n_other = int(n - n_trump - n_clinton)
    votes = np.r_[np.ones(n_trump), np.zeros(n_clinton), np.repeat(np.nan, n_other)]
    rows.extend({"pid": name, "vote": v} for v in votes)
    cell_means.append(np.nanmean(votes))

poll = pd.DataFrame(rows)
cell_summary = (
    poll.groupby("pid", sort=False)["vote"]
    .agg(n="size", major_party_n="count", trump_share="mean")
    .reindex(pid_names)
)
cell_summary.round(3)
n major_party_n trump_share
pid
Republican 254 216 0.907
Democrat 282 265 0.053
Independent 242 179 0.486

Direct and poststratified estimates

The direct estimate simply averages the poll respondents with a major-party preference. The poststratified estimate averages the three party-specific means using the target population distribution from the R page.

Code
poststrat = pd.DataFrame({
    "pid": pid_names,
    "N": [0.33, 0.36, 0.31],
    "cell_mean": cell_means,
})

pd.Series({
    "raw poll mean": poll["vote"].mean(skipna=True),
    "poststratified mean": np.average(poststrat["cell_mean"], weights=poststrat["N"]),
}).round(3)
raw poll mean          0.450
poststratified mean    0.469
dtype: float64

The adjustment is transparent here: Republicans are somewhat overrepresented relative to the target population while Democrats are underrepresented, so the target-weighted Trump share is lower than the raw sample mean.

Code
poststrat.round(3)
pid N cell_mean
0 Republican 0.33 0.907
1 Democrat 0.36 0.053
2 Independent 0.31 0.486

Logistic regression version

The R code fits stan_glm(vote ~ factor(pid), family=binomial). Use a Bayesian logistic regression with party indicators and weak Normal priors; with a saturated categorical predictor its posterior fitted probabilities are shrunk versions of the party-specific sample proportions.

Code
poll_observed = poll.dropna(subset=["vote"]).copy()
fit = bayes_logit("vote ~ C(pid)", data=poll_observed, draws=4000, prior_scale=2.5, seed=1701)

fit.summary().round(3)
mean sd q5 q50 q95
Intercept -2.767 0.259 -3.195 -2.766 -2.348
C(pid)[T.Independent] 2.703 0.296 2.220 2.703 3.178
C(pid)[T.Republican] 5.007 0.347 4.454 5.003 5.585
Code
poststrat["logit_pred"] = fit.epred(poststrat[["pid"]]).mean(axis=0)
logit_poststrat = np.average(poststrat["logit_pred"], weights=poststrat["N"])

pd.concat([
    poststrat[["pid", "N", "cell_mean", "logit_pred"]],
    pd.DataFrame({"pid": ["weighted total"], "N": [poststrat["N"].sum()], "cell_mean": [np.average(poststrat["cell_mean"], weights=poststrat["N"])], "logit_pred": [logit_poststrat]})
], ignore_index=True).round(3)
pid N cell_mean logit_pred
0 Republican 0.33 0.907 0.902
1 Democrat 0.36 0.053 0.061
2 Independent 0.31 0.486 0.484
3 weighted total 1.00 0.469 0.470

Posterior uncertainty

rstanarm::posterior_epred() averages posterior predictions over the poststratification table. The helper fit exposes the same kind of expected-prediction draws directly.

Code
pred_draws = fit.epred(poststrat[["pid"]])
poststrat_draws = pred_draws @ (poststrat["N"].to_numpy() / poststrat["N"].sum())

pd.Series({
    "mean": poststrat_draws.mean(),
    "sd": poststrat_draws.std(ddof=1),
    "2.5%": np.quantile(poststrat_draws, 0.025),
    "97.5%": np.quantile(poststrat_draws, 0.975),
}).round(3)
mean     0.470
sd       0.015
2.5%     0.440
97.5%    0.498
dtype: float64

The original page also adds a rough extra uncertainty term of 0.02 to represent uncertainty in the target population proportions. That uncertainty is not estimated from this tiny example; it is an explicit modeling choice about the poststratification frame.

Code
poststrat_draws_with_frame_error = poststrat_draws + rng.normal(0, 0.02, size=len(poststrat_draws))
pd.Series({
    "mean": poststrat_draws_with_frame_error.mean(),
    "sd": poststrat_draws_with_frame_error.std(ddof=1),
}).round(3)
mean    0.470
sd      0.025
dtype: float64