Code
from pathlib import Path
import numpy as np
import pandas as pd
from python.bayes_glm import bayes_logit
rng = np.random.default_rng(1701)Source: Poststrat/poststrat.Rmd
This example reproduces the chapter’s small poststratification calculation for a 2016 CBS poll. The raw poll has unequal partisan composition; poststratification replaces the sample party-ID mix with an assumed population mix and averages predicted Trump support over those cells.
from pathlib import Path
import numpy as np
import pandas as pd
from python.bayes_glm import bayes_logit
rng = np.random.default_rng(1701)The R source builds a poll from the reported effective sample sizes and within-party candidate preferences. vote = 1 denotes Trump support, vote = 0 denotes Clinton support, and NaN collects respondents who preferred neither major-party candidate.
pid_names = ["Republican", "Democrat", "Independent"]
n_pid = np.array([254, 282, 242])
major_party_rates = {
"Republican": (0.77, 0.08),
"Democrat": (0.05, 0.89),
"Independent": (0.36, 0.38),
}
rows = []
cell_means = []
for name, n in zip(pid_names, n_pid):
n_trump, n_clinton = np.round(np.array(major_party_rates[name]) * n).astype(int)
n_other = int(n - n_trump - n_clinton)
votes = np.r_[np.ones(n_trump), np.zeros(n_clinton), np.repeat(np.nan, n_other)]
rows.extend({"pid": name, "vote": v} for v in votes)
cell_means.append(np.nanmean(votes))
poll = pd.DataFrame(rows)
cell_summary = (
poll.groupby("pid", sort=False)["vote"]
.agg(n="size", major_party_n="count", trump_share="mean")
.reindex(pid_names)
)
cell_summary.round(3)| n | major_party_n | trump_share | |
|---|---|---|---|
| pid | |||
| Republican | 254 | 216 | 0.907 |
| Democrat | 282 | 265 | 0.053 |
| Independent | 242 | 179 | 0.486 |
The direct estimate simply averages the poll respondents with a major-party preference. The poststratified estimate averages the three party-specific means using the target population distribution from the R page.
poststrat = pd.DataFrame({
"pid": pid_names,
"N": [0.33, 0.36, 0.31],
"cell_mean": cell_means,
})
pd.Series({
"raw poll mean": poll["vote"].mean(skipna=True),
"poststratified mean": np.average(poststrat["cell_mean"], weights=poststrat["N"]),
}).round(3)raw poll mean 0.450
poststratified mean 0.469
dtype: float64
The adjustment is transparent here: Republicans are somewhat overrepresented relative to the target population while Democrats are underrepresented, so the target-weighted Trump share is lower than the raw sample mean.
poststrat.round(3)| pid | N | cell_mean | |
|---|---|---|---|
| 0 | Republican | 0.33 | 0.907 |
| 1 | Democrat | 0.36 | 0.053 |
| 2 | Independent | 0.31 | 0.486 |
The R code fits stan_glm(vote ~ factor(pid), family=binomial). Use a Bayesian logistic regression with party indicators and weak Normal priors; with a saturated categorical predictor its posterior fitted probabilities are shrunk versions of the party-specific sample proportions.
poll_observed = poll.dropna(subset=["vote"]).copy()
fit = bayes_logit("vote ~ C(pid)", data=poll_observed, draws=4000, prior_scale=2.5, seed=1701)
fit.summary().round(3)| mean | sd | q5 | q50 | q95 | |
|---|---|---|---|---|---|
| Intercept | -2.767 | 0.259 | -3.195 | -2.766 | -2.348 |
| C(pid)[T.Independent] | 2.703 | 0.296 | 2.220 | 2.703 | 3.178 |
| C(pid)[T.Republican] | 5.007 | 0.347 | 4.454 | 5.003 | 5.585 |
poststrat["logit_pred"] = fit.epred(poststrat[["pid"]]).mean(axis=0)
logit_poststrat = np.average(poststrat["logit_pred"], weights=poststrat["N"])
pd.concat([
poststrat[["pid", "N", "cell_mean", "logit_pred"]],
pd.DataFrame({"pid": ["weighted total"], "N": [poststrat["N"].sum()], "cell_mean": [np.average(poststrat["cell_mean"], weights=poststrat["N"])], "logit_pred": [logit_poststrat]})
], ignore_index=True).round(3)| pid | N | cell_mean | logit_pred | |
|---|---|---|---|---|
| 0 | Republican | 0.33 | 0.907 | 0.902 |
| 1 | Democrat | 0.36 | 0.053 | 0.061 |
| 2 | Independent | 0.31 | 0.486 | 0.484 |
| 3 | weighted total | 1.00 | 0.469 | 0.470 |
rstanarm::posterior_epred() averages posterior predictions over the poststratification table. The helper fit exposes the same kind of expected-prediction draws directly.
pred_draws = fit.epred(poststrat[["pid"]])
poststrat_draws = pred_draws @ (poststrat["N"].to_numpy() / poststrat["N"].sum())
pd.Series({
"mean": poststrat_draws.mean(),
"sd": poststrat_draws.std(ddof=1),
"2.5%": np.quantile(poststrat_draws, 0.025),
"97.5%": np.quantile(poststrat_draws, 0.975),
}).round(3)mean 0.470
sd 0.015
2.5% 0.440
97.5% 0.498
dtype: float64
The original page also adds a rough extra uncertainty term of 0.02 to represent uncertainty in the target population proportions. That uncertainty is not estimated from this tiny example; it is an explicit modeling choice about the poststratification frame.
poststrat_draws_with_frame_error = poststrat_draws + rng.normal(0, 0.02, size=len(poststrat_draws))
pd.Series({
"mean": poststrat_draws_with_frame_error.mean(),
"sd": poststrat_draws_with_frame_error.std(ddof=1),
}).round(3)mean 0.470
sd 0.025
dtype: float64