Poststratification with party identification

Source: Poststrat/poststrat.Rmd

This example reproduces the chapter’s small poststratification calculation for a 2016 CBS poll. The party-ID mix among the poll's respondents need not match that of the population; poststratification replaces the sample party-ID mix with an assumed population mix and averages predicted Trump support over those cells.

Setup and simulated poll

Code

import numpy as np
import pandas as pd
from python.bayes_glm import bayes_logit

rng = np.random.default_rng(1701)

The R source builds a poll from the reported effective sample sizes and within-party candidate preferences. vote = 1 denotes Trump support, vote = 0 denotes Clinton support, and NaN collects respondents who preferred neither major-party candidate.

Code

pid_names = ["Republican", "Democrat", "Independent"]
n_pid = np.array([254, 282, 242])
major_party_rates = {
    "Republican": (0.77, 0.08),
    "Democrat": (0.05, 0.89),
    "Independent": (0.36, 0.38),
}

rows = []
cell_means = []
for name, n in zip(pid_names, n_pid):
    n_trump, n_clinton = np.round(np.array(major_party_rates[name]) * n).astype(int)
    n_other = int(n - n_trump - n_clinton)
    votes = np.r_[np.ones(n_trump), np.zeros(n_clinton), np.repeat(np.nan, n_other)]
    rows.extend({"pid": name, "vote": v} for v in votes)
    cell_means.append(np.nanmean(votes))

poll = pd.DataFrame(rows)
cell_summary = (
    poll.groupby("pid", sort=False)["vote"]
    .agg(n="size", major_party_n="count", trump_share="mean")
    .reindex(pid_names)
)
cell_summary.round(3)

	n	major_party_n	trump_share
pid
Republican	254	216	0.907
Democrat	282	265	0.053
Independent	242	179	0.486

Direct and poststratified estimates

The direct estimate simply averages the poll respondents with a major-party preference. The poststratified estimate averages the three party-specific means using the target population distribution from the R page.

Code

poststrat = pd.DataFrame({
    "pid": pid_names,
    "N": [0.33, 0.36, 0.31],
    "cell_mean": cell_means,
})

pd.Series({
    "raw poll mean": poll["vote"].mean(skipna=True),
    "poststratified mean": np.average(poststrat["cell_mean"], weights=poststrat["N"]),
}).round(3)

raw poll mean          0.450
poststratified mean    0.469
dtype: float64

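The adjustment is transparent here: among respondents with a major-party preference, Democrats are overrepresented relative to the target population (about 40% of the sample versus 36% of the target) and Independents are underrepresented (about 27% versus 31%), while Republicans are close to their target share. Reweighting to the target mix therefore raises the Trump share, from 0.450 in the raw sample to 0.469.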

Code

poststrat.round(3)

	pid	N	cell_mean
0	Republican	0.33	0.907
1	Democrat	0.36	0.053
2	Independent	0.31	0.486
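To see where the adjustment comes from, the raw mean can be rewritten as a weighted average of the same cell means, with the sample's own composition as weights. The sketch below is not from the R page; it uses the Trump and major-party counts implied by the rounding in the simulation code above and simply swaps one set of weights for the other.

```python
import numpy as np

# Counts implied by rounding rate * n in the simulation code (an assumption
# of this sketch; check them against cell_summary if the inputs change).
pid = ["Republican", "Democrat", "Independent"]
trump = np.array([196, 14, 87])        # Trump supporters per cell
major = np.array([216, 265, 179])      # major-party respondents per cell
target = np.array([0.33, 0.36, 0.31])  # target population shares

cell_mean = trump / major
sample_share = major / major.sum()     # weights implicitly used by the raw mean

raw_mean = np.sum(sample_share * cell_mean)   # equals trump.sum() / major.sum()
post_mean = np.sum(target * cell_mean)        # same cell means, target weights

for name, s, t in zip(pid, sample_share, target):
    print(f"{name:<12} sample share {s:.3f}   target share {t:.3f}")
print(f"raw mean {raw_mean:.3f}   poststratified mean {post_mean:.3f}")
```

Only the weights differ between the two estimates, so any gap between them is due entirely to the mismatch between the sample and target compositions.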

Logistic regression version

The R code fits stan_glm(vote ~ factor(pid), family=binomial). Here we use a Bayesian logistic regression with party indicators and weak Normal priors; because the categorical predictor is saturated, the posterior fitted probabilities should be close to the party-specific sample proportions, with only a slight pull from the prior.

Code

poll_observed = poll.dropna(subset=["vote"]).copy()
fit = bayes_logit("vote ~ C(pid)", data=poll_observed, draws=4000, prior_scale=2.5, seed=1701)

fit.summary().round(3)

	mean	sd	q5	q50	q95
Intercept	-2.767	0.259	-3.195	-2.766	-2.348
C(pid)[T.Independent]	2.703	0.296	2.220	2.703	3.178
C(pid)[T.Republican]	5.007	0.347	4.454	5.003	5.585
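As a rough sanity check on the coefficient table, the maximum-likelihood fit of a saturated model can be written down from the cell counts alone: the intercept is the log-odds of the baseline cell (Democrat, under the treatment coding shown in the table), and each coefficient is a difference in log-odds from that baseline. The sketch below is not part of the original page; it uses the counts implied by the simulation code and no prior, so its values should sit near the posterior means without matching them exactly.

```python
import numpy as np

# (Trump, Clinton) counts per cell, as implied by rounding rate * n
# in the simulation code. These are assumptions of this sketch.
counts = {
    "Democrat": (14, 251),
    "Independent": (87, 92),
    "Republican": (196, 20),
}

def log_odds(n_trump, n_clinton):
    """Empirical log-odds of Trump support in one cell."""
    return np.log(n_trump / n_clinton)

baseline = log_odds(*counts["Democrat"])
mle = {
    "Intercept": baseline,
    "C(pid)[T.Independent]": log_odds(*counts["Independent"]) - baseline,
    "C(pid)[T.Republican]": log_odds(*counts["Republican"]) - baseline,
}
for name, value in mle.items():
    print(f"{name:<24}{value:7.3f}")
```

The no-prior values come out somewhat larger in magnitude than the posterior means in the table, which is the direction a weak prior centered at zero would be expected to pull them (the centering of the helper's prior is an assumption here).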

Code

poststrat["logit_pred"] = fit.epred(poststrat[["pid"]]).mean(axis=0)
logit_poststrat = np.average(poststrat["logit_pred"], weights=poststrat["N"])

pd.concat([
    poststrat[["pid", "N", "cell_mean", "logit_pred"]],
    pd.DataFrame({"pid": ["weighted total"], "N": [poststrat["N"].sum()], "cell_mean": [np.average(poststrat["cell_mean"], weights=poststrat["N"])], "logit_pred": [logit_poststrat]})
], ignore_index=True).round(3)

	pid	N	cell_mean	logit_pred
0	Republican	0.33	0.907	0.902
1	Democrat	0.36	0.053	0.061
2	Independent	0.31	0.486	0.484
3	weighted total	1.00	0.469	0.470

Posterior uncertainty

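In the R page, rstanarm::posterior_epred() returns posterior draws of the expected outcome for each row of the poststratification table, and those draws are then averaged over the cells using the population weights. The helper fit exposes the same kind of expected-prediction draws directly.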

Code

pred_draws = fit.epred(poststrat[["pid"]])
poststrat_draws = pred_draws @ (poststrat["N"].to_numpy() / poststrat["N"].sum())

pd.Series({
    "mean": poststrat_draws.mean(),
    "sd": poststrat_draws.std(ddof=1),
    "2.5%": np.quantile(poststrat_draws, 0.025),
    "97.5%": np.quantile(poststrat_draws, 0.975),
}).round(3)

mean     0.470
sd       0.015
2.5%     0.440
97.5%    0.498
dtype: float64
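The same draw-by-draw weighting can be illustrated without the helper. The sketch below swaps the logistic regression for a simpler stand-in: independent Beta posteriors for each cell under uniform Beta(1, 1) priors, which is a different model from the one fitted above. The counts are those implied by the simulation code, and the seed and the name sketch_rng are arbitrary choices made for this illustration.

```python
import numpy as np

sketch_rng = np.random.default_rng(0)  # arbitrary seed, separate from the page's rng

trump = np.array([196, 14, 87])        # Trump supporters per cell
major = np.array([216, 265, 179])      # major-party respondents per cell
weights = np.array([0.33, 0.36, 0.31])
weights = weights / weights.sum()      # normalize in case shares do not sum to 1

# Independent Beta(1 + successes, 1 + failures) posteriors; one column per cell.
n_draws = 4000
cell_draws = sketch_rng.beta(1 + trump, 1 + major - trump, size=(n_draws, 3))

# Weight each posterior draw by the target shares to get draws of the
# poststratified estimate, then summarize them.
post_draws = cell_draws @ weights

print("mean", round(post_draws.mean(), 3), "sd", round(post_draws.std(ddof=1), 3))
print("95% interval", np.quantile(post_draws, [0.025, 0.975]).round(3))
```

Because the model is saturated and the priors are weak in both cases, this stand-in should give a summary close to the one above, though not identical to it.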

The original page also adds a rough extra uncertainty term of 0.02 to represent uncertainty in the target population proportions. That uncertainty is not estimated from this tiny example; it is an explicit modeling choice about the poststratification frame.

Code

poststrat_draws_with_frame_error = poststrat_draws + rng.normal(0, 0.02, size=len(poststrat_draws))
pd.Series({
    "mean": poststrat_draws_with_frame_error.mean(),
    "sd": poststrat_draws_with_frame_error.std(ddof=1),
}).round(3)

mean    0.470
sd      0.025
dtype: float64
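The combined standard deviation can be checked by hand. The added noise is drawn independently of the posterior draws, so the variances add and the standard deviations combine in quadrature. A minimal check, using the rounded values reported above:

```python
import numpy as np

model_sd = 0.015   # posterior sd of the poststratified estimate (rounded, from above)
frame_sd = 0.02    # assumed sd for error in the target proportions

# Variances of independent terms add, so sds combine as a root sum of squares.
total_sd = np.sqrt(model_sd**2 + frame_sd**2)
print(round(total_sd, 3))
```

The frame term dominates: it contributes 0.0004 of the total variance against 0.000225 from the model, so the assumed 0.02 matters more to the final interval than the sampling uncertainty in the poll itself.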
