Poststratification with simulated census and poll data

Source: Poststrat/poststrat2.Rmd

This page follows the R example that creates a small synthetic population table, samples respondents with differential nonresponse, fits a response model to the survey, and then averages predictions over the full population table.

Setup

Code
import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit

rng = np.random.default_rng(1702)

Poststratification table

The population is cross-classified by sex, age group, and ethnicity: 2 x 4 x 4 = 32 cells. The cell counts are artificial but intentionally imbalanced.

Code
poststrat = pd.MultiIndex.from_product(
    [range(1, 3), range(1, 5), range(1, 5)],
    names=["sex", "age", "eth"],
).to_frame(index=False)

p_sex = np.array([0.52, 0.48])
p_age = np.array([0.20, 0.25, 0.30, 0.25])
p_eth = np.array([0.70, 0.10, 0.10, 0.10])

poststrat["N"] = (
    250e6
    * p_sex[poststrat["sex"].to_numpy() - 1]
    * p_age[poststrat["age"].to_numpy() - 1]
    * p_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat.head()
sex age eth N
0 1 1 1 18200000.0
1 1 1 2 2600000.0
2 1 1 3 2600000.0
3 1 1 4 2600000.0
4 1 2 1 22750000.0

Differential response and sample composition

Women, older people, and whites are assigned higher response probabilities. The sampled survey is therefore not distributed like the population.

Code
p_response_baseline = 0.1
p_response_sex = np.array([1.0, 0.8])
p_response_age = np.array([1.0, 1.2, 1.6, 2.5])
p_response_eth = np.array([1.0, 0.8, 0.7, 0.6])

poststrat["p_response"] = (
    p_response_baseline
    * p_response_sex[poststrat["sex"].to_numpy() - 1]
    * p_response_age[poststrat["age"].to_numpy() - 1]
    * p_response_eth[poststrat["eth"].to_numpy() - 1]
)

n = 1000
sample_prob = poststrat["N"] * poststrat["p_response"]
sample_prob = sample_prob / sample_prob.sum()
people = rng.choice(poststrat.index.to_numpy(), size=n, replace=True, p=sample_prob)

sample_check = poststrat.copy()
sample_check["sample_share"] = np.bincount(people, minlength=len(poststrat)) / n
sample_check["population_share"] = poststrat["N"] / poststrat["N"].sum()
sample_check.head(10).round(3)
sex age eth N p_response sample_share population_share
0 1 1 1 18200000.0 0.100 0.050 0.073
1 1 1 2 2600000.0 0.080 0.011 0.010
2 1 1 3 2600000.0 0.070 0.003 0.010
3 1 1 4 2600000.0 0.060 0.007 0.010
4 1 2 1 22750000.0 0.120 0.075 0.091
5 1 2 2 3250000.0 0.096 0.014 0.013
6 1 2 3 3250000.0 0.084 0.008 0.013
7 1 2 4 3250000.0 0.072 0.009 0.013
8 1 3 1 27300000.0 0.160 0.141 0.109
9 1 3 2 3900000.0 0.128 0.020 0.016

Simulate survey outcomes

The true outcome model is a logistic regression with additive effects for the three poststratification variables.

Code
coef_intercept = 0.6
coef_sex = np.array([0.0, -0.2])
coef_age = np.array([0.0, -0.2, -0.3, -0.4])
coef_eth = np.array([0.0, 0.6, 0.3, 0.3])

eta_cells = (
    coef_intercept
    + coef_sex[poststrat["sex"].to_numpy() - 1]
    + coef_age[poststrat["age"].to_numpy() - 1]
    + coef_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat["prob_yes"] = expit(eta_cells)

y = rng.binomial(1, poststrat.loc[people, "prob_yes"].to_numpy())
fake = poststrat.loc[people, ["sex", "age", "eth"]].reset_index(drop=True)
fake["y"] = y
fake.head()
sex age eth y
0 2 2 1 1
1 1 1 1 1
2 1 2 1 0
3 1 2 3 1
4 1 2 4 1

Fit the survey model

The R source uses stan_glm(..., family=binomial). Here we fit the same additive logistic regression with weak Normal priors using the lightweight Bayesian helper.

Code
fit = bayes_logit("y ~ C(sex) + C(age) + C(eth)", data=fake, draws=4000, prior_scale=2.5, seed=1702)

fit.summary().round(3)
mean sd q5 q50 q95
Intercept 0.773 0.206 0.433 0.774 1.111
C(sex)[T.2] -0.533 0.131 -0.751 -0.531 -0.317
C(age)[T.2] -0.129 0.240 -0.519 -0.123 0.257
C(age)[T.3] -0.262 0.229 -0.646 -0.261 0.106
C(age)[T.4] -0.354 0.221 -0.712 -0.353 0.011
C(eth)[T.2] 0.600 0.229 0.222 0.606 0.980
C(eth)[T.3] -0.045 0.229 -0.427 -0.044 0.331
C(eth)[T.4] -0.067 0.271 -0.508 -0.068 0.383

Prediction over all poststratification cells

The fitted model is evaluated on every population cell. Comparing prob_yes and pred_est shows how much the finite survey has recovered about the true cell probabilities.

Code
poststrat["pred_est"] = fit.epred(poststrat).mean(axis=0)
poststrat[["sex", "age", "eth", "N", "prob_yes", "pred_est"]].head(12).round(3)
sex age eth N prob_yes pred_est
0 1 1 1 18200000.0 0.646 0.683
1 1 1 2 2600000.0 0.769 0.794
2 1 1 3 2600000.0 0.711 0.671
3 1 1 4 2600000.0 0.711 0.666
4 1 2 1 22750000.0 0.599 0.655
5 1 2 2 3250000.0 0.731 0.773
6 1 2 3 3250000.0 0.668 0.643
7 1 2 4 3250000.0 0.668 0.638
8 1 3 1 27300000.0 0.574 0.625
9 1 3 2 3900000.0 0.711 0.750
10 1 3 3 3900000.0 0.646 0.613
11 1 3 4 3900000.0 0.646 0.607
Code
pop_weights = poststrat["N"] / poststrat["N"].sum()
truth = np.average(poststrat["prob_yes"], weights=pop_weights)
raw_sample = fake["y"].mean()
poststrat_est = np.average(poststrat["pred_est"], weights=pop_weights)

pd.Series({
    "population truth": truth,
    "raw sample mean": raw_sample,
    "poststratified estimate": poststrat_est,
}).round(3)
population truth           0.593
raw sample mean            0.587
poststratified estimate    0.587
dtype: float64

The raw sample mean is biased toward the outcome distribution of the groups more likely to respond. Poststratification keeps the fitted conditional response model but swaps in the population cell frequencies.

Posterior uncertainty

Aggregate each posterior expected-prediction draw over the poststratification table.

Code
pred_sim = fit.epred(poststrat)
poststrat_sim = pred_sim @ pop_weights.to_numpy()

pd.Series({
    "mean": poststrat_sim.mean(),
    "sd": poststrat_sim.std(ddof=1),
    "2.5%": np.quantile(poststrat_sim, 0.025),
    "97.5%": np.quantile(poststrat_sim, 0.975),
}).round(3)
mean     0.587
sd       0.017
2.5%     0.553
97.5%    0.619
dtype: float64