Poststratification with simulated census and poll data

Source: Poststrat/poststrat2.Rmd

This page follows the R example that creates a small synthetic population table, samples respondents with differential nonresponse, fits a response model to the survey, and then averages predictions over the full population table.

Setup

Code

import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit

rng = np.random.default_rng(1702)

Poststratification table

The population is cross-classified by sex, age group, and ethnicity: 2 x 4 x 4 = 32 cells. The cell counts are artificial but intentionally imbalanced.

Code

poststrat = pd.MultiIndex.from_product(
    [range(1, 3), range(1, 5), range(1, 5)],
    names=["sex", "age", "eth"],
).to_frame(index=False)

p_sex = np.array([0.52, 0.48])
p_age = np.array([0.20, 0.25, 0.30, 0.25])
p_eth = np.array([0.70, 0.10, 0.10, 0.10])

poststrat["N"] = (
    250e6
    * p_sex[poststrat["sex"].to_numpy() - 1]
    * p_age[poststrat["age"].to_numpy() - 1]
    * p_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat.head()

	sex	age	eth	N
0	1	1	1	18200000.0
1	1	1	2	2600000.0
2	1	1	3	2600000.0
3	1	1	4	2600000.0
4	1	2	1	22750000.0

Differential response and sample composition

Women, older people, and whites are assigned higher response probabilities. The sampled survey is therefore not distributed like the population.

Code

p_response_baseline = 0.1
p_response_sex = np.array([1.0, 0.8])
p_response_age = np.array([1.0, 1.2, 1.6, 2.5])
p_response_eth = np.array([1.0, 0.8, 0.7, 0.6])

poststrat["p_response"] = (
    p_response_baseline
    * p_response_sex[poststrat["sex"].to_numpy() - 1]
    * p_response_age[poststrat["age"].to_numpy() - 1]
    * p_response_eth[poststrat["eth"].to_numpy() - 1]
)

n = 1000
sample_prob = poststrat["N"] * poststrat["p_response"]
sample_prob = sample_prob / sample_prob.sum()
people = rng.choice(poststrat.index.to_numpy(), size=n, replace=True, p=sample_prob)

sample_check = poststrat.copy()
sample_check["sample_share"] = np.bincount(people, minlength=len(poststrat)) / n
sample_check["population_share"] = poststrat["N"] / poststrat["N"].sum()
sample_check.head(10).round(3)

	sex	age	eth	N	p_response	sample_share	population_share
0	1	1	1	18200000.0	0.100	0.050	0.073
1	1	1	2	2600000.0	0.080	0.011	0.010
2	1	1	3	2600000.0	0.070	0.003	0.010
3	1	1	4	2600000.0	0.060	0.007	0.010
4	1	2	1	22750000.0	0.120	0.075	0.091
5	1	2	2	3250000.0	0.096	0.014	0.013
6	1	2	3	3250000.0	0.084	0.008	0.013
7	1	2	4	3250000.0	0.072	0.009	0.013
8	1	3	1	27300000.0	0.160	0.141	0.109
9	1	3	2	3900000.0	0.128	0.020	0.016

Simulate survey outcomes

The true outcome model is a logistic regression with additive effects for the three poststratification variables.

Code

coef_intercept = 0.6
coef_sex = np.array([0.0, -0.2])
coef_age = np.array([0.0, -0.2, -0.3, -0.4])
coef_eth = np.array([0.0, 0.6, 0.3, 0.3])

eta_cells = (
    coef_intercept
    + coef_sex[poststrat["sex"].to_numpy() - 1]
    + coef_age[poststrat["age"].to_numpy() - 1]
    + coef_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat["prob_yes"] = expit(eta_cells)

y = rng.binomial(1, poststrat.loc[people, "prob_yes"].to_numpy())
fake = poststrat.loc[people, ["sex", "age", "eth"]].reset_index(drop=True)
fake["y"] = y
fake.head()

	sex	age	eth	y
0	2	2	1	1
1	1	1	1	1
2	1	2	1	0
3	1	2	3	1
4	1	2	4	1

Fit the survey model

The R source uses stan_glm(..., family=binomial). Here we fit the same additive logistic regression with weak Normal priors using the lightweight Bayesian helper.

Code

fit = bayes_logit("y ~ C(sex) + C(age) + C(eth)", data=fake, draws=4000, prior_scale=2.5, seed=1702)

fit.summary().round(3)

	mean	sd	q5	q50	q95
Intercept	0.773	0.206	0.433	0.774	1.111
C(sex)[T.2]	-0.533	0.131	-0.751	-0.531	-0.317
C(age)[T.2]	-0.129	0.240	-0.519	-0.123	0.257
C(age)[T.3]	-0.262	0.229	-0.646	-0.261	0.106
C(age)[T.4]	-0.354	0.221	-0.712	-0.353	0.011
C(eth)[T.2]	0.600	0.229	0.222	0.606	0.980
C(eth)[T.3]	-0.045	0.229	-0.427	-0.044	0.331
C(eth)[T.4]	-0.067	0.271	-0.508	-0.068	0.383

Prediction over all poststratification cells

The fitted model is evaluated on every population cell. Comparing prob_yes and pred_est shows how much the finite survey has recovered about the true cell probabilities.

Code

poststrat["pred_est"] = fit.epred(poststrat).mean(axis=0)
poststrat[["sex", "age", "eth", "N", "prob_yes", "pred_est"]].head(12).round(3)

	sex	age	eth	N	prob_yes	pred_est
0	1	1	1	18200000.0	0.646	0.683
1	1	1	2	2600000.0	0.769	0.794
2	1	1	3	2600000.0	0.711	0.671
3	1	1	4	2600000.0	0.711	0.666
4	1	2	1	22750000.0	0.599	0.655
5	1	2	2	3250000.0	0.731	0.773
6	1	2	3	3250000.0	0.668	0.643
7	1	2	4	3250000.0	0.668	0.638
8	1	3	1	27300000.0	0.574	0.625
9	1	3	2	3900000.0	0.711	0.750
10	1	3	3	3900000.0	0.646	0.613
11	1	3	4	3900000.0	0.646	0.607

Code

pop_weights = poststrat["N"] / poststrat["N"].sum()
truth = np.average(poststrat["prob_yes"], weights=pop_weights)
raw_sample = fake["y"].mean()
poststrat_est = np.average(poststrat["pred_est"], weights=pop_weights)

pd.Series({
    "population truth": truth,
    "raw sample mean": raw_sample,
    "poststratified estimate": poststrat_est,
}).round(3)

population truth           0.593
raw sample mean            0.587
poststratified estimate    0.587
dtype: float64

The raw sample mean is biased toward the outcome distribution of the groups more likely to respond. Poststratification keeps the fitted conditional response model but swaps in the population cell frequencies.

Posterior uncertainty

Aggregate each posterior expected-prediction draw over the poststratification table.

Code

pred_sim = fit.epred(poststrat)
poststrat_sim = pred_sim @ pop_weights.to_numpy()

pd.Series({
    "mean": poststrat_sim.mean(),
    "sd": poststrat_sim.std(ddof=1),
    "2.5%": np.quantile(poststrat_sim, 0.025),
    "97.5%": np.quantile(poststrat_sim, 0.975),
}).round(3)

mean     0.587
sd       0.017
2.5%     0.553
97.5%    0.619
dtype: float64

# Poststratification with simulated census and poll data Source: `Poststrat/poststrat2.Rmd` This page follows the R example that creates a small synthetic population table, samples respondents with differential nonresponse, fits a response model to the survey, and then averages predictions over the full population table. ## Setup ```{python} import numpy as np import pandas as pd from scipy.special import expit from python.bayes_glm import bayes_logit rng = np.random.default_rng(1702) ``` ## Poststratification table The population is cross-classified by sex, age group, and ethnicity: `2 x 4 x 4 = 32` cells. The cell counts are artificial but intentionally imbalanced. ```{python} poststrat = pd.MultiIndex.from_product( [range(1, 3), range(1, 5), range(1, 5)], names=["sex", "age", "eth"], ).to_frame(index=False) p_sex = np.array([0.52, 0.48]) p_age = np.array([0.20, 0.25, 0.30, 0.25]) p_eth = np.array([0.70, 0.10, 0.10, 0.10]) poststrat["N"] = ( 250e6 * p_sex[poststrat["sex"].to_numpy() - 1] * p_age[poststrat["age"].to_numpy() - 1] * p_eth[poststrat["eth"].to_numpy() - 1] ) poststrat.head() ``` ## Differential response and sample composition Women, older people, and whites are assigned higher response probabilities. The sampled survey is therefore not distributed like the population. ```{python} p_response_baseline = 0.1 p_response_sex = np.array([1.0, 0.8]) p_response_age = np.array([1.0, 1.2, 1.6, 2.5]) p_response_eth = np.array([1.0, 0.8, 0.7, 0.6]) poststrat["p_response"] = ( p_response_baseline * p_response_sex[poststrat["sex"].to_numpy() - 1] * p_response_age[poststrat["age"].to_numpy() - 1] * p_response_eth[poststrat["eth"].to_numpy() - 1] ) n = 1000 sample_prob = poststrat["N"] * poststrat["p_response"] sample_prob = sample_prob / sample_prob.sum() people = rng.choice(poststrat.index.to_numpy(), size=n, replace=True, p=sample_prob) sample_check = poststrat.copy() sample_check["sample_share"] = np.bincount(people, minlength=len(poststrat)) / n sample_check["population_share"] = poststrat["N"] / poststrat["N"].sum() sample_check.head(10).round(3) ``` ## Simulate survey outcomes The true outcome model is a logistic regression with additive effects for the three poststratification variables. ```{python} coef_intercept = 0.6 coef_sex = np.array([0.0, -0.2]) coef_age = np.array([0.0, -0.2, -0.3, -0.4]) coef_eth = np.array([0.0, 0.6, 0.3, 0.3]) eta_cells = ( coef_intercept + coef_sex[poststrat["sex"].to_numpy() - 1] + coef_age[poststrat["age"].to_numpy() - 1] + coef_eth[poststrat["eth"].to_numpy() - 1] ) poststrat["prob_yes"] = expit(eta_cells) y = rng.binomial(1, poststrat.loc[people, "prob_yes"].to_numpy()) fake = poststrat.loc[people, ["sex", "age", "eth"]].reset_index(drop=True) fake["y"] = y fake.head() ``` ## Fit the survey model The R source uses `stan_glm(..., family=binomial)`. Here we fit the same additive logistic regression with weak Normal priors using the lightweight Bayesian helper. ```{python} fit = bayes_logit("y ~ C(sex) + C(age) + C(eth)", data=fake, draws=4000, prior_scale=2.5, seed=1702) fit.summary().round(3) ``` ## Prediction over all poststratification cells The fitted model is evaluated on every population cell. Comparing `prob_yes` and `pred_est` shows how much the finite survey has recovered about the true cell probabilities. ```{python} poststrat["pred_est"] = fit.epred(poststrat).mean(axis=0) poststrat[["sex", "age", "eth", "N", "prob_yes", "pred_est"]].head(12).round(3) ``` ```{python} pop_weights = poststrat["N"] / poststrat["N"].sum() truth = np.average(poststrat["prob_yes"], weights=pop_weights) raw_sample = fake["y"].mean() poststrat_est = np.average(poststrat["pred_est"], weights=pop_weights) pd.Series({ "population truth": truth, "raw sample mean": raw_sample, "poststratified estimate": poststrat_est, }).round(3) ``` The raw sample mean is biased toward the outcome distribution of the groups more likely to respond. Poststratification keeps the fitted conditional response model but swaps in the population cell frequencies. ## Posterior uncertainty Aggregate each posterior expected-prediction draw over the poststratification table. ```{python} pred_sim = fit.epred(poststrat) poststrat_sim = pred_sim @ pop_weights.to_numpy() pd.Series({ "mean": poststrat_sim.mean(), "sd": poststrat_sim.std(ddof=1), "2.5%": np.quantile(poststrat_sim, 0.025), "97.5%": np.quantile(poststrat_sim, 0.975), }).round(3) ```