Code
import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit
rng = np.random.default_rng(1702)Source: Poststrat/poststrat2.Rmd
This page follows the R example that creates a small synthetic population table, samples respondents with differential nonresponse, fits a response model to the survey, and then averages predictions over the full population table.
import numpy as np
import pandas as pd
from scipy.special import expit
from python.bayes_glm import bayes_logit
rng = np.random.default_rng(1702)The population is cross-classified by sex, age group, and ethnicity: 2 x 4 x 4 = 32 cells. The cell counts are artificial but intentionally imbalanced.
poststrat = pd.MultiIndex.from_product(
[range(1, 3), range(1, 5), range(1, 5)],
names=["sex", "age", "eth"],
).to_frame(index=False)
p_sex = np.array([0.52, 0.48])
p_age = np.array([0.20, 0.25, 0.30, 0.25])
p_eth = np.array([0.70, 0.10, 0.10, 0.10])
poststrat["N"] = (
250e6
* p_sex[poststrat["sex"].to_numpy() - 1]
* p_age[poststrat["age"].to_numpy() - 1]
* p_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat.head()| sex | age | eth | N | |
|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 18200000.0 |
| 1 | 1 | 1 | 2 | 2600000.0 |
| 2 | 1 | 1 | 3 | 2600000.0 |
| 3 | 1 | 1 | 4 | 2600000.0 |
| 4 | 1 | 2 | 1 | 22750000.0 |
Women, older people, and whites are assigned higher response probabilities. The sampled survey is therefore not distributed like the population.
p_response_baseline = 0.1
p_response_sex = np.array([1.0, 0.8])
p_response_age = np.array([1.0, 1.2, 1.6, 2.5])
p_response_eth = np.array([1.0, 0.8, 0.7, 0.6])
poststrat["p_response"] = (
p_response_baseline
* p_response_sex[poststrat["sex"].to_numpy() - 1]
* p_response_age[poststrat["age"].to_numpy() - 1]
* p_response_eth[poststrat["eth"].to_numpy() - 1]
)
n = 1000
sample_prob = poststrat["N"] * poststrat["p_response"]
sample_prob = sample_prob / sample_prob.sum()
people = rng.choice(poststrat.index.to_numpy(), size=n, replace=True, p=sample_prob)
sample_check = poststrat.copy()
sample_check["sample_share"] = np.bincount(people, minlength=len(poststrat)) / n
sample_check["population_share"] = poststrat["N"] / poststrat["N"].sum()
sample_check.head(10).round(3)| sex | age | eth | N | p_response | sample_share | population_share | |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 18200000.0 | 0.100 | 0.050 | 0.073 |
| 1 | 1 | 1 | 2 | 2600000.0 | 0.080 | 0.011 | 0.010 |
| 2 | 1 | 1 | 3 | 2600000.0 | 0.070 | 0.003 | 0.010 |
| 3 | 1 | 1 | 4 | 2600000.0 | 0.060 | 0.007 | 0.010 |
| 4 | 1 | 2 | 1 | 22750000.0 | 0.120 | 0.075 | 0.091 |
| 5 | 1 | 2 | 2 | 3250000.0 | 0.096 | 0.014 | 0.013 |
| 6 | 1 | 2 | 3 | 3250000.0 | 0.084 | 0.008 | 0.013 |
| 7 | 1 | 2 | 4 | 3250000.0 | 0.072 | 0.009 | 0.013 |
| 8 | 1 | 3 | 1 | 27300000.0 | 0.160 | 0.141 | 0.109 |
| 9 | 1 | 3 | 2 | 3900000.0 | 0.128 | 0.020 | 0.016 |
The true outcome model is a logistic regression with additive effects for the three poststratification variables.
coef_intercept = 0.6
coef_sex = np.array([0.0, -0.2])
coef_age = np.array([0.0, -0.2, -0.3, -0.4])
coef_eth = np.array([0.0, 0.6, 0.3, 0.3])
eta_cells = (
coef_intercept
+ coef_sex[poststrat["sex"].to_numpy() - 1]
+ coef_age[poststrat["age"].to_numpy() - 1]
+ coef_eth[poststrat["eth"].to_numpy() - 1]
)
poststrat["prob_yes"] = expit(eta_cells)
y = rng.binomial(1, poststrat.loc[people, "prob_yes"].to_numpy())
fake = poststrat.loc[people, ["sex", "age", "eth"]].reset_index(drop=True)
fake["y"] = y
fake.head()| sex | age | eth | y | |
|---|---|---|---|---|
| 0 | 2 | 2 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 2 | 1 | 0 |
| 3 | 1 | 2 | 3 | 1 |
| 4 | 1 | 2 | 4 | 1 |
The R source uses stan_glm(..., family=binomial). Here we fit the same additive logistic regression with weak Normal priors using the lightweight Bayesian helper.
fit = bayes_logit("y ~ C(sex) + C(age) + C(eth)", data=fake, draws=4000, prior_scale=2.5, seed=1702)
fit.summary().round(3)| mean | sd | q5 | q50 | q95 | |
|---|---|---|---|---|---|
| Intercept | 0.773 | 0.206 | 0.433 | 0.774 | 1.111 |
| C(sex)[T.2] | -0.533 | 0.131 | -0.751 | -0.531 | -0.317 |
| C(age)[T.2] | -0.129 | 0.240 | -0.519 | -0.123 | 0.257 |
| C(age)[T.3] | -0.262 | 0.229 | -0.646 | -0.261 | 0.106 |
| C(age)[T.4] | -0.354 | 0.221 | -0.712 | -0.353 | 0.011 |
| C(eth)[T.2] | 0.600 | 0.229 | 0.222 | 0.606 | 0.980 |
| C(eth)[T.3] | -0.045 | 0.229 | -0.427 | -0.044 | 0.331 |
| C(eth)[T.4] | -0.067 | 0.271 | -0.508 | -0.068 | 0.383 |
The fitted model is evaluated on every population cell. Comparing prob_yes and pred_est shows how much the finite survey has recovered about the true cell probabilities.
poststrat["pred_est"] = fit.epred(poststrat).mean(axis=0)
poststrat[["sex", "age", "eth", "N", "prob_yes", "pred_est"]].head(12).round(3)| sex | age | eth | N | prob_yes | pred_est | |
|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 18200000.0 | 0.646 | 0.683 |
| 1 | 1 | 1 | 2 | 2600000.0 | 0.769 | 0.794 |
| 2 | 1 | 1 | 3 | 2600000.0 | 0.711 | 0.671 |
| 3 | 1 | 1 | 4 | 2600000.0 | 0.711 | 0.666 |
| 4 | 1 | 2 | 1 | 22750000.0 | 0.599 | 0.655 |
| 5 | 1 | 2 | 2 | 3250000.0 | 0.731 | 0.773 |
| 6 | 1 | 2 | 3 | 3250000.0 | 0.668 | 0.643 |
| 7 | 1 | 2 | 4 | 3250000.0 | 0.668 | 0.638 |
| 8 | 1 | 3 | 1 | 27300000.0 | 0.574 | 0.625 |
| 9 | 1 | 3 | 2 | 3900000.0 | 0.711 | 0.750 |
| 10 | 1 | 3 | 3 | 3900000.0 | 0.646 | 0.613 |
| 11 | 1 | 3 | 4 | 3900000.0 | 0.646 | 0.607 |
pop_weights = poststrat["N"] / poststrat["N"].sum()
truth = np.average(poststrat["prob_yes"], weights=pop_weights)
raw_sample = fake["y"].mean()
poststrat_est = np.average(poststrat["pred_est"], weights=pop_weights)
pd.Series({
"population truth": truth,
"raw sample mean": raw_sample,
"poststratified estimate": poststrat_est,
}).round(3)population truth 0.593
raw sample mean 0.587
poststratified estimate 0.587
dtype: float64
The raw sample mean is biased toward the outcome distribution of the groups more likely to respond. Poststratification keeps the fitted conditional response model but swaps in the population cell frequencies.
Aggregate each posterior expected-prediction draw over the poststratification table.
pred_sim = fit.epred(poststrat)
poststrat_sim = pred_sim @ pop_weights.to_numpy()
pd.Series({
"mean": poststrat_sim.mean(),
"sd": poststrat_sim.std(ddof=1),
"2.5%": np.quantile(poststrat_sim, 0.025),
"97.5%": np.quantile(poststrat_sim, 0.975),
}).round(3)mean 0.587
sd 0.017
2.5% 0.553
97.5% 0.619
dtype: float64