Pew survey recodes and weighted income profiles

Source: Pew/pew.Rmd

The source page is a collection of data-cleaning and visualization steps for raw Pew 2008 election survey data. This port focuses on the core workflow: read the Stata file, recreate the vote-intention and weighting variables, recode income/party/ideology/religious-attendance scales, and reproduce the weighted income profiles.

Setup and data

Code

from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

root = Path("/Users/alal/tmp/ros-python-book/ROS-Examples")
pew_pre = pd.read_stata(root / "Pew/data/pew_research_center_june_elect_wknd_data.dta", convert_categoricals=True)
n = len(pew_pre)
pew_pre[["date", "heat2", "heat4", "regicert", "state", "weight", "income", "party", "partyln", "ideo"]].head()

	date	heat2	heat4	regicert	state	weight	income	party	partyln	ideo
0	62708	dem/lean dem	NaN	absolutely certain	massachusetts	1.326923	$50,000-$74,999	democrat	NaN	liberal
1	80208	other-dk	NaN	absolutely certain	massachusetts	0.822000	$100,000-$149,999	republican	NaN	moderate
2	73108	dem/lean dem	NaN	absolutely certain	massachusetts	0.493000	dk/refused	independent	lean democrat	conservative
3	80208	dem/lean dem	NaN	absolutely certain	massachusetts	0.492000	$30,000-$39,999	independent	other/dk	liberal
4	62108	dem/lean dem	NaN	absolutely certain	massachusetts	2.000000	$75,000-$99,999	democrat	NaN	moderate

Code

which_question = np.where(pew_pre["heat2"].notna(), 2, np.where(pew_pre["heat4"].notna(), 4, 0))
pd.crosstab(pew_pre["date"], which_question).head()

col_0	0	2	4
date
61808	26	202	0
61908	96	326	0
62008	66	274	0
62108	94	318	0
62208	82	294	0

Vote intention and registration

The R source creates rvote = 1 for Republican/lean-Republican vote intention and rvote = 0 for Democratic/lean-Democratic vote intention, using whichever of heat2 or heat4 was asked.

Code

def norm_text(s):
    return s.astype("string").str.lower().str.strip()

heat2 = norm_text(pew_pre["heat2"])
heat4 = norm_text(pew_pre["heat4"])
rvote = pd.Series(np.nan, index=pew_pre.index, dtype="float")
rvote = rvote.mask((which_question == 2) & heat2.eq("rep/lean rep"), 1)
rvote = rvote.mask((which_question == 2) & heat2.eq("dem/lean dem"), 0)
rvote = rvote.mask((which_question == 4) & heat4.eq("rep/lean rep"), 1)
rvote = rvote.mask((which_question == 4) & heat4.eq("dem/lean dem"), 0)

registered = norm_text(pew_pre["regicert"]).eq("absolutely certain").fillna(False).astype(int)
pd.Series({"observations": n, "major-party vote intentions": rvote.notna().sum(), "certain registered": registered.sum()})

observations                   31201
major-party vote intentions    23770
certain registered             26189
dtype: int64

Dates, poll IDs, and weights

The poll identifier follows the nested date cutoffs in the R code. Weights are normalized within each poll wave; voter_weight is only defined for registered respondents with a major-party vote intention.

Code

date = pew_pre["date"].astype(int)
month = date // 10000
day = date // 100 - 100 * month
poll_id = np.select(
    [
        month == 6,
        (month == 7) & (day < 28),
        ((month == 7) & (day >= 28)) | (month == 8),
        (month == 9) & (day < 16),
        (month == 9) & (day >= 16),
        (month == 10) & (day < 14),
        (month == 10) & (day >= 14) & (day < 20),
        (month == 10) & (day >= 20) & (day < 27),
    ],
    [1, 2, 3, 4, 5, 6, 7, 8],
    default=9,
)

pop_weight0 = pew_pre["weight"].astype(float)
voter_weight0 = pop_weight0.where(rvote.notna() & registered.eq(1))

pop_weight = pop_weight0 / pop_weight0.groupby(poll_id).transform("mean")
voter_weight = voter_weight0 / voter_weight0.groupby(poll_id).transform("mean")

pd.DataFrame({"poll_id": poll_id, "pop_weight": pop_weight, "voter_weight": voter_weight}).groupby("poll_id").mean().round(3)

	pop_weight	voter_weight
poll_id
1	1.0	1.0
2	1.0	1.0
3	1.0	1.0
4	1.0	1.0
5	1.0	1.0
6	1.0	1.0
7	1.0	1.0
8	1.0	1.0
9	1.0	1.0

Income, party identification, and ideology recodes

The labels in the Stata file are already readable strings, so the Python recodes map those labels directly rather than relying on R factor codes.

Code

income_order = [
    "less than $10,000", "$10,000-$19,999", "$20,000-$29,999", "$30,000-$39,999",
    "$40,000-$49,999", "$50,000-$74,999", "$75,000-$99,999", "$100,000-$149,999", "$150,000 or more",
]
income_map = {label: i + 1 for i, label in enumerate(income_order)}
inc = norm_text(pew_pre["income"]).map({k.lower(): v for k, v in income_map.items()}).astype("float")
value_inc = np.array([5, 15, 25, 35, 45, 62.5, 87.5, 125, 200])
n_inc = int(np.nanmax(inc))

party = norm_text(pew_pre["party"])
partyln = norm_text(pew_pre["partyln"])
pid = pd.Series(3, index=pew_pre.index, dtype="int")
pid = pid.mask(party.eq("democrat"), 1)
pid = pid.mask(party.eq("republican"), 5)
pid = pid.mask(party.eq("independent") & partyln.eq("lean democrat"), 2)
pid = pid.mask(party.eq("independent") & partyln.eq("lean republican"), 4)
pid_label = ["Democrat", "Lean Dem.", "Independent", "Lean Rep.", "Republican"]

ideo = norm_text(pew_pre["ideo"])
ideology = pd.Series(3, index=pew_pre.index, dtype="int")
ideology = ideology.mask(ideo.eq("very liberal"), 1)
ideology = ideology.mask(ideo.eq("liberal"), 2)
ideology = ideology.mask(ideo.eq("conservative"), 4)
ideology = ideology.mask(ideo.eq("very conservative"), 5)
ideology_label = ["Very liberal", "Liberal", "Moderate", "Conservative", "Very conservative"]

pd.DataFrame({"income": inc, "pid": pid, "ideology": ideology}).describe().round(2)

	income	pid	ideology
count	21226.00	31201.00	31201.00
mean	5.00	2.87	3.26
std	2.26	1.65	0.97
min	1.00	1.00	1.00
25%	3.00	1.00	3.00
50%	6.00	3.00	3.00
75%	7.00	5.00	4.00
max	8.00	5.00	5.00

Weighted cumulative profiles by income

The base-R plots use cumulative weighted proportions to draw stacked bands. The helper below mirrors weighted.mean(indicator, weight, na.rm=TRUE).

Code

def weighted_mean_bool(mask, weights):
    ok = mask.notna() & weights.notna()
    if ok.sum() == 0 or weights[ok].sum() == 0:
        return np.nan
    return np.average(mask[ok].astype(float), weights=weights[ok])

def cumulative_profile(scores, n_scores, weights):
    out = np.empty((n_scores + 1, n_inc))
    out[-1, :] = 1.0
    for i in range(1, n_scores + 1):
        for j in range(1, n_inc + 1):
            out[i - 1, j - 1] = weighted_mean_bool((scores < i) & (inc == j), weights)
    return out

pid_profile = cumulative_profile(pid, 5, pop_weight0)
ideology_profile = cumulative_profile(ideology, 5, pop_weight0)

income_labels = income_order[:n_inc]
pd.DataFrame(pid_profile[:-1].T, columns=pid_label, index=income_labels).round(3).head()

	Lean Dem.	Independent	Lean Rep.	Republican
less than $10,000	0.030	0.042	0.054	0.058
$10,000-$19,999	0.037	0.050	0.065	0.073
$20,000-$29,999	0.042	0.057	0.071	0.081
$30,000-$39,999	0.038	0.052	0.066	0.075
$40,000-$49,999	0.000	0.000	0.000	0.000

Code

def draw_stacked_profile(profile, labels, title):
    x = np.arange(1, n_inc + 1)
    fig, ax = plt.subplots(figsize=(6, 4.5))
    grays = [str(0.35 + 0.09 * i) for i in range(len(labels))]
    for i, label in enumerate(labels):
        ax.fill_between(x, profile[i], profile[i + 1], color=grays[i], edgecolor="black", linewidth=0.6)
        center = n_inc // 2
        y = np.nanmean(profile[[i, i + 1], center])
        ax.text(x[center], y, label, ha="center", va="center", fontsize=8)
    ax.set(xlim=(1, n_inc), ylim=(0, 1), title=title)
    ax.set_xticks([1.5, 5, 8.5], ["Low income", "Middle income", "High income"])
    ax.set_yticks([0, 0.5, 1], ["0", "50%", "100%"])
    ax.spines[["top", "right"]].set_visible(False)
    return fig, ax

draw_stacked_profile(pid_profile, pid_label, "Self-declared party identification, by income")

(<Figure size 576x432 with 1 Axes>,
 <Axes: title={'center': 'Self-declared party identification, by income'}>)

Code

draw_stacked_profile(ideology_profile, ideology_label, "Self-declared political ideology, by income")

(<Figure size 576x432 with 1 Axes>,
 <Axes: title={'center': 'Self-declared political ideology, by income'}>)

Party-by-ideology income distributions

The final figure in the source groups respondents into three party categories and three ideology categories, then plots the weighted share of each combined category by income.

Code

pid2 = pd.Series(np.where(pid.eq(1), 1, np.where(pid.eq(5), 3, 2)), index=pew_pre.index)
ideology2 = pd.Series(np.where(ideology.isin([1, 2]), 1, np.where(ideology.eq(3), 2, 3)), index=pew_pre.index)
pid2_label = ["Democrats", "Independents", "Republicans"]
ideology2_label = ["Liberal", "Moderate", "Conservative"]

prop = np.empty((3, 3, n_inc))
for i in range(1, 4):
    for j in range(1, 4):
        for k in range(1, n_inc + 1):
            prop[i - 1, j - 1, k - 1] = weighted_mean_bool((pid2 == i) & (ideology2 == j) & (inc == k), pop_weight)

prop.round(4)[:, :, :3]

array([[[0.0087, 0.0102, 0.0117],
        [0.0107, 0.0135, 0.0174],
        [0.0109, 0.0124, 0.0128]],

       [[0.0077, 0.0083, 0.0091],
        [0.0128, 0.0188, 0.0192],
        [0.0098, 0.0104, 0.0119]],

       [[0.0012, 0.0011, 0.001 ],
        [0.0032, 0.0044, 0.0056],
        [0.0067, 0.011 , 0.0167]]])

Code

fig, axes = plt.subplots(3, 3, figsize=(8, 7), sharex=True, sharey=True)
x = np.arange(1, n_inc + 1)
ymax = np.nanmax(prop)
for i in range(3):
    for j in range(3):
        ax = axes[i, j]
        ax.plot(x, prop[i, j, :], color="black", lw=1)
        ax.set_title(f"{ideology2_label[j]} {pid2_label[i]}", fontsize=9)
        ax.set_ylim(0, ymax * 1.05)
        ax.set_xticks([2.7, 7.3], ["Low income", "High income"])
        ax.set_yticks([0, 0.01, 0.02], ["0", "1%", "2%"])
        ax.spines[["top", "right"]].set_visible(False)
fig.suptitle("Income distributions within self-reported political categories in 2008")
fig.tight_layout()

Religious attendance recode

The source also creates a five-level religious-attendance scale for later analyses. It is included here as a checked recode even though the displayed figures above do not use it.

Code

attend = norm_text(pew_pre["attend"])
relatt_map = {
    "never": 1,
    "seldom": 1,
    "a few times a year": 2,
    "once or twice a month": 3,
    "once a week": 4,
    "more than once a week": 5,
}
relatt = attend.map(relatt_map)
relatt_label = ["Nonattenders", "Rare attenders", "Occasional attenders", "Frequent attenders", "Very frequent church attenders"]
relatt.value_counts(dropna=False).sort_index()

attend
1.0    6618
2.0    4725
3.0    3994
4.0    7126
5.0    3431
NaN    5307
Name: count, dtype: int64

# Pew survey recodes and weighted income profiles Source: `Pew/pew.Rmd` The source page is a collection of data-cleaning and visualization steps for raw Pew 2008 election survey data. This port focuses on the core workflow: read the Stata file, recreate the vote-intention and weighting variables, recode income/party/ideology/religious-attendance scales, and reproduce the weighted income profiles. ## Setup and data ```{python} from pathlib import Path import numpy as np import pandas as pd import matplotlib.pyplot as plt root = Path("/Users/alal/tmp/ros-python-book/ROS-Examples") pew_pre = pd.read_stata(root / "Pew/data/pew_research_center_june_elect_wknd_data.dta", convert_categoricals=True) n = len(pew_pre) pew_pre[["date", "heat2", "heat4", "regicert", "state", "weight", "income", "party", "partyln", "ideo"]].head() ``` ```{python} which_question = np.where(pew_pre["heat2"].notna(), 2, np.where(pew_pre["heat4"].notna(), 4, 0)) pd.crosstab(pew_pre["date"], which_question).head() ``` ## Vote intention and registration The R source creates `rvote = 1` for Republican/lean-Republican vote intention and `rvote = 0` for Democratic/lean-Democratic vote intention, using whichever of `heat2` or `heat4` was asked. ```{python} def norm_text(s): return s.astype("string").str.lower().str.strip() heat2 = norm_text(pew_pre["heat2"]) heat4 = norm_text(pew_pre["heat4"]) rvote = pd.Series(np.nan, index=pew_pre.index, dtype="float") rvote = rvote.mask((which_question == 2) & heat2.eq("rep/lean rep"), 1) rvote = rvote.mask((which_question == 2) & heat2.eq("dem/lean dem"), 0) rvote = rvote.mask((which_question == 4) & heat4.eq("rep/lean rep"), 1) rvote = rvote.mask((which_question == 4) & heat4.eq("dem/lean dem"), 0) registered = norm_text(pew_pre["regicert"]).eq("absolutely certain").fillna(False).astype(int) pd.Series({"observations": n, "major-party vote intentions": rvote.notna().sum(), "certain registered": registered.sum()}) ``` ## Dates, poll IDs, and weights The poll identifier follows the nested date cutoffs in the R code. Weights are normalized within each poll wave; `voter_weight` is only defined for registered respondents with a major-party vote intention. ```{python} date = pew_pre["date"].astype(int) month = date // 10000 day = date // 100 - 100 * month poll_id = np.select( [ month == 6, (month == 7) & (day < 28), ((month == 7) & (day >= 28)) | (month == 8), (month == 9) & (day < 16), (month == 9) & (day >= 16), (month == 10) & (day < 14), (month == 10) & (day >= 14) & (day < 20), (month == 10) & (day >= 20) & (day < 27), ], [1, 2, 3, 4, 5, 6, 7, 8], default=9, ) pop_weight0 = pew_pre["weight"].astype(float) voter_weight0 = pop_weight0.where(rvote.notna() & registered.eq(1)) pop_weight = pop_weight0 / pop_weight0.groupby(poll_id).transform("mean") voter_weight = voter_weight0 / voter_weight0.groupby(poll_id).transform("mean") pd.DataFrame({"poll_id": poll_id, "pop_weight": pop_weight, "voter_weight": voter_weight}).groupby("poll_id").mean().round(3) ``` ## Income, party identification, and ideology recodes The labels in the Stata file are already readable strings, so the Python recodes map those labels directly rather than relying on R factor codes. ```{python} income_order = [ "less than $10,000", "$10,000-$19,999", "$20,000-$29,999", "$30,000-$39,999", "$40,000-$49,999", "$50,000-$74,999", "$75,000-$99,999", "$100,000-$149,999", "$150,000 or more", ] income_map = {label: i + 1 for i, label in enumerate(income_order)} inc = norm_text(pew_pre["income"]).map({k.lower(): v for k, v in income_map.items()}).astype("float") value_inc = np.array([5, 15, 25, 35, 45, 62.5, 87.5, 125, 200]) n_inc = int(np.nanmax(inc)) party = norm_text(pew_pre["party"]) partyln = norm_text(pew_pre["partyln"]) pid = pd.Series(3, index=pew_pre.index, dtype="int") pid = pid.mask(party.eq("democrat"), 1) pid = pid.mask(party.eq("republican"), 5) pid = pid.mask(party.eq("independent") & partyln.eq("lean democrat"), 2) pid = pid.mask(party.eq("independent") & partyln.eq("lean republican"), 4) pid_label = ["Democrat", "Lean Dem.", "Independent", "Lean Rep.", "Republican"] ideo = norm_text(pew_pre["ideo"]) ideology = pd.Series(3, index=pew_pre.index, dtype="int") ideology = ideology.mask(ideo.eq("very liberal"), 1) ideology = ideology.mask(ideo.eq("liberal"), 2) ideology = ideology.mask(ideo.eq("conservative"), 4) ideology = ideology.mask(ideo.eq("very conservative"), 5) ideology_label = ["Very liberal", "Liberal", "Moderate", "Conservative", "Very conservative"] pd.DataFrame({"income": inc, "pid": pid, "ideology": ideology}).describe().round(2) ``` ## Weighted cumulative profiles by income The base-R plots use cumulative weighted proportions to draw stacked bands. The helper below mirrors `weighted.mean(indicator, weight, na.rm=TRUE)`. ```{python} def weighted_mean_bool(mask, weights): ok = mask.notna() & weights.notna() if ok.sum() == 0 or weights[ok].sum() == 0: return np.nan return np.average(mask[ok].astype(float), weights=weights[ok]) def cumulative_profile(scores, n_scores, weights): out = np.empty((n_scores + 1, n_inc)) out[-1, :] = 1.0 for i in range(1, n_scores + 1): for j in range(1, n_inc + 1): out[i - 1, j - 1] = weighted_mean_bool((scores < i) & (inc == j), weights) return out pid_profile = cumulative_profile(pid, 5, pop_weight0) ideology_profile = cumulative_profile(ideology, 5, pop_weight0) income_labels = income_order[:n_inc] pd.DataFrame(pid_profile[:-1].T, columns=pid_label, index=income_labels).round(3).head() ``` ```{python} def draw_stacked_profile(profile, labels, title): x = np.arange(1, n_inc + 1) fig, ax = plt.subplots(figsize=(6, 4.5)) grays = [str(0.35 + 0.09 * i) for i in range(len(labels))] for i, label in enumerate(labels): ax.fill_between(x, profile[i], profile[i + 1], color=grays[i], edgecolor="black", linewidth=0.6) center = n_inc // 2 y = np.nanmean(profile[[i, i + 1], center]) ax.text(x[center], y, label, ha="center", va="center", fontsize=8) ax.set(xlim=(1, n_inc), ylim=(0, 1), title=title) ax.set_xticks([1.5, 5, 8.5], ["Low income", "Middle income", "High income"]) ax.set_yticks([0, 0.5, 1], ["0", "50%", "100%"]) ax.spines[["top", "right"]].set_visible(False) return fig, ax draw_stacked_profile(pid_profile, pid_label, "Self-declared party identification, by income") ``` ```{python} draw_stacked_profile(ideology_profile, ideology_label, "Self-declared political ideology, by income") ``` ## Party-by-ideology income distributions The final figure in the source groups respondents into three party categories and three ideology categories, then plots the weighted share of each combined category by income. ```{python} pid2 = pd.Series(np.where(pid.eq(1), 1, np.where(pid.eq(5), 3, 2)), index=pew_pre.index) ideology2 = pd.Series(np.where(ideology.isin([1, 2]), 1, np.where(ideology.eq(3), 2, 3)), index=pew_pre.index) pid2_label = ["Democrats", "Independents", "Republicans"] ideology2_label = ["Liberal", "Moderate", "Conservative"] prop = np.empty((3, 3, n_inc)) for i in range(1, 4): for j in range(1, 4): for k in range(1, n_inc + 1): prop[i - 1, j - 1, k - 1] = weighted_mean_bool((pid2 == i) & (ideology2 == j) & (inc == k), pop_weight) prop.round(4)[:, :, :3] ``` ```{python} fig, axes = plt.subplots(3, 3, figsize=(8, 7), sharex=True, sharey=True) x = np.arange(1, n_inc + 1) ymax = np.nanmax(prop) for i in range(3): for j in range(3): ax = axes[i, j] ax.plot(x, prop[i, j, :], color="black", lw=1) ax.set_title(f"{ideology2_label[j]} {pid2_label[i]}", fontsize=9) ax.set_ylim(0, ymax * 1.05) ax.set_xticks([2.7, 7.3], ["Low income", "High income"]) ax.set_yticks([0, 0.01, 0.02], ["0", "1%", "2%"]) ax.spines[["top", "right"]].set_visible(False) fig.suptitle("Income distributions within self-reported political categories in 2008") fig.tight_layout() ``` ## Religious attendance recode The source also creates a five-level religious-attendance scale for later analyses. It is included here as a checked recode even though the displayed figures above do not use it. ```{python} attend = norm_text(pew_pre["attend"]) relatt_map = { "never": 1, "seldom": 1, "a few times a year": 2, "once or twice a month": 3, "once a week": 4, "more than once a week": 5, } relatt = attend.map(relatt_map) relatt_label = ["Nonattenders", "Rare attenders", "Occasional attenders", "Frequent attenders", "Very frequent church attenders"] relatt.value_counts(dropna=False).sort_index() ```