Pollution and mortality data

Source: Pollution/pollution.Rmd

This page loads the McDonald and Schwing air-pollution and mortality data: 60 rows and 16 numeric columns, with mort (the mortality rate) as the outcome. The dataset is used later to illustrate regression diagnostics and coefficient stability.

Code
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
pollution = pd.read_csv(root / "Pollution/data/pollution.csv")
pollution.head()
   prec  jant  jult  ovr65  popn  educ  hous  dens  nonw  wwdrk  poor  hc  nox  so2  humid      mort
0    36    27    71    8.1  3.34  11.4  81.5  3243   8.8   42.6  11.7  21   15   59     59   921.870
1    35    23    72   11.1  3.14  11.0  78.8  4281   3.5   50.7  14.4   8   10   39     57   997.875
2    44    29    74   10.4  3.21   9.8  81.6  4260   0.8   39.4  12.4   6    6   33     54   962.354
3    47    45    79    6.5  3.41  11.1  77.5  3125  27.1   50.2  20.6  18    8   24     56   982.291
4    43    35    77    7.6  3.44   9.6  84.6  6441  24.4   43.7  14.3  43   38  206     55  1071.289
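
Before going further, it can help to confirm that the file loaded as expected. The following is a minimal sketch, not part of the original page: check_columns is a hypothetical helper, and the small sample frame is rebuilt by hand from the five rows printed above so that the snippet runs without the CSV. With the real file, pollution would be passed instead of sample.

```python
import pandas as pd

# Column names as printed by pollution.head() above.
EXPECTED = ["prec", "jant", "jult", "ovr65", "popn", "educ", "hous", "dens",
            "nonw", "wwdrk", "poor", "hc", "nox", "so2", "humid", "mort"]

# The five rows shown above, re-entered by hand (assumption: stand-in for the CSV).
rows = [
    [36, 27, 71, 8.1, 3.34, 11.4, 81.5, 3243, 8.8, 42.6, 11.7, 21, 15, 59, 59, 921.870],
    [35, 23, 72, 11.1, 3.14, 11.0, 78.8, 4281, 3.5, 50.7, 14.4, 8, 10, 39, 57, 997.875],
    [44, 29, 74, 10.4, 3.21, 9.8, 81.6, 4260, 0.8, 39.4, 12.4, 6, 6, 33, 54, 962.354],
    [47, 45, 79, 6.5, 3.41, 11.1, 77.5, 3125, 27.1, 50.2, 20.6, 18, 8, 24, 56, 982.291],
    [43, 35, 77, 7.6, 3.44, 9.6, 84.6, 6441, 24.4, 43.7, 14.3, 43, 38, 206, 55, 1071.289],
]
sample = pd.DataFrame(rows, columns=EXPECTED)


def check_columns(df, expected):
    """Hypothetical helper: summarize shape, missing columns, non-numeric columns, and nulls."""
    missing = [c for c in expected if c not in df.columns]
    non_numeric = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]
    n_null = int(df.isna().sum().sum())
    return {"shape": df.shape, "missing": missing,
            "non_numeric": non_numeric, "n_null": n_null}


report = check_columns(sample, EXPECTED)
print(report)
```

An empty missing list, an empty non_numeric list, and a null count of zero suggest the columns were parsed as intended; for the full file the shape should show 60 rows.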

Variable summaries

Code
pollution.describe().T
       count         mean          std       min       25%       50%         75%       max
prec    60.0    37.366667     9.984678    10.000    32.750    38.000    43.25000    60.000
jant    60.0    33.983333    10.168899    12.000    27.000    31.500    40.00000    67.000
jult    60.0    74.583333     4.763177    63.000    72.000    74.000    77.25000    85.000
ovr65   60.0     8.798333     1.464552     5.600     7.675     9.000     9.70000    11.800
popn    60.0     3.263167     0.135252     2.920     3.210     3.265     3.36000     3.530
educ    60.0    10.973333     0.845299     9.000    10.400    11.050    11.50000    12.300
hous    60.0    80.913333     5.141373    66.800    78.375    81.150    83.60000    90.700
dens    60.0  3876.050000  1454.102361  1441.000  3104.250  3567.000  4519.75000  9699.000
nonw    60.0    11.870000     8.921148     0.800     4.950    10.400    15.65000    38.500
wwdrk   60.0    46.081667     4.613043    33.800    43.250    45.500    49.52500    59.700
poor    60.0    14.373333     4.160096     9.400    12.000    13.200    15.15000    26.400
hc      60.0    37.850000    91.977673     1.000     7.000    14.500    30.25000   648.000
nox     60.0    22.650000    46.333290     1.000     4.000     9.000    23.75000   319.000
so2     60.0    53.766667    63.390468     1.000    11.000    30.000    69.00000   278.000
humid   60.0    57.666667     5.369931    38.000    55.000    57.000    60.00000    73.000
mort    60.0   940.358433    62.206278   790.733   898.372   943.683   983.20575  1113.156
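
In the summary above, the three pollutant columns hc, nox, and so2 have means well above their medians and maxima far above their upper quartiles, which points to strong right skew. A rough way to flag such columns is sketched below. It is an illustration only: flag_right_skewed is a hypothetical helper, the 1.5 cutoff is an arbitrary assumption, and the small summary frame copies a few mean and median values from the table so the snippet runs on its own.

```python
import pandas as pd

# Mean and median ("50%") for a few variables, copied from the summary table above.
summary = pd.DataFrame(
    {
        "mean": [37.85, 22.65, 53.766667, 3876.05, 940.358433],
        "50%": [14.5, 9.0, 30.0, 3567.0, 943.683],
    },
    index=["hc", "nox", "so2", "dens", "mort"],
)


def flag_right_skewed(summary, ratio=1.5):
    """Hypothetical helper: rows whose mean exceeds `ratio` times the median.

    The default cutoff of 1.5 is an arbitrary choice for illustration.
    """
    mean_over_median = summary["mean"] / summary["50%"]
    return mean_over_median[mean_over_median > ratio].index.tolist()


skewed = flag_right_skewed(summary)
print(skewed)
```

With the full data, flag_right_skewed(pollution.describe().T) applies the same rule to every column. Flagged variables may deserve a closer look, for example on a log scale, before they enter a regression.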

Pairwise associations with mortality

Code
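# Use the mortality column as the target; fall back to the first numeric column if none matches.
target_candidates = [c for c in pollution.columns if c.lower() in {"mort", "mortality", "mortality_rate"}]
target = target_candidates[0] if target_candidates else pollution.select_dtypes('number').columns[0]
# Pearson correlation of each numeric column with the target, sorted in ascending order.
corr = pollution.select_dtypes('number').corr()[target].sort_values()
corr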
educ    -0.510984
hous    -0.426819
wwdrk   -0.284802
hc      -0.177237
ovr65   -0.174602
humid   -0.088494
nox     -0.077379
jant    -0.030020
dens     0.265498
jult     0.277014
popn     0.357315
poor     0.410487
so2      0.425893
prec     0.509499
nonw     0.643747
mort     1.000000
Name: mort, dtype: float64
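
Because the listing above is sorted by signed value, strong negative associations such as educ sit at the opposite end from strong positive ones such as nonw. One way to rank predictors by strength regardless of sign is sketched below. It is not part of the original page: strongest_associations is a hypothetical helper, and the Series is copied from the output above so the snippet runs without the data file.

```python
import pandas as pd

# Signed correlations with mort, copied from the output above.
corr = pd.Series(
    {
        "educ": -0.510984, "hous": -0.426819, "wwdrk": -0.284802,
        "hc": -0.177237, "ovr65": -0.174602, "humid": -0.088494,
        "nox": -0.077379, "jant": -0.030020, "dens": 0.265498,
        "jult": 0.277014, "popn": 0.357315, "poor": 0.410487,
        "so2": 0.425893, "prec": 0.509499, "nonw": 0.643747,
        "mort": 1.000000,
    },
    name="mort",
)


def strongest_associations(corr, target, k=5):
    """Hypothetical helper: the k predictors with the largest absolute correlation.

    The target's correlation with itself is dropped; signs are kept in the result.
    """
    others = corr.drop(labels=[target], errors="ignore")
    order = others.abs().sort_values(ascending=False).index
    return others.reindex(order).head(k)


top = strongest_associations(corr, "mort")
print(top)
```

On these values the ranking is nonw, educ, prec, hous, so2, so two of the five strongest pairwise associations are negative. These are marginal correlations only and say nothing about how the coefficients behave once several predictors enter a regression together.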

Code
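num = pollution.select_dtypes('number')
# corr is sorted ascending, so its last five entries are the target itself
# plus the four most positively correlated predictors.
cols = [target] + [c for c in corr.index[-5:] if c != target]
# Scatter matrix of the selected columns, with histograms on the diagonal.
pd.plotting.scatter_matrix(num[cols], figsize=(8, 8), diagonal="hist")
plt.suptitle("Pollution-data scatter matrix", y=1.02)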
[Figure: Pollution-data scatter matrix of mort, poor, so2, prec, and nonw, with histograms on the diagonal]