Height and weight: prediction and indicators

Source: Earnings/height_and_weight.Rmd

This example predicts weight from height and then adds binary and categorical indicators.

Code
from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
earnings.head()
height weight male earn earnk ethnicity education mother_education father_education walk exercise smokenow tense angry age
0 74 210.0 1 50000.0 50.0 White 16.0 16.0 16.0 3 3 2.0 0.0 0.0 45
1 66 125.0 0 60000.0 60.0 White 16.0 16.0 16.0 6 5 1.0 0.0 0.0 58
2 64 126.0 0 30000.0 30.0 White 16.0 16.0 16.0 8 1 2.0 1.0 1.0 29
3 65 200.0 0 25000.0 25.0 White 17.0 17.0 NaN 8 1 2.0 0.0 0.0 57
4 63 110.0 0 50000.0 50.0 Other 16.0 16.0 16.0 5 6 2.0 0.0 0.0 91

Predict weight from height

Code
fit_1 = smf.ols("weight ~ height", data=earnings).fit()
fit_1.params
Intercept   -173.286452
height         4.949380
dtype: float64
Code
height_new = 66
pred_mean = fit_1.predict(pd.DataFrame({"height": [height_new]})).iloc[0]
print(round(pred_mean, 1))
153.4

A predictive interval must include both coefficient uncertainty and residual variation. statsmodels exposes both via get_prediction.

Code
fit_1.get_prediction(pd.DataFrame({"height": [66]})).summary_frame(alpha=0.1)
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 153.372642 0.692629 152.232777 154.512507 105.711817 201.033468

Center heights

Code
earnings["c_height"] = earnings["height"] - 66
fit_2 = smf.ols("weight ~ c_height", data=earnings).fit()
fit_2.params
Intercept    153.372642
c_height       4.949380
dtype: float64

The intercept is now the expected weight at 66 inches.

Code
fit_2.get_prediction(pd.DataFrame({"c_height": [4.0]})).summary_frame(alpha=0.1)
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 173.170163 0.915626 171.663311 174.677015 125.499149 220.841177

Add a binary indicator

Code
fit_3 = smf.ols("weight ~ c_height + male", data=earnings).fit()
fit_3.params
Intercept    149.535382
c_height       3.887891
male          11.837092
dtype: float64
Code
fit_3.get_prediction(pd.DataFrame({"c_height": [4.0], "male": [0]})).summary_frame(alpha=0.1)
mean mean_se mean_ci_lower mean_ci_upper obs_ci_lower obs_ci_upper
0 165.086947 1.630966 162.402856 167.771039 117.817595 212.3563

Add ethnicity as a categorical predictor

Code
fit_4 = smf.ols("weight ~ c_height + male + C(ethnicity)", data=earnings).fit()
fit_4.params
Intercept                   154.328493
C(ethnicity)[T.Hispanic]     -6.152312
C(ethnicity)[T.Other]       -12.261355
C(ethnicity)[T.White]        -5.191391
c_height                      3.852135
male                         12.106320
dtype: float64

Choose the baseline by setting categorical order:

Code
earnings["eth"] = pd.Categorical(earnings["ethnicity"], categories=["White", "Black", "Hispanic", "Other"])
fit_5 = smf.ols("weight ~ c_height + male + C(eth)", data=earnings).fit()
fit_5.params
Intercept             149.137102
C(eth)[T.Black]         5.191391
C(eth)[T.Hispanic]     -0.960921
C(eth)[T.Other]        -7.069964
c_height                3.852135
male                   12.106320
dtype: float64

CmdStanPy analog

The Stan model is the same Gaussian regression used elsewhere; only the design matrix changes. This is a good example where patsy can build X, then CmdStanPy samples the regression posterior.

Code
# import patsy
# y, X = patsy.dmatrices("weight ~ c_height + male + C(eth)", earnings, return_type="dataframe")
# data = {"N": X.shape[0], "K": X.shape[1], "X": X.to_numpy(), "y": y.to_numpy().ravel()}