Height and weight: prediction and indicators

Source: Earnings/height_and_weight.Rmd

This example predicts weight from height and then adds binary and categorical indicators.

Code

from pathlib import Path
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

root = Path("../../ROS-Examples")
earnings = pd.read_csv(root / "Earnings/data/earnings.csv")
earnings.head()

	height	weight	male	earn	earnk	ethnicity	education	mother_education	father_education	walk	exercise	smokenow	tense	angry	age
0	74	210.0	1	50000.0	50.0	White	16.0	16.0	16.0	3	3	2.0	0.0	0.0	45
1	66	125.0	0	60000.0	60.0	White	16.0	16.0	16.0	6	5	1.0	0.0	0.0	58
2	64	126.0	0	30000.0	30.0	White	16.0	16.0	16.0	8	1	2.0	1.0	1.0	29
3	65	200.0	0	25000.0	25.0	White	17.0	17.0	NaN	8	1	2.0	0.0	0.0	57
4	63	110.0	0	50000.0	50.0	Other	16.0	16.0	16.0	5	6	2.0	0.0	0.0	91

Predict weight from height

Code

fit_1 = smf.ols("weight ~ height", data=earnings).fit()
fit_1.params

Intercept   -173.286452
height         4.949380
dtype: float64

Code

height_new = 66
pred_mean = fit_1.predict(pd.DataFrame({"height": [height_new]})).iloc[0]
print(round(pred_mean, 1))

153.4

A predictive interval must include both coefficient uncertainty and residual variation. statsmodels exposes both via get_prediction.

Code

fit_1.get_prediction(pd.DataFrame({"height": [66]})).summary_frame(alpha=0.1)

	mean	mean_se	mean_ci_lower	mean_ci_upper	obs_ci_lower	obs_ci_upper
0	153.372642	0.692629	152.232777	154.512507	105.711817	201.033468

Center heights

Code

earnings["c_height"] = earnings["height"] - 66
fit_2 = smf.ols("weight ~ c_height", data=earnings).fit()
fit_2.params

Intercept    153.372642
c_height       4.949380
dtype: float64

The intercept is now the expected weight at 66 inches.

Code

fit_2.get_prediction(pd.DataFrame({"c_height": [4.0]})).summary_frame(alpha=0.1)

	mean	mean_se	mean_ci_lower	mean_ci_upper	obs_ci_lower	obs_ci_upper
0	173.170163	0.915626	171.663311	174.677015	125.499149	220.841177

Add a binary indicator

Code

fit_3 = smf.ols("weight ~ c_height + male", data=earnings).fit()
fit_3.params

Intercept    149.535382
c_height       3.887891
male          11.837092
dtype: float64

Code

fit_3.get_prediction(pd.DataFrame({"c_height": [4.0], "male": [0]})).summary_frame(alpha=0.1)

	mean	mean_se	mean_ci_lower	mean_ci_upper	obs_ci_lower	obs_ci_upper
0	165.086947	1.630966	162.402856	167.771039	117.817595	212.3563

Add ethnicity as a categorical predictor

Code

fit_4 = smf.ols("weight ~ c_height + male + C(ethnicity)", data=earnings).fit()
fit_4.params

Intercept                   154.328493
C(ethnicity)[T.Hispanic]     -6.152312
C(ethnicity)[T.Other]       -12.261355
C(ethnicity)[T.White]        -5.191391
c_height                      3.852135
male                         12.106320
dtype: float64

Choose the baseline by setting categorical order:

Code

earnings["eth"] = pd.Categorical(earnings["ethnicity"], categories=["White", "Black", "Hispanic", "Other"])
fit_5 = smf.ols("weight ~ c_height + male + C(eth)", data=earnings).fit()
fit_5.params

Intercept             149.137102
C(eth)[T.Black]         5.191391
C(eth)[T.Hispanic]     -0.960921
C(eth)[T.Other]        -7.069964
c_height                3.852135
male                   12.106320
dtype: float64

CmdStanPy analog

The Stan model is the same Gaussian regression used elsewhere; only the design matrix changes. This is a good example where patsy can build X, then CmdStanPy samples the regression posterior.

Code

# import patsy
# y, X = patsy.dmatrices("weight ~ c_height + male + C(eth)", earnings, return_type="dataframe")
# data = {"N": X.shape[0], "K": X.shape[1], "X": X.to_numpy(), "y": y.to_numpy().ravel()}