This example predicts weight from height and then adds binary and categorical indicators.
Code
from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smfimport matplotlib.pyplot as pltroot = Path("../../ROS-Examples")earnings = pd.read_csv(root /"Earnings/data/earnings.csv")earnings.head()
The Stan model is the same Gaussian regression used elsewhere; only the design matrix changes. This is a good example where patsy can build X, then CmdStanPy samples the regression posterior.
Code
# import patsy# y, X = patsy.dmatrices("weight ~ c_height + male + C(eth)", earnings, return_type="dataframe")# data = {"N": X.shape[0], "K": X.shape[1], "X": X.to_numpy(), "y": y.to_numpy().ravel()}
# Height and weight: prediction and indicatorsSource: `Earnings/height_and_weight.Rmd`This example predicts weight from height and then adds binary and categorical indicators.```{python}from pathlib import Pathimport numpy as npimport pandas as pdimport statsmodels.formula.api as smfimport matplotlib.pyplot as pltroot = Path("../../ROS-Examples")earnings = pd.read_csv(root /"Earnings/data/earnings.csv")earnings.head()```## Predict weight from height```{python}fit_1 = smf.ols("weight ~ height", data=earnings).fit()fit_1.params``````{python}height_new =66pred_mean = fit_1.predict(pd.DataFrame({"height": [height_new]})).iloc[0]print(round(pred_mean, 1))```A predictive interval must include both coefficient uncertainty and residual variation. `statsmodels` exposes both via `get_prediction`.```{python}fit_1.get_prediction(pd.DataFrame({"height": [66]})).summary_frame(alpha=0.1)```## Center heights```{python}earnings["c_height"] = earnings["height"] -66fit_2 = smf.ols("weight ~ c_height", data=earnings).fit()fit_2.params```The intercept is now the expected weight at 66 inches.```{python}fit_2.get_prediction(pd.DataFrame({"c_height": [4.0]})).summary_frame(alpha=0.1)```## Add a binary indicator```{python}fit_3 = smf.ols("weight ~ c_height + male", data=earnings).fit()fit_3.params``````{python}fit_3.get_prediction(pd.DataFrame({"c_height": [4.0], "male": [0]})).summary_frame(alpha=0.1)```## Add ethnicity as a categorical predictor```{python}fit_4 = smf.ols("weight ~ c_height + male + C(ethnicity)", data=earnings).fit()fit_4.params```Choose the baseline by setting categorical order:```{python}earnings["eth"] = pd.Categorical(earnings["ethnicity"], categories=["White", "Black", "Hispanic", "Other"])fit_5 = smf.ols("weight ~ c_height + male + C(eth)", data=earnings).fit()fit_5.params```## CmdStanPy analogThe Stan model is the same Gaussian regression used elsewhere; only the design matrix changes. This is a good example where `patsy` can build `X`, then CmdStanPy samples the regression posterior.```{python}# import patsy# y, X = patsy.dmatrices("weight ~ c_height + male + C(eth)", earnings, return_type="dataframe")# data = {"N": X.shape[0], "K": X.shape[1], "X": X.to_numpy(), "y": y.to_numpy().ravel()}```