Simple least-squares regression

Source: Simplest/simplest_lm.Rmd

This page ports the least-squares version of the ROS “simplest” example. The point is pedagogical: fitting a line, estimating one mean, and estimating a difference in means are all ordinary linear regressions with different design matrices.

Setup

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2141)

A regression line for simulated data

The R code uses x = 1:20, intercept 0.2, slope 0.3, and residual standard deviation 0.5. NumPy’s random stream is not identical to R’s, but the data-generating setup is the same.

Code

x = np.arange(1, 21)
a = 0.2
b = 0.3
sigma = 0.5
y = a + b * x + sigma * rng.normal(size=len(x))
fake = pd.DataFrame({"x": x, "y": y})

fit = sm.OLS(fake["y"], sm.add_constant(fake["x"])).fit()
fit.summary()

OLS Regression Results
Dep. Variable:	y	R-squared:	0.920
Model:	OLS	Adj. R-squared:	0.916
Method:	Least Squares	F-statistic:	207.0
Date:	Sun, 31 May 2026	Prob (F-statistic):	2.58e-11
Time:	23:05:42	Log-Likelihood:	-14.916
No. Observations:	20	AIC:	33.83
Df Residuals:	18	BIC:	35.82
Df Model:	1
Covariance Type:	nonrobust

	coef	std err	t	P>\|t\|	[0.025	0.975]
const	0.2513	0.250	1.006	0.328	-0.273	0.776
x	0.3000	0.021	14.388	0.000	0.256	0.344

Omnibus:	0.393	Durbin-Watson:	1.430
Prob(Omnibus):	0.822	Jarque-Bera (JB):	0.528
Skew:	0.230	Prob(JB):	0.768
Kurtosis:	2.350	Cond. No.	25.0

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Code

a_hat, b_hat = fit.params["const"], fit.params["x"]
x_bar = fake["x"].mean()

fig, ax = plt.subplots(figsize=(6, 4.4))
ax.scatter(fake["x"], fake["y"], color="black", s=25)
ax.plot(fake["x"], a_hat + b_hat * fake["x"], color="black")
ax.text(x_bar, a_hat + b_hat * x_bar, f"   y = {a_hat:.2f} + {b_hat:.2f} x", va="center")
ax.set(title="Data and fitted regression line", xlabel="x", ylabel="y")
ax.spines[["top", "right"]].set_visible(False)

Estimating a mean as a regression

A regression on only a constant estimates the sample mean. Its standard error is the same as the usual standard error of the mean.

Code

rng = np.random.default_rng(2141)
n_0 = 20
y_0 = np.round(rng.normal(2, 5, size=n_0), 1)

mean_direct = y_0.mean()
se_direct = y_0.std(ddof=1) / np.sqrt(n_0)
fit_mean = sm.OLS(y_0, np.ones((n_0, 1))).fit()

pd.DataFrame(
    {
        "direct_mean": [mean_direct],
        "regression_intercept": [fit_mean.params[0]],
        "direct_se": [se_direct],
        "regression_se": [fit_mean.bse[0]],
    }
)

	direct_mean	regression_intercept	direct_se	regression_se
0	2.515	2.515	1.171465	1.171465

Estimating a difference as a regression

Now simulate a second group. The slope on an indicator variable is exactly the difference between the two group means.

Code

rng = np.random.default_rng(2141)
n_1 = 30
y_1 = rng.normal(8, 5, size=n_1)

diff = y_1.mean() - y_0.mean()
se_0 = y_0.std(ddof=1) / np.sqrt(n_0)
se_1 = y_1.std(ddof=1) / np.sqrt(n_1)
se_diff = np.sqrt(se_0**2 + se_1**2)

y_all = np.r_[y_0, y_1]
group = np.r_[np.zeros(n_0), np.ones(n_1)]
fit_diff = sm.OLS(y_all, sm.add_constant(group)).fit()

pd.DataFrame(
    {
        "difference_in_means": [diff],
        "regression_slope": [fit_diff.params[1]],
        "two_sample_se": [se_diff],
        "ols_slope_se": [fit_diff.bse[1]],
    }
)

	difference_in_means	regression_slope	two_sample_se	ols_slope_se
0	5.767634	5.767634	1.492224	1.481843

The standard errors differ slightly because the classical OLS regression uses a pooled residual variance, whereas the direct calculation above allows different variances in the two groups. The coefficient identity itself is exact.

Code

fig, ax = plt.subplots(figsize=(5.6, 4.5))
ax.scatter(group, y_all, color="black", s=18, alpha=0.75)
ax.set_xticks([0, 1])
ax.set(xlabel="Indicator, x", ylabel="y", title="Regression on an indicator equals a difference in means")
ax.axhline(y_0.mean(), color="0.5", ls="--", lw=1)
ax.axhline(y_1.mean(), color="0.5", ls="--", lw=1)
xx = np.array([0, 1])
ax.plot(xx, fit_diff.params[0] + fit_diff.params[1] * xx, color="black")
ax.text(0.05, y_0.mean() - 1, rf"$\bar y_0$ = {y_0.mean():.2f}", color="0.25")
ax.text(0.95, y_1.mean() + 1, rf"$\bar y_1$ = {y_1.mean():.2f}", color="0.25", ha="right")
ax.text(0.4, fit_diff.params[0] + 0.5 * fit_diff.params[1] - 1, f"y = {fit_diff.params[0]:.2f} + {fit_diff.params[1]:.2f} x")
ax.spines[["top", "right"]].set_visible(False)

# Simple least-squares regression Source: `Simplest/simplest_lm.Rmd` This page ports the least-squares version of the ROS "simplest" example. The point is pedagogical: fitting a line, estimating one mean, and estimating a difference in means are all ordinary linear regressions with different design matrices. ## Setup ```{python} import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.api as sm rng = np.random.default_rng(2141) ``` ## A regression line for simulated data The R code uses `x = 1:20`, intercept `0.2`, slope `0.3`, and residual standard deviation `0.5`. NumPy's random stream is not identical to R's, but the data-generating setup is the same. ```{python} x = np.arange(1, 21) a = 0.2 b = 0.3 sigma = 0.5 y = a + b * x + sigma * rng.normal(size=len(x)) fake = pd.DataFrame({"x": x, "y": y}) fit = sm.OLS(fake["y"], sm.add_constant(fake["x"])).fit() fit.summary() ``` ```{python} a_hat, b_hat = fit.params["const"], fit.params["x"] x_bar = fake["x"].mean() fig, ax = plt.subplots(figsize=(6, 4.4)) ax.scatter(fake["x"], fake["y"], color="black", s=25) ax.plot(fake["x"], a_hat + b_hat * fake["x"], color="black") ax.text(x_bar, a_hat + b_hat * x_bar, f" y = {a_hat:.2f} + {b_hat:.2f} x", va="center") ax.set(title="Data and fitted regression line", xlabel="x", ylabel="y") ax.spines[["top", "right"]].set_visible(False) ``` ## Estimating a mean as a regression A regression on only a constant estimates the sample mean. Its standard error is the same as the usual standard error of the mean. ```{python} rng = np.random.default_rng(2141) n_0 = 20 y_0 = np.round(rng.normal(2, 5, size=n_0), 1) mean_direct = y_0.mean() se_direct = y_0.std(ddof=1) / np.sqrt(n_0) fit_mean = sm.OLS(y_0, np.ones((n_0, 1))).fit() pd.DataFrame( { "direct_mean": [mean_direct], "regression_intercept": [fit_mean.params[0]], "direct_se": [se_direct], "regression_se": [fit_mean.bse[0]], } ) ``` ## Estimating a difference as a regression Now simulate a second group. The slope on an indicator variable is exactly the difference between the two group means. ```{python} rng = np.random.default_rng(2141) n_1 = 30 y_1 = rng.normal(8, 5, size=n_1) diff = y_1.mean() - y_0.mean() se_0 = y_0.std(ddof=1) / np.sqrt(n_0) se_1 = y_1.std(ddof=1) / np.sqrt(n_1) se_diff = np.sqrt(se_0**2 + se_1**2) y_all = np.r_[y_0, y_1] group = np.r_[np.zeros(n_0), np.ones(n_1)] fit_diff = sm.OLS(y_all, sm.add_constant(group)).fit() pd.DataFrame( { "difference_in_means": [diff], "regression_slope": [fit_diff.params[1]], "two_sample_se": [se_diff], "ols_slope_se": [fit_diff.bse[1]], } ) ``` The standard errors differ slightly because the classical OLS regression uses a pooled residual variance, whereas the direct calculation above allows different variances in the two groups. The coefficient identity itself is exact. ```{python} fig, ax = plt.subplots(figsize=(5.6, 4.5)) ax.scatter(group, y_all, color="black", s=18, alpha=0.75) ax.set_xticks([0, 1]) ax.set(xlabel="Indicator, x", ylabel="y", title="Regression on an indicator equals a difference in means") ax.axhline(y_0.mean(), color="0.5", ls="--", lw=1) ax.axhline(y_1.mean(), color="0.5", ls="--", lw=1) xx = np.array([0, 1]) ax.plot(xx, fit_diff.params[0] + fit_diff.params[1] * xx, color="black") ax.text(0.05, y_0.mean() - 1, rf"$\bar y_0$ = {y_0.mean():.2f}", color="0.25") ax.text(0.95, y_1.mean() + 1, rf"$\bar y_1$ = {y_1.mean():.2f}", color="0.25", ha="right") ax.text(0.4, fit_diff.params[0] + 0.5 * fit_diff.params[1] - 1, f"y = {fit_diff.params[0]:.2f} + {fit_diff.params[1]:.2f} x") ax.spines[["top", "right"]].set_visible(False) ```