Simple least-squares regression

Source: Simplest/simplest_lm.Rmd

This page ports the least-squares version of the ROS “simplest” example. The point is pedagogical: fitting a line, estimating one mean, and estimating a difference in means are all ordinary linear regressions with different design matrices.

Setup

Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

rng = np.random.default_rng(2141)

A regression line for simulated data

The R code uses x = 1:20, intercept 0.2, slope 0.3, and residual standard deviation 0.5. NumPy’s random stream is not identical to R’s, but the data-generating setup is the same.

Code
x = np.arange(1, 21)
a = 0.2
b = 0.3
sigma = 0.5
y = a + b * x + sigma * rng.normal(size=len(x))
fake = pd.DataFrame({"x": x, "y": y})

fit = sm.OLS(fake["y"], sm.add_constant(fake["x"])).fit()
fit.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.920
Model: OLS Adj. R-squared: 0.916
Method: Least Squares F-statistic: 207.0
Date: Sun, 31 May 2026 Prob (F-statistic): 2.58e-11
Time: 23:05:42 Log-Likelihood: -14.916
No. Observations: 20 AIC: 33.83
Df Residuals: 18 BIC: 35.82
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 0.2513 0.250 1.006 0.328 -0.273 0.776
x 0.3000 0.021 14.388 0.000 0.256 0.344
Omnibus: 0.393 Durbin-Watson: 1.430
Prob(Omnibus): 0.822 Jarque-Bera (JB): 0.528
Skew: 0.230 Prob(JB): 0.768
Kurtosis: 2.350 Cond. No. 25.0


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Code
a_hat, b_hat = fit.params["const"], fit.params["x"]
x_bar = fake["x"].mean()

fig, ax = plt.subplots(figsize=(6, 4.4))
ax.scatter(fake["x"], fake["y"], color="black", s=25)
ax.plot(fake["x"], a_hat + b_hat * fake["x"], color="black")
ax.text(x_bar, a_hat + b_hat * x_bar, f"   y = {a_hat:.2f} + {b_hat:.2f} x", va="center")
ax.set(title="Data and fitted regression line", xlabel="x", ylabel="y")
ax.spines[["top", "right"]].set_visible(False)

Estimating a mean as a regression

A regression on only a constant estimates the sample mean. Its standard error is the same as the usual standard error of the mean.

Code
rng = np.random.default_rng(2141)
n_0 = 20
y_0 = np.round(rng.normal(2, 5, size=n_0), 1)

mean_direct = y_0.mean()
se_direct = y_0.std(ddof=1) / np.sqrt(n_0)
fit_mean = sm.OLS(y_0, np.ones((n_0, 1))).fit()

pd.DataFrame(
    {
        "direct_mean": [mean_direct],
        "regression_intercept": [fit_mean.params[0]],
        "direct_se": [se_direct],
        "regression_se": [fit_mean.bse[0]],
    }
)
direct_mean regression_intercept direct_se regression_se
0 2.515 2.515 1.171465 1.171465

Estimating a difference as a regression

Now simulate a second group. The slope on an indicator variable is exactly the difference between the two group means.

Code
rng = np.random.default_rng(2141)
n_1 = 30
y_1 = rng.normal(8, 5, size=n_1)

diff = y_1.mean() - y_0.mean()
se_0 = y_0.std(ddof=1) / np.sqrt(n_0)
se_1 = y_1.std(ddof=1) / np.sqrt(n_1)
se_diff = np.sqrt(se_0**2 + se_1**2)

y_all = np.r_[y_0, y_1]
group = np.r_[np.zeros(n_0), np.ones(n_1)]
fit_diff = sm.OLS(y_all, sm.add_constant(group)).fit()

pd.DataFrame(
    {
        "difference_in_means": [diff],
        "regression_slope": [fit_diff.params[1]],
        "two_sample_se": [se_diff],
        "ols_slope_se": [fit_diff.bse[1]],
    }
)
difference_in_means regression_slope two_sample_se ols_slope_se
0 5.767634 5.767634 1.492224 1.481843

The standard errors differ slightly because the classical OLS regression uses a pooled residual variance, whereas the direct calculation above allows different variances in the two groups. The coefficient identity itself is exact.

Code
fig, ax = plt.subplots(figsize=(5.6, 4.5))
ax.scatter(group, y_all, color="black", s=18, alpha=0.75)
ax.set_xticks([0, 1])
ax.set(xlabel="Indicator, x", ylabel="y", title="Regression on an indicator equals a difference in means")
ax.axhline(y_0.mean(), color="0.5", ls="--", lw=1)
ax.axhline(y_1.mean(), color="0.5", ls="--", lw=1)
xx = np.array([0, 1])
ax.plot(xx, fit_diff.params[0] + fit_diff.params[1] * xx, color="black")
ax.text(0.05, y_0.mean() - 1, rf"$\bar y_0$ = {y_0.mean():.2f}", color="0.25")
ax.text(0.95, y_1.mean() + 1, rf"$\bar y_1$ = {y_1.mean():.2f}", color="0.25", ha="right")
ax.text(0.4, fit_diff.params[0] + 0.5 * fit_diff.params[1] - 1, f"y = {fit_diff.params[0]:.2f} + {fit_diff.params[1]:.2f} x")
ax.spines[["top", "right"]].set_visible(False)