Fake midterm and final exam

Source: FakeMidtermFinal/simulation.Rmd

This example simulates 1,000 students with a latent ability, a noisy midterm, and a noisy final exam. Regressing final on midterm illustrates regression toward the mean: the fitted slope is positive but well below one because the midterm score is measured with noise.

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

rng = np.random.default_rng(2243)

Simulate fake data

Code

n = 1_000
true_ability = rng.normal(50, 10, size=n)
noise_1 = rng.normal(0, 10, size=n)
noise_2 = rng.normal(0, 10, size=n)
midterm = true_ability + noise_1
final = true_ability + noise_2
exams = pd.DataFrame({"midterm": midterm, "final": final})

exams.describe().round(2)

	midterm	final
count	1000.00	1000.00
mean	50.14	49.96
std	14.46	13.66
min	-0.47	5.63
25%	40.04	41.09
50%	50.07	49.96
75%	60.05	58.67
max	99.69	94.62

Linear regression

The R version fits stan_glm(final ~ midterm). Here OLS gives the corresponding Gaussian regression estimate and standard error.

Code

fit_1 = smf.ols("final ~ midterm", data=exams).fit()
fit_1.summary().tables[1]

	coef	std err	t	P>\|t\|	[0.025	0.975]
Intercept	27.0564	1.365	19.816	0.000	24.377	29.736
midterm	0.4567	0.026	17.457	0.000	0.405	0.508

Code

fit_line = pd.DataFrame({"midterm": np.linspace(0, 100, 101)})
fit_line["final_hat"] = fit_1.predict(fit_line)
fit_1.params

Intercept    27.056355
midterm       0.456750
dtype: float64

Midterm and final exam scores

Code

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(exams["midterm"], exams["final"], s=12, alpha=0.45, linewidths=0)
ax.plot(fit_line["midterm"], fit_line["final_hat"], color="black", linewidth=2, label="OLS fit")
for v in np.arange(0, 101, 20):
    ax.axhline(v, color="0.85", linestyle="--", linewidth=0.8, zorder=0)
    ax.axvline(v, color="0.85", linestyle="--", linewidth=0.8, zorder=0)
ax.set_xlim(0, 100)
ax.set_ylim(0, 100)
ax.set_xlabel("Midterm exam score")
ax.set_ylabel("Final exam score")
ax.set_aspect("equal", adjustable="box")
ax.legend(frameon=False)

Why the slope is below one

The data-generating model has equal variances for ability, midterm noise, and final noise. The population slope of final on midterm is

[ = = 0.5. ]

Code

theoretical_slope = 10**2 / (10**2 + 10**2)
observed_corr = exams.corr().loc["midterm", "final"]
pd.Series({
    "theoretical slope": theoretical_slope,
    "estimated slope": fit_1.params["midterm"],
    "correlation": observed_corr,
}).round(3)

theoretical slope    0.500
estimated slope      0.457
correlation          0.484
dtype: float64

A student with an unusually high midterm score is probably above average, but part of that high score is also positive noise. The model therefore predicts a final score closer to the population mean than the midterm score itself.

# Fake midterm and final exam Source: `FakeMidtermFinal/simulation.Rmd` This example simulates 1,000 students with a latent ability, a noisy midterm, and a noisy final exam. Regressing final on midterm illustrates regression toward the mean: the fitted slope is positive but well below one because the midterm score is measured with noise. ```{python} import numpy as np import pandas as pd import matplotlib.pyplot as plt import statsmodels.formula.api as smf rng = np.random.default_rng(2243) ``` ## Simulate fake data ```{python} n = 1_000 true_ability = rng.normal(50, 10, size=n) noise_1 = rng.normal(0, 10, size=n) noise_2 = rng.normal(0, 10, size=n) midterm = true_ability + noise_1 final = true_ability + noise_2 exams = pd.DataFrame({"midterm": midterm, "final": final}) exams.describe().round(2) ``` ## Linear regression The R version fits `stan_glm(final ~ midterm)`. Here OLS gives the corresponding Gaussian regression estimate and standard error. ```{python} fit_1 = smf.ols("final ~ midterm", data=exams).fit() fit_1.summary().tables[1] ``` ```{python} fit_line = pd.DataFrame({"midterm": np.linspace(0, 100, 101)}) fit_line["final_hat"] = fit_1.predict(fit_line) fit_1.params ``` ## Midterm and final exam scores ```{python} fig, ax = plt.subplots(figsize=(5, 5)) ax.scatter(exams["midterm"], exams["final"], s=12, alpha=0.45, linewidths=0) ax.plot(fit_line["midterm"], fit_line["final_hat"], color="black", linewidth=2, label="OLS fit") for v in np.arange(0, 101, 20): ax.axhline(v, color="0.85", linestyle="--", linewidth=0.8, zorder=0) ax.axvline(v, color="0.85", linestyle="--", linewidth=0.8, zorder=0) ax.set_xlim(0, 100) ax.set_ylim(0, 100) ax.set_xlabel("Midterm exam score") ax.set_ylabel("Final exam score") ax.set_aspect("equal", adjustable="box") ax.legend(frameon=False) ``` ## Why the slope is below one The data-generating model has equal variances for ability, midterm noise, and final noise. The population slope of final on midterm is \[ \frac{\operatorname{Var}(\text{ability})}{\operatorname{Var}(\text{ability}) + \operatorname{Var}(\text{midterm noise})} = \frac{10^2}{10^2 + 10^2} = 0.5. \] ```{python} theoretical_slope = 10**2 / (10**2 + 10**2) observed_corr = exams.corr().loc["midterm", "final"] pd.Series({ "theoretical slope": theoretical_slope, "estimated slope": fit_1.params["midterm"], "correlation": observed_corr, }).round(3) ``` A student with an unusually high midterm score is probably above average, but part of that high score is also positive noise. The model therefore predicts a final score closer to the population mean than the midterm score itself.