Importance sampling, PSIS, and the normal approximation trap

Source notebooks: demo10_1.ipynb, demo10_2.ipynb, demo10_3.ipynb

BDA Chapter 10 uses simple drawings to separate three related ideas:

rejection sampling needs an envelope $M g(\theta)$ above the unnormalized target $q(\theta)$;
importance sampling only needs draws from a proposal $g(\theta)$ and weights $w=q/g$;
Pareto-smoothed importance sampling (PSIS) diagnoses and stabilizes the largest weights.

The key computational object is the log weight:

\[ \log w_s = \log q(\theta_s) - \log g(\theta_s), \qquad \theta_s \sim g. \]

The demos start with a toy one-dimensional target and then apply the same idea to the bioassay posterior.

Plain importance sampling

If q_logpdf is an unnormalized target and proposal is a distribution object, the skeleton is:

Code

import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(123)
proposal = t(df=4, loc=0, scale=1.5)
q_logpdf = norm(0, 1).logpdf
f = lambda theta: theta**2

theta = proposal.rvs(size=4000, random_state=rng)
log_w = q_logpdf(theta) - proposal.logpdf(theta)
w = np.exp(log_w - log_w.max())
w = w / w.sum()
estimate = np.sum(w * f(theta))
print({"E[theta^2]": estimate, "ess_is": 1 / np.sum(w**2)})

{'E[theta^2]': np.float64(0.9924712792854424), 'ess_is': np.float64(3049.0421638404246)}

The subtraction of log_w.max() does not change normalized weights; it only prevents overflow. This tiny numerical habit matters a lot once the target is a posterior from a real model.

PSIS in the bioassay normal approximation

demo10_3 fits a normal approximation around the bioassay posterior mode, draws from that approximation, and reweights the draws back toward the exact grid posterior. In Python/ArviZ, the diagnostic line is short:

Code

import arviz as az

# Same toy target/proposal as above; in the bioassay page these would be
# the exact log posterior and normal-approximation proposal log density.
log_q = q_logpdf(theta)
log_g = proposal.logpdf(theta)
log_w_smooth, khat = az.stats.psislw(log_q - log_g)
float(khat)

-1.5213450466766412

The Pareto $\hat{k}$ is the warning light. Small values mean the proposal’s tails are good enough for stable importance sampling. Large values mean a few draws dominate, so the apparent sample size is much smaller than the number of proposal draws.

Why the normal approximation fails here

The bioassay posterior is curved and has a long tail in the LD50 transformation. A Gaussian centered at the posterior mode catches the local shape, but it does not automatically cover the tail that matters for ld50 = -alpha / beta. Importance weighting can repair some local mismatch, but it cannot reliably recover posterior mass in regions the proposal almost never visits.

That is the practical lesson: importance sampling is not magic. It turns proposal draws into posterior expectations only when the proposal has adequate support and tail behavior.

House-style implementation notes

Use this pattern when translating the BDA notebooks:

Code

def normalize_log_weights(log_w):
    shifted = log_w - np.max(log_w)
    w = np.exp(shifted)
    return w / np.sum(w)

# for diagnostics and smoothing, prefer ArviZ
log_w_smooth, pareto_k = az.psislw(log_w)

Then report both the estimate and the diagnostic:

Code

w = np.exp(log_w_smooth - np.max(log_w_smooth))
w = w / w.sum()
quantity = f(theta)
print({
    "estimate": np.sum(w * quantity),
    "pareto_k": float(khat),
    "ess_is": 1 / np.sum(w**2),
})

{'estimate': np.float64(0.9924714981441027), 'pareto_k': -1.5213450466766412, 'ess_is': np.float64(3049.0424579181026)}

Connection to LOO

The same PSIS machinery appears later as PSIS-LOO. There, the proposal is not a separate normal approximation; it is the full posterior with one observation left in, reused to approximate each leave-one-out posterior. The failure mode is the same: if leaving out an observation changes the posterior too much, the importance weights get heavy tails and $\hat{k}$ reports the problem.

This is why the BDA sampling demos belong next to the ROS model-checking examples. They explain what the diagnostic is measuring, rather than treating az.loo() as a black box.

# Importance sampling, PSIS, and the normal approximation trap Source notebooks: [`demo10_1.ipynb`](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch10/demo10_1.ipynb), [`demo10_2.ipynb`](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch10/demo10_2.ipynb), [`demo10_3.ipynb`](https://github.com/avehtari/BDA_py_demos/blob/master/demos_ch10/demo10_3.ipynb) BDA Chapter 10 uses simple drawings to separate three related ideas: - rejection sampling needs an envelope $M g(\theta)$ above the unnormalized target $q(\theta)$; - importance sampling only needs draws from a proposal $g(\theta)$ and weights $w=q/g$; - Pareto-smoothed importance sampling (PSIS) diagnoses and stabilizes the largest weights. The key computational object is the log weight: $$ \log w_s = \log q(\theta_s) - \log g(\theta_s), \qquad \theta_s \sim g. $$ The demos start with a toy one-dimensional target and then apply the same idea to the bioassay posterior. ## Plain importance sampling If `q_logpdf` is an unnormalized target and `proposal` is a distribution object, the skeleton is: ```{python} import numpy as np from scipy.stats import norm, t rng = np.random.default_rng(123) proposal = t(df=4, loc=0, scale=1.5) q_logpdf = norm(0, 1).logpdf f = lambda theta: theta**2 theta = proposal.rvs(size=4000, random_state=rng) log_w = q_logpdf(theta) - proposal.logpdf(theta) w = np.exp(log_w - log_w.max()) w = w / w.sum() estimate = np.sum(w * f(theta)) print({"E[theta^2]": estimate, "ess_is": 1 / np.sum(w**2)}) ``` The subtraction of `log_w.max()` does not change normalized weights; it only prevents overflow. This tiny numerical habit matters a lot once the target is a posterior from a real model. ## PSIS in the bioassay normal approximation `demo10_3` fits a normal approximation around the bioassay posterior mode, draws from that approximation, and reweights the draws back toward the exact grid posterior. In Python/ArviZ, the diagnostic line is short: ```{python} import arviz as az # Same toy target/proposal as above; in the bioassay page these would be # the exact log posterior and normal-approximation proposal log density. log_q = q_logpdf(theta) log_g = proposal.logpdf(theta) log_w_smooth, khat = az.stats.psislw(log_q - log_g) float(khat) ``` The Pareto $\hat{k}$ is the warning light. Small values mean the proposal's tails are good enough for stable importance sampling. Large values mean a few draws dominate, so the apparent sample size is much smaller than the number of proposal draws. ## Why the normal approximation fails here The bioassay posterior is curved and has a long tail in the LD50 transformation. A Gaussian centered at the posterior mode catches the local shape, but it does not automatically cover the tail that matters for `ld50 = -alpha / beta`. Importance weighting can repair some local mismatch, but it cannot reliably recover posterior mass in regions the proposal almost never visits. That is the practical lesson: importance sampling is not magic. It turns proposal draws into posterior expectations only when the proposal has adequate support and tail behavior. ## House-style implementation notes Use this pattern when translating the BDA notebooks: ```{python} def normalize_log_weights(log_w): shifted = log_w - np.max(log_w) w = np.exp(shifted) return w / np.sum(w) # for diagnostics and smoothing, prefer ArviZ log_w_smooth, pareto_k = az.psislw(log_w) ``` Then report both the estimate and the diagnostic: ```{python} w = np.exp(log_w_smooth - np.max(log_w_smooth)) w = w / w.sum() quantity = f(theta) print({ "estimate": np.sum(w * quantity), "pareto_k": float(khat), "ess_is": 1 / np.sum(w**2), }) ``` ## Connection to LOO The same PSIS machinery appears later as PSIS-LOO. There, the proposal is not a separate normal approximation; it is the full posterior with one observation left in, reused to approximate each leave-one-out posterior. The failure mode is the same: if leaving out an observation changes the posterior too much, the importance weights get heavy tails and $\hat{k}$ reports the problem. This is why the BDA sampling demos belong next to the ROS model-checking examples. They explain what the diagnostic is measuring, rather than treating `az.loo()` as a black box.