Score Matching for Statistical Estimation and Inference

Antitonic score projection, convex M-estimation, and score-based simulation

Problem

Score matching estimates derivatives of log densities without requiring a normalized likelihood. In regression, this can be used to build estimating equations that adapt to the unknown error distribution. Feng, Kao, Xu, and Samworth develop this idea for linear regression by constructing a data-driven convex loss whose empirical minimizer has optimal asymptotic covariance among convex M-estimators.

The object of interest is a regression model

\[ Y_i = X_i'\beta_0 + \epsilon_i, \]

where the error distribution is unknown and may be heavy-tailed or non-Gaussian. The goal is to estimate $\beta_0$ efficiently and robustly while preserving a convex optimization problem.

An econometrician can read the method as adaptive GMM with a special scalar transformation of the residual. Ordinary least squares uses the residual itself in the moment condition. Antitonic score matching learns a residual score from the data, imposes the shape restriction needed for a convex loss, and then uses that score in the estimating equation.

Scores and M-Estimation

Start from the ordinary statistical learning problem. We observe pairs $(Y_i,X_i)$ and want a predictor $f_\theta(X_i)$ for the label $Y_i$. A generic empirical risk estimator solves

\[ \hat\theta \in \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}\{Y_i,f_\theta(X_i)\}. \]

Nothing in this formulation says that the loss must be squared error. Squared error, absolute error, quantile loss, logistic loss, and likelihood-based losses are all choices for $\mathcal{L}$. The choice of loss determines what aspect of the conditional distribution is being targeted and how observations with large errors influence the estimate.

For the linear regression model, take

\[ f_\beta(X_i) = X_i'\beta. \]

Many regression losses depend on the data only through the scalar residual

\[ r_i(\beta) = Y_i - X_i'\beta. \]

This residual reduction gives the M-estimation form

\[ \hat\beta \in \arg\min_\beta \frac{1}{n} \sum_{i=1}^n \ell(r_i(\beta)). \]

OLS chooses $\ell(r)=r^2/2$. LAD chooses $\ell(r)=|r|$. Huber regression chooses a loss that is quadratic near zero and linear in the tails. These are different estimators because they choose different ways to price residuals.

Likelihood is another way to choose $\ell$. Suppose the regression error has density $p$ and

\[ Y_i = X_i'\beta_0 + \epsilon_i, \qquad \epsilon_i \sim p. \]

Conditional on $X_i$, the density of $Y_i$ under candidate $\beta$ is

\[ p(Y_i-X_i'\beta) = p(r_i(\beta)). \]

The log likelihood is

\[ \sum_{i=1}^n \log p(r_i(\beta)). \]

Maximizing this likelihood is the same as minimizing the residual loss

\[ \ell_0(r) = -\log p(r). \]

Thus L2 loss is not the starting point. It is the likelihood loss for Gaussian errors. Laplace errors give LAD. Cauchy errors give a nonconvex log loss. The antitonic score-matching method asks whether one can learn a good residual transformation from the data while retaining a convex optimization problem.

Now define the residual score

\[ \psi_0(r) = \frac{\partial}{\partial r}\log p(r) = \frac{p'(r)}{p(r)}. \]

This is a derivative with respect to the scalar residual, but the residual is not the optimization variable. The parameter is still $\beta$. The residual derivative appears by the chain rule when differentiating the likelihood with respect to $\beta$:

\[ L(\beta) = \sum_{i=1}^n \log p(r_i(\beta)). \]

\[ \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^n \frac{\partial \log p(r_i)}{\partial r_i} \frac{\partial r_i(\beta)}{\partial \beta}. \]

Since $\partial r_i(\beta)/\partial\beta=-X_i$, the parameter score is

\[ \frac{\partial L(\beta)}{\partial \beta} = -\sum_{i=1}^n X_i \frac{p'(r_i(\beta))}{p(r_i(\beta))}. \]

Setting this to zero gives the likelihood estimating equation. The residual score is the scalar component of the parameter score; it is not a separate optimization over residuals.

It is convenient to write a general residual transformation as $\psi$. If $\ell$ is differentiable, define

\[ \psi(r) = -\ell'(r). \]

Then the M-estimator solves

\[ \frac{1}{n} \sum_{i=1}^n X_i\psi(Y_i-X_i'\hat\beta_\psi) = 0. \]

For Gaussian errors,

\[ \ell(r)=\frac{r^2}{2\sigma^2}, \qquad \psi(r)=-\frac{r}{\sigma^2}. \]

Multiplying a just-identified moment by a nonzero constant does not change its root, so the estimating equation is equivalent to

\[ \sum_{i=1}^n X_i r_i(\beta), \]

which gives the OLS normal equations. The Gaussian case is therefore a special case with a linear residual score, not the primitive object from which the rest of the theory starts.

The function $\psi$ determines the estimating equation. It is not the Newton-Raphson update direction. A numerical Newton step uses derivatives of the sample objective to move $\beta$ toward a solution; $\psi$ is the residual transformation that defines the sample objective or moment condition in the first place.

Under standard regularity conditions and independence of $X_i$ and $\epsilon_i$,

\[ \sqrt{n}(\hat\beta_\psi - \beta_0) \Rightarrow N\left( 0, V_p(\psi)\{E(X_iX_i')\}^{-1} \right), \]

with

\[ V_p(\psi) = \frac{E[\psi(\epsilon)^2]}{E[\psi'(\epsilon)]^2}. \]

The likelihood score $\psi_0$ minimizes this scale factor over sufficiently regular scores. But there is a catch: if the induced loss $\ell = -\log p$ is not convex, empirical risk minimization can be computationally awkward and can have undesirable local minima.

Relation to OLS and GMM

OLS solves the moment condition

\[ E[X_i(Y_i-X_i'\beta_0)] = 0. \]

This is the same as using the score-like residual transformation $\psi(u)=u$, up to the sign convention for $\psi=-\ell'$. The squared-error loss therefore corresponds to a linear residual score.

Antitonic score matching replaces this linear residual score with a learned score:

\[ E[X_i\psi(Y_i-X_i'\beta_0)] = 0. \]

So the estimator is a Z-estimator and can be written as an exactly identified GMM estimator. The novelty is not the existence of a moment condition. The novelty is the data-driven choice of $\psi$, together with a monotonicity constraint that makes the implied loss convex.

For a fixed $\psi$, the generic sandwich variance is

\[ A^{-1}BA^{-1}, \]

where

\[ A = E[-X_iX_i'\psi'(\epsilon_i)] \]

and

\[ B = E[X_iX_i'\psi(\epsilon_i)^2]. \]

If $\epsilon_i$ is independent of $X_i$, this simplifies to

\[ A = E(X_iX_i')E[-\psi'(\epsilon_i)] \]

and

\[ B = E(X_iX_i')E[\psi(\epsilon_i)^2]. \]

The asymptotic covariance becomes

\[ \frac{E[\psi(\epsilon)^2]}{E[-\psi'(\epsilon)]^2} \{E(X_iX_i')\}^{-1}. \]

For OLS, $\psi(u)=u$, so this is $\sigma^2\{E(X_iX_i')\}^{-1}$ under homoskedasticity. With heteroskedasticity, the usual Huber-White covariance keeps the full meat

\[ E[X_iX_i'\epsilon_i^2]. \]

The antitonic score-matching variance is therefore not generally the OLS Huber-White variance. It uses $\psi(\hat\epsilon_i)^2$ in the meat and the derivative of the learned score in the bread. It coincides with OLS only when the learned score is essentially linear. In heavy-tailed or non-log-concave settings, the point is precisely that the learned score differs from the residual.

Two Small DGPs

The following code uses the same notation as the previous section. The helper sandwich_for_score computes the exactly identified sandwich for a chosen residual score $\psi$:

\[ \hat V = \hat A^{-1}\hat B\hat A^{-1}/n, \]

where

\[ \hat A = -\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi'(\hat\epsilon_i) \]

and

\[ \hat B = \frac{1}{n}\sum_{i=1}^n X_iX_i'\psi(\hat\epsilon_i)^2. \]

Code

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from trex import AntitonicScoreMatchingRegression


def add_intercept(x):
    return np.column_stack([np.ones(len(x)), np.asarray(x)])


def ols_fit(X, y):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    sigma2 = resid @ resid / (n - k)
    classical = sigma2 * xtx_inv
    meat = X.T @ (resid[:, None] ** 2 * X)
    hc1 = (n / (n - k)) * xtx_inv @ meat @ xtx_inv
    return beta, resid, classical, hc1


def sandwich_for_score(X, resid, psi, psi_deriv, correction=True):
    n, k = X.shape
    score_values = psi(resid)
    deriv_values = psi_deriv(resid)
    A = -(X.T * deriv_values) @ X / n
    B = X.T @ (score_values[:, None] ** 2 * X) / n
    vcov = np.linalg.pinv(A) @ B @ np.linalg.pinv(A) / n
    if correction:
        vcov *= n / (n - k)
    return vcov


def format_estimates(names, betas, vcovs):
    rows = []
    for name, beta, vcov in zip(names, betas, vcovs):
        se = np.sqrt(np.diag(vcov))
        rows.append(
            {
                "estimator": name,
                "intercept": beta[0],
                "slope": beta[1],
                "se_intercept": se[0],
                "se_slope": se[1],
            }
        )
    return pd.DataFrame(rows).round(4)

Linear Score: OLS Exactly

Take a clean homoskedastic Gaussian DGP:

\[ X_i \sim N(0,1), \qquad Y_i = 1 + 2X_i + \epsilon_i, \qquad \epsilon_i \sim N(0,1). \]

With the squared-error loss $\ell(u)=u^2/2$, the score convention above gives $\psi(u)=-u$ because $\psi=-\ell'$. Multiplying a just-identified moment by $-1$ does not change its root, so this is OLS. The bread and meat in the generic sandwich also reduce to the usual Huber-White sandwich.

Code

rng = np.random.default_rng(20260525)
n = 500
x = rng.normal(size=n)
X = add_intercept(x)
y = 1.0 + 2.0 * x + rng.normal(size=n)

ols_beta, ols_resid, ols_classical, ols_hc1 = ols_fit(X, y)

linear_beta = np.linalg.solve(X.T @ X, X.T @ y)
linear_resid = y - X @ linear_beta
linear_hc1 = sandwich_for_score(
    X,
    linear_resid,
    psi=lambda u: -u,
    psi_deriv=lambda u: -np.ones_like(u),
)

format_estimates(
    ["OLS", "linear-score GMM"],
    [ols_beta, linear_beta],
    [ols_hc1, linear_hc1],
)

	estimator	intercept	slope	se_intercept	se_slope
0	OLS	1.0105	1.9468	0.0436	0.0443
1	linear-score GMM	1.0105	1.9468	0.0436	0.0443

The two rows are the same because they solve the same normal equations. This is not a new estimator. It is OLS written as a residual-score moment condition.

Code

pd.DataFrame(
    {
        "max_abs_beta_difference": [np.max(np.abs(ols_beta - linear_beta))],
        "max_abs_hc1_se_difference": [
            np.max(np.abs(np.sqrt(np.diag(ols_hc1)) - np.sqrt(np.diag(linear_hc1))))
        ],
    }
)

	max_abs_beta_difference	max_abs_hc1_se_difference
0	6.661338e-16	0.0

Huber Contamination with Leverage Outliers

Now use a contamination DGP:

\[ (X_i,\epsilon_i) \sim (1-\eta)F_0 + \eta G, \qquad \eta = 0.04. \]

For the clean component,

\[ X_i \sim N(0,1), \qquad \epsilon_i \sim 0.7t_3. \]

For the contaminated component,

\[ X_i \sim N(6,0.35^2), \qquad \epsilon_i \sim N(-16,2.5^2). \]

The outliers are bad in two ways: the residuals are large and the covariates are high leverage. A residual score alone does not make this a high-breakdown robust regression problem. It does, however, change the estimating equation and the sandwich variance relative to OLS.

Code

rng = np.random.default_rng(20260525)
n = 600
eta = 0.04
is_bad = rng.uniform(size=n) < eta

x = np.empty(n)
eps = np.empty(n)
x[~is_bad] = rng.normal(size=(~is_bad).sum())
eps[~is_bad] = 0.7 * rng.standard_t(df=3, size=(~is_bad).sum())
x[is_bad] = rng.normal(loc=6.0, scale=0.35, size=is_bad.sum())
eps[is_bad] = rng.normal(loc=-16.0, scale=2.5, size=is_bad.sum())

X = add_intercept(x)
y = 1.0 + 2.0 * x + eps

ols_beta, ols_resid, _, ols_hc1 = ols_fit(X, y)

For comparison, fit a fixed Huber-score M-estimator and the learned antitonic-score estimator.

Code

def huber_fit(X, y, c=1.345, max_iter=200, tol=1e-10):
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    for _ in range(max_iter):
        resid = y - X @ beta
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if not np.isfinite(scale) or scale < 1e-8:
            scale = np.std(resid) if np.std(resid) > 0 else 1.0
        u = resid / scale
        weights = np.ones_like(u)
        mask = np.abs(u) > c
        weights[mask] = c / np.abs(u[mask])
        beta_next = np.linalg.lstsq(
            X * np.sqrt(weights)[:, None],
            y * np.sqrt(weights),
            rcond=None,
        )[0]
        if np.linalg.norm(beta_next - beta) < tol * (1 + np.linalg.norm(beta)):
            beta = beta_next
            break
        beta = beta_next

    resid = y - X @ beta
    scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
    if not np.isfinite(scale) or scale < 1e-8:
        scale = np.std(resid) if np.std(resid) > 0 else 1.0

    def psi(u):
        return -np.clip(u / scale, -c, c) * scale

    def psi_deriv(u):
        return -(np.abs(u / scale) <= c).astype(float)

    vcov = sandwich_for_score(X, resid, psi=psi, psi_deriv=psi_deriv)
    return beta, resid, vcov


huber_beta, huber_resid, huber_vcov = huber_fit(X, y)

asm = AntitonicScoreMatchingRegression(
    pilot="lad",
    alt_iter=2,
    k=1200,
    kernel_grid_size=2048,
    max_iter=200,
    tol=1e-6,
).fit(x[:, None], y)

asm_beta = asm.beta_
asm_resid = y - X @ asm_beta
asm_vcov = sandwich_for_score(
    X,
    asm_resid,
    psi=lambda u: asm.score_(u),
    psi_deriv=lambda u: asm.score_.derivative(u),
)

format_estimates(
    ["OLS", "Huber-score M", "antitonic score matching"],
    [ols_beta, huber_beta, asm_beta],
    [ols_hc1, huber_vcov, asm_vcov],
)

	estimator	intercept	slope	se_intercept	se_slope
0	OLS	0.6861	0.4108	0.0818	0.1372
1	Huber-score M	0.8793	1.5439	0.0470	0.1080
2	antitonic score matching	0.8985	1.7086	0.0358	0.0390

The robust-score estimators no longer solve the OLS normal equations. Their point estimates and standard errors differ because $\psi(\hat\epsilon_i)$ is not proportional to $\hat\epsilon_i$. In this draw, OLS is pulled strongly by the contaminated high-leverage observations; the bounded and learned scores reduce the influence of large residuals, though they do not eliminate the leverage problem.

Code

grid = np.linspace(x.min() - 0.5, x.max() + 0.5, 200)
X_grid = add_intercept(grid)

fig, ax = plt.subplots()
ax.scatter(x[~is_bad], y[~is_bad], s=16, alpha=0.55, label="clean draws")
ax.scatter(x[is_bad], y[is_bad], s=28, alpha=0.9, label="contaminated draws")
ax.plot(grid, X_grid @ ols_beta, linewidth=2, label="OLS")
ax.plot(grid, X_grid @ huber_beta, linewidth=2, label="Huber-score M")
ax.plot(grid, X_grid @ asm_beta, linewidth=2, label="antitonic score matching")
ax.plot(grid, 1.0 + 2.0 * grid, color="black", linestyle="--", linewidth=1.5, label="clean target")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(loc="best")
ax.set_title("Leverage contamination changes the residual-score estimating equation")
plt.show()

Convex Losses and Decreasing Scores

The loss $\ell$ is convex if and only if $\ell'$ is increasing. Since $\psi = -\ell'$, convexity is equivalent to $\psi$ being decreasing. This turns the problem into a constrained score approximation problem:

\[ \psi_0^* \in \arg\min_{\psi \in \Psi_\downarrow} V_p(\psi), \]

where $\Psi_\downarrow$ is the cone of decreasing score-like functions.

The paper’s key result is that this optimal convex M-estimation score can be obtained by score matching. In population, the score-matching objective is

\[ D_p(\psi) = E[\psi(\epsilon)^2 + 2\psi'(\epsilon)]. \]

Restricting $\psi$ to the decreasing cone and minimizing $D_p(\psi)$ yields the same projected score $\psi_0^*$ that minimizes the asymptotic variance among convex M-estimators.

In this setting, score matching is a way to choose an estimating equation, not a generative-model training objective.

Antitonic Projection

The population construction is elegant. Let $F$ be the CDF of the error distribution and define the density quantile function

\[ J(u) = p(F^{-1}(u)), \qquad u \in [0,1]. \]

Let $\hat J$ be the least concave majorant of $J$. The optimal decreasing projected score is the right derivative of this concave majorant, pulled back to the residual scale:

\[ \psi_0^*(z) = \hat J^{(R)}(F(z)). \]

When the true score $\psi_0$ already decreases, the error density is log-concave and the projection gives back the likelihood score. When the true score is not decreasing, the projection produces the closest decreasing score in the Fisher-divergence geometry relevant for the convex M-estimation problem.

For Cauchy errors, the likelihood score is nonmonotone enough to make the likelihood loss nonconvex. The projected score is bounded and Huber-like, so the resulting loss is convex and robust. The paper reports that, for standard Cauchy errors, this restriction to convex losses still retains high asymptotic efficiency relative to the oracle Cauchy maximum likelihood estimator.

Sample Algorithm

The practical algorithm has five parts.

Fit a pilot regression, typically LAD or OLS.
Form residuals from that pilot fit.
Smooth the residual distribution with a kernel density estimator.
Estimate the residual score and project it onto the decreasing cone using PAVA.
Fit the final regression by convex M-estimation using the negative antiderivative of the projected score.

In notation, with pilot residuals $\hat\epsilon_i$, form a smoothed density estimate $\tilde p_n$. The raw score estimate is

\[ \tilde\psi_n(z) = \frac{\tilde p_n'(z)}{\tilde p_n(z)}. \]

The antitonic score estimate is

\[ \hat\psi_n = M_\downarrow(\tilde\psi_n \circ \tilde F_n^{-1}) \circ \tilde F_n, \]

where $M_\downarrow$ denotes the decreasing projection and $\tilde F_n$ is the CDF associated with the smoothed density. Computationally, this is done by evaluating the density quantile function on a grid and running the pool-adjacent-violators algorithm.

For symmetric errors, one can symmetrize the score estimate:

\[ \hat\psi_n^{\text{sym}}(z) = \frac{\hat\psi_n(z) - \hat\psi_n(-z)}{2}. \]

The final estimator solves

\[ \sum_{i=1}^n X_i \hat\psi_n(Y_i - X_i'\beta) = 0. \]

Equivalently, it minimizes a convex empirical risk. In code this can be solved by Newton or damped Newton steps because the learned score supplies both the gradient and a piecewise derivative.

Inference

The antitonic information is

\[ i^*(p) = E[\psi_0^*(\epsilon)^2]. \]

For symmetric errors without a nuisance intercept complication, the asymptotic covariance is

\[ \frac{1}{n} \left[ i^*(p) E(X_iX_i') \right]^{-1}. \]

The natural plug-in estimator is

\[ \hat i = \frac{1}{n} \sum_{i=1}^n \hat\psi_n(\hat\epsilon_i)^2, \]

and

\[ \hat I = \hat i \frac{X'X}{n}. \]

Then a Wald interval for coordinate $j$ is

\[ \hat\beta_j \pm z_{1-\alpha/2} \sqrt{ \frac{(\hat I^{-1})_{jj}}{n} }. \]

The full paper develops cross-fitting and intercept handling to justify this asymptotically. Practical implementations can use the same plug-in logic while keeping the numerical algorithm simpler.

Minimal Software Form

A minimal software interface needs two components:

from trex import AntitonicScoreMatchingRegression, estimate_antitonic_score

The lower-level function estimate_antitonic_score takes residuals and returns a callable piecewise-linear score estimate. It:

computes a Gaussian KDE on a grid;
constructs the smoothed CDF;
evaluates the density quantile function;
projects the discrete derivative onto the decreasing cone with PAVA;
optionally symmetrizes the result.

The regression estimator then:

builds an intercept column if requested;
computes an OLS or LAD pilot;
estimates the antitonic residual score;
solves the final convex M-estimation equations by damped Newton steps;
stores coefficients, residuals, fitted values, antitonic information, covariance estimates, and Wald intervals.

This is narrower than the full R package asm. It omits the paper’s cross-fitted theoretical estimator, asymmetric intercept selection, diagnostic plots, and prediction intervals. It implements the central practical mechanism: data-driven convex robust regression from residual score matching.

Score-Based Simulation

Score matching also appears in modern generative modeling, including diffusion models. The connection is worth understanding because it is not intrinsically about producing pictures. The useful abstraction is a way to represent, perturb, and sample from complicated distributions through their score functions.

In the regression setting above, the score is a one-dimensional residual score used to choose an estimating equation. The output is $\hat\beta$, standard errors, and confidence intervals. The score is a tool for efficient semiparametric estimation.

In score-based simulation, the score is usually a high-dimensional vector field

\[ s_\theta(x,t) \approx \nabla_x \log p_t(x), \]

where $p_t$ is a noise-corrupted version of a target distribution. The object $x$ need not be an image. It can be a vector of covariates, a panel history, a vector of latent shocks, a discretized policy function, or simulated microdata satisfying structural restrictions.

The basic construction is:

Start from draws $x_0$ from a target distribution.
Add noise through a known stochastic path, producing $x_t$.
Train a model to estimate the score $\nabla_x \log p_t(x)$ of the noised distribution.
Use the learned score to move draws from high noise back toward the target distribution.

A DDPM is one discretized version of this idea. It trains a model to predict the Gaussian noise added to $x_0$. That noise prediction implicitly estimates the score because

\[ \nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}. \]

The statistical analogy is:

residual score matching estimates a one-dimensional score to choose a robust estimating equation;
diffusion-style score modeling estimates a high-dimensional score to build a simulator;
both avoid normalized likelihoods and work directly with derivatives of log density.

The estimand determines whether this is useful. For statistical estimation, the score must improve an estimator or an inferential procedure. For simulation, the score must generate objects whose distributional or structural properties can be checked against a clear target. Without that target, score-based generation is just an impressive sampler.

Structural Econometrics Angle

The antitonic score-matching paper suggests a useful discipline for score-based structural work. The score model should be tied to a statistical target, not only to sample quality.

For simulated method of moments, the classical objective is

\[ Q(\theta) = \left[ m_{\text{data}} - E_\theta m(Y) \right]' W \left[ m_{\text{data}} - E_\theta m(Y) \right]. \]

A score-based model can enter as an auxiliary simulator or proposal mechanism, but the target remains moment fit and inference on $\theta$. The analogy to antitonic score matching is:

learn a score from data or residuals;
constrain it enough to preserve an estimation target;
use the learned score to improve optimization or simulation;
report uncertainty through the original estimand, not through visual fit alone.

For structural models, possible uses include:

robust residual-score estimation in linearized or local GMM steps;
score-guided draws of latent shocks conditional on observed choices;
score-based proposals for dynamic discrete-choice histories;
moment-guided reverse sampling where structural residuals guide the score path;
adaptive convex losses for simulated auxiliary regressions.

The useful discipline is to keep the score model attached to an estimand. In regression, that estimand is $\beta_0$. In SMM, it is $\theta$. In a generative application, it may be a distributional simulator, but then the evaluation target should be stated explicitly.

For example, in a dynamic discrete-choice model, a score-based simulator could generate latent shock histories conditional on observed state-action paths. The structural model would still impose Bellman inequalities, transition laws, and counterfactual policy restrictions. The learned score would be a computational device for exploring the latent space, not a substitute for the structural model.

Similarly, in SMM, a score-based sampler could propose simulated datasets or latent variables that are closer to the empirical support than naive Monte Carlo draws. The SMM criterion would still be evaluated through moments, weights, and sampling error:

\[ \hat\theta = \arg\min_\theta \left[ \hat m_{\text{data}} - \hat m_{\theta} \right]' W \left[ \hat m_{\text{data}} - \hat m_{\theta} \right]. \]

This is the serious reason to care about diffusion-style machinery in econometrics: not because the canonical demos generate faces, but because the same score-based machinery can act as an adaptive simulator, proposal distribution, or inverse-problem solver for objects that are hard to sample directly.

References

Oliver Y. Feng, Yu-Chun Kao, Min Xu, and Richard J. Samworth, “Optimal convex M-estimation via score matching,” arXiv:2403.16688, 2025 revision.
asm: “Optimal Convex M-Estimation for Linear Regression via Antitonic Score Matching,” R package version 0.2.4.
Aapo Hyvarinen, “Estimation of Non-Normalized Statistical Models by Score Matching,” 2005.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” 2021.

--- title: "Score Matching for Statistical Estimation and Inference" subtitle: "Antitonic score projection, convex M-estimation, and score-based simulation" format: html: embed-resources: true page-layout: full toc: true toc-depth: 3 code-fold: true code-tools: true execute: echo: true warning: false message: false jupyter: python3 --- ## Problem Score matching estimates derivatives of log densities without requiring a normalized likelihood. In regression, this can be used to build estimating equations that adapt to the unknown error distribution. Feng, Kao, Xu, and Samworth develop this idea for linear regression by constructing a data-driven convex loss whose empirical minimizer has optimal asymptotic covariance among convex M-estimators. The object of interest is a regression model $$ Y_i = X_i'\beta_0 + \epsilon_i, $$ where the error distribution is unknown and may be heavy-tailed or non-Gaussian. The goal is to estimate $\beta_0$ efficiently and robustly while preserving a convex optimization problem. An econometrician can read the method as adaptive GMM with a special scalar transformation of the residual. Ordinary least squares uses the residual itself in the moment condition. Antitonic score matching learns a residual score from the data, imposes the shape restriction needed for a convex loss, and then uses that score in the estimating equation. ## Scores and M-Estimation Start from the ordinary statistical learning problem. We observe pairs $(Y_i,X_i)$ and want a predictor $f_\theta(X_i)$ for the label $Y_i$. A generic empirical risk estimator solves $$ \hat\theta \in \arg\min_\theta \frac{1}{n} \sum_{i=1}^n \mathcal{L}\{Y_i,f_\theta(X_i)\}. $$ Nothing in this formulation says that the loss must be squared error. Squared error, absolute error, quantile loss, logistic loss, and likelihood-based losses are all choices for $\mathcal{L}$. The choice of loss determines what aspect of the conditional distribution is being targeted and how observations with large errors influence the estimate. For the linear regression model, take $$ f_\beta(X_i) = X_i'\beta. $$ Many regression losses depend on the data only through the scalar residual $$ r_i(\beta) = Y_i - X_i'\beta. $$ This residual reduction gives the M-estimation form $$ \hat\beta \in \arg\min_\beta \frac{1}{n} \sum_{i=1}^n \ell(r_i(\beta)). $$ OLS chooses $\ell(r)=r^2/2$. LAD chooses $\ell(r)=|r|$. Huber regression chooses a loss that is quadratic near zero and linear in the tails. These are different estimators because they choose different ways to price residuals. Likelihood is another way to choose $\ell$. Suppose the regression error has density $p$ and $$ Y_i = X_i'\beta_0 + \epsilon_i, \qquad \epsilon_i \sim p. $$ Conditional on $X_i$, the density of $Y_i$ under candidate $\beta$ is $$ p(Y_i-X_i'\beta) = p(r_i(\beta)). $$ The log likelihood is $$ \sum_{i=1}^n \log p(r_i(\beta)). $$ Maximizing this likelihood is the same as minimizing the residual loss $$ \ell_0(r) = -\log p(r). $$ Thus L2 loss is not the starting point. It is the likelihood loss for Gaussian errors. Laplace errors give LAD. Cauchy errors give a nonconvex log loss. The antitonic score-matching method asks whether one can learn a good residual transformation from the data while retaining a convex optimization problem. Now define the residual score $$ \psi_0(r) = \frac{\partial}{\partial r}\log p(r) = \frac{p'(r)}{p(r)}. $$ This is a derivative with respect to the scalar residual, but the residual is not the optimization variable. The parameter is still $\beta$. The residual derivative appears by the chain rule when differentiating the likelihood with respect to $\beta$: $$ L(\beta) = \sum_{i=1}^n \log p(r_i(\beta)). $$ $$ \frac{\partial L(\beta)}{\partial \beta} = \sum_{i=1}^n \frac{\partial \log p(r_i)}{\partial r_i} \frac{\partial r_i(\beta)}{\partial \beta}. $$ Since $\partial r_i(\beta)/\partial\beta=-X_i$, the parameter score is $$ \frac{\partial L(\beta)}{\partial \beta} = -\sum_{i=1}^n X_i \frac{p'(r_i(\beta))}{p(r_i(\beta))}. $$ Setting this to zero gives the likelihood estimating equation. The residual score is the scalar component of the parameter score; it is not a separate optimization over residuals. It is convenient to write a general residual transformation as $\psi$. If $\ell$ is differentiable, define $$ \psi(r) = -\ell'(r). $$ Then the M-estimator solves $$ \frac{1}{n} \sum_{i=1}^n X_i\psi(Y_i-X_i'\hat\beta_\psi) = 0. $$ For Gaussian errors, $$ \ell(r)=\frac{r^2}{2\sigma^2}, \qquad \psi(r)=-\frac{r}{\sigma^2}. $$ Multiplying a just-identified moment by a nonzero constant does not change its root, so the estimating equation is equivalent to $$ \sum_{i=1}^n X_i r_i(\beta), $$ which gives the OLS normal equations. The Gaussian case is therefore a special case with a linear residual score, not the primitive object from which the rest of the theory starts. The function $\psi$ determines the estimating equation. It is not the Newton-Raphson update direction. A numerical Newton step uses derivatives of the sample objective to move $\beta$ toward a solution; $\psi$ is the residual transformation that defines the sample objective or moment condition in the first place. Under standard regularity conditions and independence of $X_i$ and $\epsilon_i$, $$ \sqrt{n}(\hat\beta_\psi - \beta_0) \Rightarrow N\left( 0, V_p(\psi)\{E(X_iX_i')\}^{-1} \right), $$ with $$ V_p(\psi) = \frac{E[\psi(\epsilon)^2]}{E[\psi'(\epsilon)]^2}. $$ The likelihood score $\psi_0$ minimizes this scale factor over sufficiently regular scores. But there is a catch: if the induced loss $\ell = -\log p$ is not convex, empirical risk minimization can be computationally awkward and can have undesirable local minima. ## Relation to OLS and GMM OLS solves the moment condition $$ E[X_i(Y_i-X_i'\beta_0)] = 0. $$ This is the same as using the score-like residual transformation $\psi(u)=u$, up to the sign convention for $\psi=-\ell'$. The squared-error loss therefore corresponds to a linear residual score. Antitonic score matching replaces this linear residual score with a learned score: $$ E[X_i\psi(Y_i-X_i'\beta_0)] = 0. $$ So the estimator is a Z-estimator and can be written as an exactly identified GMM estimator. The novelty is not the existence of a moment condition. The novelty is the data-driven choice of $\psi$, together with a monotonicity constraint that makes the implied loss convex. For a fixed $\psi$, the generic sandwich variance is $$ A^{-1}BA^{-1}, $$ where $$ A = E[-X_iX_i'\psi'(\epsilon_i)] $$ and $$ B = E[X_iX_i'\psi(\epsilon_i)^2]. $$ If $\epsilon_i$ is independent of $X_i$, this simplifies to $$ A = E(X_iX_i')E[-\psi'(\epsilon_i)] $$ and $$ B = E(X_iX_i')E[\psi(\epsilon_i)^2]. $$ The asymptotic covariance becomes $$ \frac{E[\psi(\epsilon)^2]}{E[-\psi'(\epsilon)]^2} \{E(X_iX_i')\}^{-1}. $$ For OLS, $\psi(u)=u$, so this is $\sigma^2\{E(X_iX_i')\}^{-1}$ under homoskedasticity. With heteroskedasticity, the usual Huber-White covariance keeps the full meat $$ E[X_iX_i'\epsilon_i^2]. $$ The antitonic score-matching variance is therefore not generally the OLS Huber-White variance. It uses $\psi(\hat\epsilon_i)^2$ in the meat and the derivative of the learned score in the bread. It coincides with OLS only when the learned score is essentially linear. In heavy-tailed or non-log-concave settings, the point is precisely that the learned score differs from the residual. ## Two Small DGPs The following code uses the same notation as the previous section. The helper `sandwich_for_score` computes the exactly identified sandwich for a chosen residual score $\psi$: $$ \hat V = \hat A^{-1}\hat B\hat A^{-1}/n, $$ where $$ \hat A = -\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi'(\hat\epsilon_i) $$ and $$ \hat B = \frac{1}{n}\sum_{i=1}^n X_iX_i'\psi(\hat\epsilon_i)^2. $$ ```{python} #| label: score-helper-code #| code-fold: true import numpy as np import pandas as pd import matplotlib.pyplot as plt from trex import AntitonicScoreMatchingRegression def add_intercept(x): return np.column_stack([np.ones(len(x)), np.asarray(x)]) def ols_fit(X, y): beta = np.linalg.lstsq(X, y, rcond=None)[0] resid = y - X @ beta n, k = X.shape xtx_inv = np.linalg.inv(X.T @ X) sigma2 = resid @ resid / (n - k) classical = sigma2 * xtx_inv meat = X.T @ (resid[:, None] ** 2 * X) hc1 = (n / (n - k)) * xtx_inv @ meat @ xtx_inv return beta, resid, classical, hc1 def sandwich_for_score(X, resid, psi, psi_deriv, correction=True): n, k = X.shape score_values = psi(resid) deriv_values = psi_deriv(resid) A = -(X.T * deriv_values) @ X / n B = X.T @ (score_values[:, None] ** 2 * X) / n vcov = np.linalg.pinv(A) @ B @ np.linalg.pinv(A) / n if correction: vcov *= n / (n - k) return vcov def format_estimates(names, betas, vcovs): rows = [] for name, beta, vcov in zip(names, betas, vcovs): se = np.sqrt(np.diag(vcov)) rows.append( { "estimator": name, "intercept": beta[0], "slope": beta[1], "se_intercept": se[0], "se_slope": se[1], } ) return pd.DataFrame(rows).round(4) ``` ### Linear Score: OLS Exactly Take a clean homoskedastic Gaussian DGP: $$ X_i \sim N(0,1), \qquad Y_i = 1 + 2X_i + \epsilon_i, \qquad \epsilon_i \sim N(0,1). $$ With the squared-error loss $\ell(u)=u^2/2$, the score convention above gives $\psi(u)=-u$ because $\psi=-\ell'$. Multiplying a just-identified moment by $-1$ does not change its root, so this is OLS. The bread and meat in the generic sandwich also reduce to the usual Huber-White sandwich. ```{python} #| label: clean-linear-score-example #| code-fold: true rng = np.random.default_rng(20260525) n = 500 x = rng.normal(size=n) X = add_intercept(x) y = 1.0 + 2.0 * x + rng.normal(size=n) ols_beta, ols_resid, ols_classical, ols_hc1 = ols_fit(X, y) linear_beta = np.linalg.solve(X.T @ X, X.T @ y) linear_resid = y - X @ linear_beta linear_hc1 = sandwich_for_score( X, linear_resid, psi=lambda u: -u, psi_deriv=lambda u: -np.ones_like(u), ) format_estimates( ["OLS", "linear-score GMM"], [ols_beta, linear_beta], [ols_hc1, linear_hc1], ) ``` The two rows are the same because they solve the same normal equations. This is not a new estimator. It is OLS written as a residual-score moment condition. ```{python} #| label: clean-linear-score-difference #| code-fold: true pd.DataFrame( { "max_abs_beta_difference": [np.max(np.abs(ols_beta - linear_beta))], "max_abs_hc1_se_difference": [ np.max(np.abs(np.sqrt(np.diag(ols_hc1)) - np.sqrt(np.diag(linear_hc1)))) ], } ) ``` ### Huber Contamination with Leverage Outliers Now use a contamination DGP: $$ (X_i,\epsilon_i) \sim (1-\eta)F_0 + \eta G, \qquad \eta = 0.04. $$ For the clean component, $$ X_i \sim N(0,1), \qquad \epsilon_i \sim 0.7t_3. $$ For the contaminated component, $$ X_i \sim N(6,0.35^2), \qquad \epsilon_i \sim N(-16,2.5^2). $$ The outliers are bad in two ways: the residuals are large and the covariates are high leverage. A residual score alone does not make this a high-breakdown robust regression problem. It does, however, change the estimating equation and the sandwich variance relative to OLS. ```{python} #| label: contaminated-dgp-example #| code-fold: true rng = np.random.default_rng(20260525) n = 600 eta = 0.04 is_bad = rng.uniform(size=n) < eta x = np.empty(n) eps = np.empty(n) x[~is_bad] = rng.normal(size=(~is_bad).sum()) eps[~is_bad] = 0.7 * rng.standard_t(df=3, size=(~is_bad).sum()) x[is_bad] = rng.normal(loc=6.0, scale=0.35, size=is_bad.sum()) eps[is_bad] = rng.normal(loc=-16.0, scale=2.5, size=is_bad.sum()) X = add_intercept(x) y = 1.0 + 2.0 * x + eps ols_beta, ols_resid, _, ols_hc1 = ols_fit(X, y) ``` For comparison, fit a fixed Huber-score M-estimator and the learned antitonic-score estimator. ```{python} #| label: robust-score-fit-code #| code-fold: true def huber_fit(X, y, c=1.345, max_iter=200, tol=1e-10): beta = np.linalg.lstsq(X, y, rcond=None)[0] for _ in range(max_iter): resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) / 0.6745 if not np.isfinite(scale) or scale < 1e-8: scale = np.std(resid) if np.std(resid) > 0 else 1.0 u = resid / scale weights = np.ones_like(u) mask = np.abs(u) > c weights[mask] = c / np.abs(u[mask]) beta_next = np.linalg.lstsq( X * np.sqrt(weights)[:, None], y * np.sqrt(weights), rcond=None, )[0] if np.linalg.norm(beta_next - beta) < tol * (1 + np.linalg.norm(beta)): beta = beta_next break beta = beta_next resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) / 0.6745 if not np.isfinite(scale) or scale < 1e-8: scale = np.std(resid) if np.std(resid) > 0 else 1.0 def psi(u): return -np.clip(u / scale, -c, c) * scale def psi_deriv(u): return -(np.abs(u / scale) <= c).astype(float) vcov = sandwich_for_score(X, resid, psi=psi, psi_deriv=psi_deriv) return beta, resid, vcov huber_beta, huber_resid, huber_vcov = huber_fit(X, y) asm = AntitonicScoreMatchingRegression( pilot="lad", alt_iter=2, k=1200, kernel_grid_size=2048, max_iter=200, tol=1e-6, ).fit(x[:, None], y) asm_beta = asm.beta_ asm_resid = y - X @ asm_beta asm_vcov = sandwich_for_score( X, asm_resid, psi=lambda u: asm.score_(u), psi_deriv=lambda u: asm.score_.derivative(u), ) format_estimates( ["OLS", "Huber-score M", "antitonic score matching"], [ols_beta, huber_beta, asm_beta], [ols_hc1, huber_vcov, asm_vcov], ) ``` The robust-score estimators no longer solve the OLS normal equations. Their point estimates and standard errors differ because $\psi(\hat\epsilon_i)$ is not proportional to $\hat\epsilon_i$. In this draw, OLS is pulled strongly by the contaminated high-leverage observations; the bounded and learned scores reduce the influence of large residuals, though they do not eliminate the leverage problem. ```{python} #| label: contaminated-dgp-plot #| code-fold: true #| fig-width: 10 #| fig-height: 5 grid = np.linspace(x.min() - 0.5, x.max() + 0.5, 200) X_grid = add_intercept(grid) fig, ax = plt.subplots() ax.scatter(x[~is_bad], y[~is_bad], s=16, alpha=0.55, label="clean draws") ax.scatter(x[is_bad], y[is_bad], s=28, alpha=0.9, label="contaminated draws") ax.plot(grid, X_grid @ ols_beta, linewidth=2, label="OLS") ax.plot(grid, X_grid @ huber_beta, linewidth=2, label="Huber-score M") ax.plot(grid, X_grid @ asm_beta, linewidth=2, label="antitonic score matching") ax.plot(grid, 1.0 + 2.0 * grid, color="black", linestyle="--", linewidth=1.5, label="clean target") ax.set_xlabel("x") ax.set_ylabel("y") ax.legend(loc="best") ax.set_title("Leverage contamination changes the residual-score estimating equation") plt.show() ``` ## Convex Losses and Decreasing Scores The loss $\ell$ is convex if and only if $\ell'$ is increasing. Since $\psi = -\ell'$, convexity is equivalent to $\psi$ being decreasing. This turns the problem into a constrained score approximation problem: $$ \psi_0^* \in \arg\min_{\psi \in \Psi_\downarrow} V_p(\psi), $$ where $\Psi_\downarrow$ is the cone of decreasing score-like functions. The paper's key result is that this optimal convex M-estimation score can be obtained by score matching. In population, the score-matching objective is $$ D_p(\psi) = E[\psi(\epsilon)^2 + 2\psi'(\epsilon)]. $$ Restricting $\psi$ to the decreasing cone and minimizing $D_p(\psi)$ yields the same projected score $\psi_0^*$ that minimizes the asymptotic variance among convex M-estimators. In this setting, score matching is a way to choose an estimating equation, not a generative-model training objective. ## Antitonic Projection The population construction is elegant. Let $F$ be the CDF of the error distribution and define the density quantile function $$ J(u) = p(F^{-1}(u)), \qquad u \in [0,1]. $$ Let $\hat J$ be the least concave majorant of $J$. The optimal decreasing projected score is the right derivative of this concave majorant, pulled back to the residual scale: $$ \psi_0^*(z) = \hat J^{(R)}(F(z)). $$ When the true score $\psi_0$ already decreases, the error density is log-concave and the projection gives back the likelihood score. When the true score is not decreasing, the projection produces the closest decreasing score in the Fisher-divergence geometry relevant for the convex M-estimation problem. For Cauchy errors, the likelihood score is nonmonotone enough to make the likelihood loss nonconvex. The projected score is bounded and Huber-like, so the resulting loss is convex and robust. The paper reports that, for standard Cauchy errors, this restriction to convex losses still retains high asymptotic efficiency relative to the oracle Cauchy maximum likelihood estimator. ## Sample Algorithm The practical algorithm has five parts. 1. Fit a pilot regression, typically LAD or OLS. 2. Form residuals from that pilot fit. 3. Smooth the residual distribution with a kernel density estimator. 4. Estimate the residual score and project it onto the decreasing cone using PAVA. 5. Fit the final regression by convex M-estimation using the negative antiderivative of the projected score. In notation, with pilot residuals $\hat\epsilon_i$, form a smoothed density estimate $\tilde p_n$. The raw score estimate is $$ \tilde\psi_n(z) = \frac{\tilde p_n'(z)}{\tilde p_n(z)}. $$ The antitonic score estimate is $$ \hat\psi_n = M_\downarrow(\tilde\psi_n \circ \tilde F_n^{-1}) \circ \tilde F_n, $$ where $M_\downarrow$ denotes the decreasing projection and $\tilde F_n$ is the CDF associated with the smoothed density. Computationally, this is done by evaluating the density quantile function on a grid and running the pool-adjacent-violators algorithm. For symmetric errors, one can symmetrize the score estimate: $$ \hat\psi_n^{\text{sym}}(z) = \frac{\hat\psi_n(z) - \hat\psi_n(-z)}{2}. $$ The final estimator solves $$ \sum_{i=1}^n X_i \hat\psi_n(Y_i - X_i'\beta) = 0. $$ Equivalently, it minimizes a convex empirical risk. In code this can be solved by Newton or damped Newton steps because the learned score supplies both the gradient and a piecewise derivative. ## Inference The antitonic information is $$ i^*(p) = E[\psi_0^*(\epsilon)^2]. $$ For symmetric errors without a nuisance intercept complication, the asymptotic covariance is $$ \frac{1}{n} \left[ i^*(p) E(X_iX_i') \right]^{-1}. $$ The natural plug-in estimator is $$ \hat i = \frac{1}{n} \sum_{i=1}^n \hat\psi_n(\hat\epsilon_i)^2, $$ and $$ \hat I = \hat i \frac{X'X}{n}. $$ Then a Wald interval for coordinate $j$ is $$ \hat\beta_j \pm z_{1-\alpha/2} \sqrt{ \frac{(\hat I^{-1})_{jj}}{n} }. $$ The full paper develops cross-fitting and intercept handling to justify this asymptotically. Practical implementations can use the same plug-in logic while keeping the numerical algorithm simpler. ## Minimal Software Form A minimal software interface needs two components: ```python from trex import AntitonicScoreMatchingRegression, estimate_antitonic_score ``` The lower-level function `estimate_antitonic_score` takes residuals and returns a callable piecewise-linear score estimate. It: - computes a Gaussian KDE on a grid; - constructs the smoothed CDF; - evaluates the density quantile function; - projects the discrete derivative onto the decreasing cone with PAVA; - optionally symmetrizes the result. The regression estimator then: - builds an intercept column if requested; - computes an OLS or LAD pilot; - estimates the antitonic residual score; - solves the final convex M-estimation equations by damped Newton steps; - stores coefficients, residuals, fitted values, antitonic information, covariance estimates, and Wald intervals. This is narrower than the full R package `asm`. It omits the paper's cross-fitted theoretical estimator, asymmetric intercept selection, diagnostic plots, and prediction intervals. It implements the central practical mechanism: data-driven convex robust regression from residual score matching. ## Score-Based Simulation Score matching also appears in modern generative modeling, including diffusion models. The connection is worth understanding because it is not intrinsically about producing pictures. The useful abstraction is a way to represent, perturb, and sample from complicated distributions through their score functions. In the regression setting above, the score is a one-dimensional residual score used to choose an estimating equation. The output is $\hat\beta$, standard errors, and confidence intervals. The score is a tool for efficient semiparametric estimation. In score-based simulation, the score is usually a high-dimensional vector field $$ s_\theta(x,t) \approx \nabla_x \log p_t(x), $$ where $p_t$ is a noise-corrupted version of a target distribution. The object $x$ need not be an image. It can be a vector of covariates, a panel history, a vector of latent shocks, a discretized policy function, or simulated microdata satisfying structural restrictions. The basic construction is: 1. Start from draws $x_0$ from a target distribution. 2. Add noise through a known stochastic path, producing $x_t$. 3. Train a model to estimate the score $\nabla_x \log p_t(x)$ of the noised distribution. 4. Use the learned score to move draws from high noise back toward the target distribution. A DDPM is one discretized version of this idea. It trains a model to predict the Gaussian noise added to $x_0$. That noise prediction implicitly estimates the score because $$ \nabla_{x_t}\log q(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}. $$ The statistical analogy is: - residual score matching estimates a one-dimensional score to choose a robust estimating equation; - diffusion-style score modeling estimates a high-dimensional score to build a simulator; - both avoid normalized likelihoods and work directly with derivatives of log density. The estimand determines whether this is useful. For statistical estimation, the score must improve an estimator or an inferential procedure. For simulation, the score must generate objects whose distributional or structural properties can be checked against a clear target. Without that target, score-based generation is just an impressive sampler. ## Structural Econometrics Angle The antitonic score-matching paper suggests a useful discipline for score-based structural work. The score model should be tied to a statistical target, not only to sample quality. For simulated method of moments, the classical objective is $$ Q(\theta) = \left[ m_{\text{data}} - E_\theta m(Y) \right]' W \left[ m_{\text{data}} - E_\theta m(Y) \right]. $$ A score-based model can enter as an auxiliary simulator or proposal mechanism, but the target remains moment fit and inference on $\theta$. The analogy to antitonic score matching is: - learn a score from data or residuals; - constrain it enough to preserve an estimation target; - use the learned score to improve optimization or simulation; - report uncertainty through the original estimand, not through visual fit alone. For structural models, possible uses include: - robust residual-score estimation in linearized or local GMM steps; - score-guided draws of latent shocks conditional on observed choices; - score-based proposals for dynamic discrete-choice histories; - moment-guided reverse sampling where structural residuals guide the score path; - adaptive convex losses for simulated auxiliary regressions. The useful discipline is to keep the score model attached to an estimand. In regression, that estimand is $\beta_0$. In SMM, it is $\theta$. In a generative application, it may be a distributional simulator, but then the evaluation target should be stated explicitly. For example, in a dynamic discrete-choice model, a score-based simulator could generate latent shock histories conditional on observed state-action paths. The structural model would still impose Bellman inequalities, transition laws, and counterfactual policy restrictions. The learned score would be a computational device for exploring the latent space, not a substitute for the structural model. Similarly, in SMM, a score-based sampler could propose simulated datasets or latent variables that are closer to the empirical support than naive Monte Carlo draws. The SMM criterion would still be evaluated through moments, weights, and sampling error: $$ \hat\theta = \arg\min_\theta \left[ \hat m_{\text{data}} - \hat m_{\theta} \right]' W \left[ \hat m_{\text{data}} - \hat m_{\theta} \right]. $$ This is the serious reason to care about diffusion-style machinery in econometrics: not because the canonical demos generate faces, but because the same score-based machinery can act as an adaptive simulator, proposal distribution, or inverse-problem solver for objects that are hard to sample directly. ## References - Oliver Y. Feng, Yu-Chun Kao, Min Xu, and Richard J. Samworth, "Optimal convex M-estimation via score matching," arXiv:2403.16688, 2025 revision. - `asm`: "Optimal Convex M-Estimation for Linear Regression via Antitonic Score Matching," R package version 0.2.4. - Aapo Hyvarinen, "Estimation of Non-Normalized Statistical Models by Score Matching," 2005. - Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations," 2021.