Antitonic score projection, convex M-estimation, and score-based simulation
Problem
Score matching estimates derivatives of log densities without requiring a normalized likelihood. In regression, this can be used to build estimating equations that adapt to the unknown error distribution. Feng, Kao, Xu, and Samworth develop this idea for linear regression by constructing a data-driven convex loss whose empirical minimizer has optimal asymptotic covariance among convex M-estimators.
The object of interest is a regression model
\[
Y_i = X_i'\beta_0 + \epsilon_i,
\]
where the error distribution is unknown and may be heavy-tailed or non-Gaussian. The goal is to estimate \(\beta_0\) efficiently and robustly while preserving a convex optimization problem.
An econometrician can read the method as adaptive GMM with a special scalar transformation of the residual. Ordinary least squares uses the residual itself in the moment condition. Antitonic score matching learns a residual score from the data, imposes the shape restriction needed for a convex loss, and then uses that score in the estimating equation.
Scores and M-Estimation
Start from the ordinary statistical learning problem. We observe pairs \((Y_i,X_i)\) and want a predictor \(f_\theta(X_i)\) for the label \(Y_i\). A generic empirical risk estimator solves
Nothing in this formulation says that the loss must be squared error. Squared error, absolute error, quantile loss, logistic loss, and likelihood-based losses are all choices for \(\mathcal{L}\). The choice of loss determines what aspect of the conditional distribution is being targeted and how observations with large errors influence the estimate.
For the linear regression model, take
\[
f_\beta(X_i) = X_i'\beta.
\]
Many regression losses depend on the data only through the scalar residual
\[
r_i(\beta) = Y_i - X_i'\beta.
\]
This residual reduction gives the M-estimation form
OLS chooses \(\ell(r)=r^2/2\). LAD chooses \(\ell(r)=|r|\). Huber regression chooses a loss that is quadratic near zero and linear in the tails. These are different estimators because they choose different ways to price residuals.
Likelihood is another way to choose \(\ell\). Suppose the regression error has density \(p\) and
Conditional on \(X_i\), the density of \(Y_i\) under candidate \(\beta\) is
\[
p(Y_i-X_i'\beta)
=
p(r_i(\beta)).
\]
The log likelihood is
\[
\sum_{i=1}^n \log p(r_i(\beta)).
\]
Maximizing this likelihood is the same as minimizing the residual loss
\[
\ell_0(r) = -\log p(r).
\]
Thus L2 loss is not the starting point. It is the likelihood loss for Gaussian errors. Laplace errors give LAD. Cauchy errors give a nonconvex log loss. The antitonic score-matching method asks whether one can learn a good residual transformation from the data while retaining a convex optimization problem.
This is a derivative with respect to the scalar residual, but the residual is not the optimization variable. The parameter is still \(\beta\). The residual derivative appears by the chain rule when differentiating the likelihood with respect to \(\beta\):
Setting this to zero gives the likelihood estimating equation. The residual score is the scalar component of the parameter score; it is not a separate optimization over residuals.
It is convenient to write a general residual transformation as \(\psi\). If \(\ell\) is differentiable, define
Multiplying a just-identified moment by a nonzero constant does not change its root, so the estimating equation is equivalent to
\[
\sum_{i=1}^n X_i r_i(\beta),
\]
which gives the OLS normal equations. The Gaussian case is therefore a special case with a linear residual score, not the primitive object from which the rest of the theory starts.
The function \(\psi\) determines the estimating equation. It is not the Newton-Raphson update direction. A numerical Newton step uses derivatives of the sample objective to move \(\beta\) toward a solution; \(\psi\) is the residual transformation that defines the sample objective or moment condition in the first place.
Under standard regularity conditions and independence of \(X_i\) and \(\epsilon_i\),
The likelihood score \(\psi_0\) minimizes this scale factor over sufficiently regular scores. But there is a catch: if the induced loss \(\ell = -\log p\) is not convex, empirical risk minimization can be computationally awkward and can have undesirable local minima.
Relation to OLS and GMM
OLS solves the moment condition
\[
E[X_i(Y_i-X_i'\beta_0)] = 0.
\]
This is the same as using the score-like residual transformation \(\psi(u)=u\), up to the sign convention for \(\psi=-\ell'\). The squared-error loss therefore corresponds to a linear residual score.
Antitonic score matching replaces this linear residual score with a learned score:
\[
E[X_i\psi(Y_i-X_i'\beta_0)] = 0.
\]
So the estimator is a Z-estimator and can be written as an exactly identified GMM estimator. The novelty is not the existence of a moment condition. The novelty is the data-driven choice of \(\psi\), together with a monotonicity constraint that makes the implied loss convex.
For a fixed \(\psi\), the generic sandwich variance is
\[
A^{-1}BA^{-1},
\]
where
\[
A
=
E[-X_iX_i'\psi'(\epsilon_i)]
\]
and
\[
B
=
E[X_iX_i'\psi(\epsilon_i)^2].
\]
If \(\epsilon_i\) is independent of \(X_i\), this simplifies to
For OLS, \(\psi(u)=u\), so this is \(\sigma^2\{E(X_iX_i')\}^{-1}\) under homoskedasticity. With heteroskedasticity, the usual Huber-White covariance keeps the full meat
\[
E[X_iX_i'\epsilon_i^2].
\]
The antitonic score-matching variance is therefore not generally the OLS Huber-White variance. It uses \(\psi(\hat\epsilon_i)^2\) in the meat and the derivative of the learned score in the bread. It coincides with OLS only when the learned score is essentially linear. In heavy-tailed or non-log-concave settings, the point is precisely that the learned score differs from the residual.
Two Small DGPs
The following code uses the same notation as the previous section. The helper sandwich_for_score computes the exactly identified sandwich for a chosen residual score \(\psi\):
\[
\hat V
=
\hat A^{-1}\hat B\hat A^{-1}/n,
\]
where
\[
\hat A
=
-\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi'(\hat\epsilon_i)
\]
and
\[
\hat B
=
\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi(\hat\epsilon_i)^2.
\]
Code
import numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom trex import AntitonicScoreMatchingRegressiondef add_intercept(x):return np.column_stack([np.ones(len(x)), np.asarray(x)])def ols_fit(X, y): beta = np.linalg.lstsq(X, y, rcond=None)[0] resid = y - X @ beta n, k = X.shape xtx_inv = np.linalg.inv(X.T @ X) sigma2 = resid @ resid / (n - k) classical = sigma2 * xtx_inv meat = X.T @ (resid[:, None] **2* X) hc1 = (n / (n - k)) * xtx_inv @ meat @ xtx_invreturn beta, resid, classical, hc1def sandwich_for_score(X, resid, psi, psi_deriv, correction=True): n, k = X.shape score_values = psi(resid) deriv_values = psi_deriv(resid) A =-(X.T * deriv_values) @ X / n B = X.T @ (score_values[:, None] **2* X) / n vcov = np.linalg.pinv(A) @ B @ np.linalg.pinv(A) / nif correction: vcov *= n / (n - k)return vcovdef format_estimates(names, betas, vcovs): rows = []for name, beta, vcov inzip(names, betas, vcovs): se = np.sqrt(np.diag(vcov)) rows.append( {"estimator": name,"intercept": beta[0],"slope": beta[1],"se_intercept": se[0],"se_slope": se[1], } )return pd.DataFrame(rows).round(4)
With the squared-error loss \(\ell(u)=u^2/2\), the score convention above gives \(\psi(u)=-u\) because \(\psi=-\ell'\). Multiplying a just-identified moment by \(-1\) does not change its root, so this is OLS. The bread and meat in the generic sandwich also reduce to the usual Huber-White sandwich.
The two rows are the same because they solve the same normal equations. This is not a new estimator. It is OLS written as a residual-score moment condition.
The outliers are bad in two ways: the residuals are large and the covariates are high leverage. A residual score alone does not make this a high-breakdown robust regression problem. It does, however, change the estimating equation and the sandwich variance relative to OLS.
For comparison, fit a fixed Huber-score M-estimator and the learned antitonic-score estimator.
Code
def huber_fit(X, y, c=1.345, max_iter=200, tol=1e-10): beta = np.linalg.lstsq(X, y, rcond=None)[0]for _ inrange(max_iter): resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) /0.6745ifnot np.isfinite(scale) or scale <1e-8: scale = np.std(resid) if np.std(resid) >0else1.0 u = resid / scale weights = np.ones_like(u) mask = np.abs(u) > c weights[mask] = c / np.abs(u[mask]) beta_next = np.linalg.lstsq( X * np.sqrt(weights)[:, None], y * np.sqrt(weights), rcond=None, )[0]if np.linalg.norm(beta_next - beta) < tol * (1+ np.linalg.norm(beta)): beta = beta_nextbreak beta = beta_next resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) /0.6745ifnot np.isfinite(scale) or scale <1e-8: scale = np.std(resid) if np.std(resid) >0else1.0def psi(u):return-np.clip(u / scale, -c, c) * scaledef psi_deriv(u):return-(np.abs(u / scale) <= c).astype(float) vcov = sandwich_for_score(X, resid, psi=psi, psi_deriv=psi_deriv)return beta, resid, vcovhuber_beta, huber_resid, huber_vcov = huber_fit(X, y)asm = AntitonicScoreMatchingRegression( pilot="lad", alt_iter=2, k=1200, kernel_grid_size=2048, max_iter=200, tol=1e-6,).fit(x[:, None], y)asm_beta = asm.beta_asm_resid = y - X @ asm_betaasm_vcov = sandwich_for_score( X, asm_resid, psi=lambda u: asm.score_(u), psi_deriv=lambda u: asm.score_.derivative(u),)format_estimates( ["OLS", "Huber-score M", "antitonic score matching"], [ols_beta, huber_beta, asm_beta], [ols_hc1, huber_vcov, asm_vcov],)
estimator
intercept
slope
se_intercept
se_slope
0
OLS
0.6861
0.4108
0.0818
0.1372
1
Huber-score M
0.8793
1.5439
0.0470
0.1080
2
antitonic score matching
0.8985
1.7086
0.0358
0.0390
The robust-score estimators no longer solve the OLS normal equations. Their point estimates and standard errors differ because \(\psi(\hat\epsilon_i)\) is not proportional to \(\hat\epsilon_i\). In this draw, OLS is pulled strongly by the contaminated high-leverage observations; the bounded and learned scores reduce the influence of large residuals, though they do not eliminate the leverage problem.
The loss \(\ell\) is convex if and only if \(\ell'\) is increasing. Since \(\psi = -\ell'\), convexity is equivalent to \(\psi\) being decreasing. This turns the problem into a constrained score approximation problem:
where \(\Psi_\downarrow\) is the cone of decreasing score-like functions.
The paper’s key result is that this optimal convex M-estimation score can be obtained by score matching. In population, the score-matching objective is
Restricting \(\psi\) to the decreasing cone and minimizing \(D_p(\psi)\) yields the same projected score \(\psi_0^*\) that minimizes the asymptotic variance among convex M-estimators.
In this setting, score matching is a way to choose an estimating equation, not a generative-model training objective.
Antitonic Projection
The population construction is elegant. Let \(F\) be the CDF of the error distribution and define the density quantile function
\[
J(u) = p(F^{-1}(u)),
\qquad
u \in [0,1].
\]
Let \(\hat J\) be the least concave majorant of \(J\). The optimal decreasing projected score is the right derivative of this concave majorant, pulled back to the residual scale:
\[
\psi_0^*(z)
=
\hat J^{(R)}(F(z)).
\]
When the true score \(\psi_0\) already decreases, the error density is log-concave and the projection gives back the likelihood score. When the true score is not decreasing, the projection produces the closest decreasing score in the Fisher-divergence geometry relevant for the convex M-estimation problem.
For Cauchy errors, the likelihood score is nonmonotone enough to make the likelihood loss nonconvex. The projected score is bounded and Huber-like, so the resulting loss is convex and robust. The paper reports that, for standard Cauchy errors, this restriction to convex losses still retains high asymptotic efficiency relative to the oracle Cauchy maximum likelihood estimator.
Sample Algorithm
The practical algorithm has five parts.
Fit a pilot regression, typically LAD or OLS.
Form residuals from that pilot fit.
Smooth the residual distribution with a kernel density estimator.
Estimate the residual score and project it onto the decreasing cone using PAVA.
Fit the final regression by convex M-estimation using the negative antiderivative of the projected score.
In notation, with pilot residuals \(\hat\epsilon_i\), form a smoothed density estimate \(\tilde p_n\). The raw score estimate is
where \(M_\downarrow\) denotes the decreasing projection and \(\tilde F_n\) is the CDF associated with the smoothed density. Computationally, this is done by evaluating the density quantile function on a grid and running the pool-adjacent-violators algorithm.
For symmetric errors, one can symmetrize the score estimate:
Equivalently, it minimizes a convex empirical risk. In code this can be solved by Newton or damped Newton steps because the learned score supplies both the gradient and a piecewise derivative.
Inference
The antitonic information is
\[
i^*(p)
=
E[\psi_0^*(\epsilon)^2].
\]
For symmetric errors without a nuisance intercept complication, the asymptotic covariance is
The full paper develops cross-fitting and intercept handling to justify this asymptotically. Practical implementations can use the same plug-in logic while keeping the numerical algorithm simpler.
Minimal Software Form
A minimal software interface needs two components:
from trex import AntitonicScoreMatchingRegression, estimate_antitonic_score
The lower-level function estimate_antitonic_score takes residuals and returns a callable piecewise-linear score estimate. It:
computes a Gaussian KDE on a grid;
constructs the smoothed CDF;
evaluates the density quantile function;
projects the discrete derivative onto the decreasing cone with PAVA;
optionally symmetrizes the result.
The regression estimator then:
builds an intercept column if requested;
computes an OLS or LAD pilot;
estimates the antitonic residual score;
solves the final convex M-estimation equations by damped Newton steps;
This is narrower than the full R package asm. It omits the paper’s cross-fitted theoretical estimator, asymmetric intercept selection, diagnostic plots, and prediction intervals. It implements the central practical mechanism: data-driven convex robust regression from residual score matching.
Score-Based Simulation
Score matching also appears in modern generative modeling, including diffusion models. The connection is worth understanding because it is not intrinsically about producing pictures. The useful abstraction is a way to represent, perturb, and sample from complicated distributions through their score functions.
In the regression setting above, the score is a one-dimensional residual score used to choose an estimating equation. The output is \(\hat\beta\), standard errors, and confidence intervals. The score is a tool for efficient semiparametric estimation.
In score-based simulation, the score is usually a high-dimensional vector field
\[
s_\theta(x,t)
\approx
\nabla_x \log p_t(x),
\]
where \(p_t\) is a noise-corrupted version of a target distribution. The object \(x\) need not be an image. It can be a vector of covariates, a panel history, a vector of latent shocks, a discretized policy function, or simulated microdata satisfying structural restrictions.
The basic construction is:
Start from draws \(x_0\) from a target distribution.
Add noise through a known stochastic path, producing \(x_t\).
Train a model to estimate the score \(\nabla_x \log p_t(x)\) of the noised distribution.
Use the learned score to move draws from high noise back toward the target distribution.
A DDPM is one discretized version of this idea. It trains a model to predict the Gaussian noise added to \(x_0\). That noise prediction implicitly estimates the score because
residual score matching estimates a one-dimensional score to choose a robust estimating equation;
diffusion-style score modeling estimates a high-dimensional score to build a simulator;
both avoid normalized likelihoods and work directly with derivatives of log density.
The estimand determines whether this is useful. For statistical estimation, the score must improve an estimator or an inferential procedure. For simulation, the score must generate objects whose distributional or structural properties can be checked against a clear target. Without that target, score-based generation is just an impressive sampler.
Structural Econometrics Angle
The antitonic score-matching paper suggests a useful discipline for score-based structural work. The score model should be tied to a statistical target, not only to sample quality.
For simulated method of moments, the classical objective is
A score-based model can enter as an auxiliary simulator or proposal mechanism, but the target remains moment fit and inference on \(\theta\). The analogy to antitonic score matching is:
learn a score from data or residuals;
constrain it enough to preserve an estimation target;
use the learned score to improve optimization or simulation;
report uncertainty through the original estimand, not through visual fit alone.
For structural models, possible uses include:
robust residual-score estimation in linearized or local GMM steps;
score-guided draws of latent shocks conditional on observed choices;
score-based proposals for dynamic discrete-choice histories;
moment-guided reverse sampling where structural residuals guide the score path;
adaptive convex losses for simulated auxiliary regressions.
The useful discipline is to keep the score model attached to an estimand. In regression, that estimand is \(\beta_0\). In SMM, it is \(\theta\). In a generative application, it may be a distributional simulator, but then the evaluation target should be stated explicitly.
For example, in a dynamic discrete-choice model, a score-based simulator could generate latent shock histories conditional on observed state-action paths. The structural model would still impose Bellman inequalities, transition laws, and counterfactual policy restrictions. The learned score would be a computational device for exploring the latent space, not a substitute for the structural model.
Similarly, in SMM, a score-based sampler could propose simulated datasets or latent variables that are closer to the empirical support than naive Monte Carlo draws. The SMM criterion would still be evaluated through moments, weights, and sampling error:
This is the serious reason to care about diffusion-style machinery in econometrics: not because the canonical demos generate faces, but because the same score-based machinery can act as an adaptive simulator, proposal distribution, or inverse-problem solver for objects that are hard to sample directly.
References
Oliver Y. Feng, Yu-Chun Kao, Min Xu, and Richard J. Samworth, “Optimal convex M-estimation via score matching,” arXiv:2403.16688, 2025 revision.
asm: “Optimal Convex M-Estimation for Linear Regression via Antitonic Score Matching,” R package version 0.2.4.
Aapo Hyvarinen, “Estimation of Non-Normalized Statistical Models by Score Matching,” 2005.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” 2021.
Source Code
---title: "Score Matching for Statistical Estimation and Inference"subtitle: "Antitonic score projection, convex M-estimation, and score-based simulation"format: html: embed-resources: true page-layout: full toc: true toc-depth: 3 code-fold: true code-tools: trueexecute: echo: true warning: false message: falsejupyter: python3---## ProblemScore matching estimates derivatives of log densities without requiring a normalized likelihood. In regression, this can be used to build estimating equations that adapt to the unknown error distribution. Feng, Kao, Xu, and Samworth develop this idea for linear regression by constructing a data-driven convex loss whose empirical minimizer has optimal asymptotic covariance among convex M-estimators.The object of interest is a regression model$$Y_i = X_i'\beta_0 + \epsilon_i,$$where the error distribution is unknown and may be heavy-tailed or non-Gaussian. The goal is to estimate $\beta_0$ efficiently and robustly while preserving a convex optimization problem.An econometrician can read the method as adaptive GMM with a special scalar transformation of the residual. Ordinary least squares uses the residual itself in the moment condition. Antitonic score matching learns a residual score from the data, imposes the shape restriction needed for a convex loss, and then uses that score in the estimating equation.## Scores and M-EstimationStart from the ordinary statistical learning problem. We observe pairs $(Y_i,X_i)$ and want a predictor $f_\theta(X_i)$ for the label $Y_i$. A generic empirical risk estimator solves$$\hat\theta\in\arg\min_\theta\frac{1}{n}\sum_{i=1}^n\mathcal{L}\{Y_i,f_\theta(X_i)\}.$$Nothing in this formulation says that the loss must be squared error. Squared error, absolute error, quantile loss, logistic loss, and likelihood-based losses are all choices for $\mathcal{L}$. The choice of loss determines what aspect of the conditional distribution is being targeted and how observations with large errors influence the estimate.For the linear regression model, take$$f_\beta(X_i) = X_i'\beta.$$Many regression losses depend on the data only through the scalar residual$$r_i(\beta) = Y_i - X_i'\beta.$$This residual reduction gives the M-estimation form$$\hat\beta\in\arg\min_\beta\frac{1}{n}\sum_{i=1}^n\ell(r_i(\beta)).$$OLS chooses $\ell(r)=r^2/2$. LAD chooses $\ell(r)=|r|$. Huber regression chooses a loss that is quadratic near zero and linear in the tails. These are different estimators because they choose different ways to price residuals.Likelihood is another way to choose $\ell$. Suppose the regression error has density $p$ and$$Y_i = X_i'\beta_0 + \epsilon_i,\qquad\epsilon_i \sim p.$$Conditional on $X_i$, the density of $Y_i$ under candidate $\beta$ is$$p(Y_i-X_i'\beta)=p(r_i(\beta)).$$The log likelihood is$$\sum_{i=1}^n \log p(r_i(\beta)).$$Maximizing this likelihood is the same as minimizing the residual loss$$\ell_0(r) = -\log p(r).$$Thus L2 loss is not the starting point. It is the likelihood loss for Gaussian errors. Laplace errors give LAD. Cauchy errors give a nonconvex log loss. The antitonic score-matching method asks whether one can learn a good residual transformation from the data while retaining a convex optimization problem.Now define the residual score$$\psi_0(r)=\frac{\partial}{\partial r}\log p(r)=\frac{p'(r)}{p(r)}.$$This is a derivative with respect to the scalar residual, but the residual is not the optimization variable. The parameter is still $\beta$. The residual derivative appears by the chain rule when differentiating the likelihood with respect to $\beta$:$$L(\beta)=\sum_{i=1}^n\log p(r_i(\beta)).$$$$\frac{\partial L(\beta)}{\partial \beta}=\sum_{i=1}^n\frac{\partial \log p(r_i)}{\partial r_i}\frac{\partial r_i(\beta)}{\partial \beta}.$$Since $\partial r_i(\beta)/\partial\beta=-X_i$, the parameter score is$$\frac{\partial L(\beta)}{\partial \beta}=-\sum_{i=1}^nX_i\frac{p'(r_i(\beta))}{p(r_i(\beta))}.$$Setting this to zero gives the likelihood estimating equation. The residual score is the scalar component of the parameter score; it is not a separate optimization over residuals.It is convenient to write a general residual transformation as $\psi$. If $\ell$ is differentiable, define$$\psi(r) = -\ell'(r).$$Then the M-estimator solves$$\frac{1}{n}\sum_{i=1}^nX_i\psi(Y_i-X_i'\hat\beta_\psi)=0.$$For Gaussian errors,$$\ell(r)=\frac{r^2}{2\sigma^2},\qquad\psi(r)=-\frac{r}{\sigma^2}.$$Multiplying a just-identified moment by a nonzero constant does not change its root, so the estimating equation is equivalent to$$\sum_{i=1}^n X_i r_i(\beta),$$which gives the OLS normal equations. The Gaussian case is therefore a special case with a linear residual score, not the primitive object from which the rest of the theory starts.The function $\psi$ determines the estimating equation. It is not the Newton-Raphson update direction. A numerical Newton step uses derivatives of the sample objective to move $\beta$ toward a solution; $\psi$ is the residual transformation that defines the sample objective or moment condition in the first place.Under standard regularity conditions and independence of $X_i$ and $\epsilon_i$,$$\sqrt{n}(\hat\beta_\psi - \beta_0)\RightarrowN\left( 0, V_p(\psi)\{E(X_iX_i')\}^{-1}\right),$$with$$V_p(\psi)=\frac{E[\psi(\epsilon)^2]}{E[\psi'(\epsilon)]^2}.$$The likelihood score $\psi_0$ minimizes this scale factor over sufficiently regular scores. But there is a catch: if the induced loss $\ell = -\log p$ is not convex, empirical risk minimization can be computationally awkward and can have undesirable local minima.## Relation to OLS and GMMOLS solves the moment condition$$E[X_i(Y_i-X_i'\beta_0)] = 0.$$This is the same as using the score-like residual transformation $\psi(u)=u$, up to the sign convention for $\psi=-\ell'$. The squared-error loss therefore corresponds to a linear residual score.Antitonic score matching replaces this linear residual score with a learned score:$$E[X_i\psi(Y_i-X_i'\beta_0)] = 0.$$So the estimator is a Z-estimator and can be written as an exactly identified GMM estimator. The novelty is not the existence of a moment condition. The novelty is the data-driven choice of $\psi$, together with a monotonicity constraint that makes the implied loss convex.For a fixed $\psi$, the generic sandwich variance is$$A^{-1}BA^{-1},$$where$$A=E[-X_iX_i'\psi'(\epsilon_i)]$$and$$B=E[X_iX_i'\psi(\epsilon_i)^2].$$If $\epsilon_i$ is independent of $X_i$, this simplifies to$$A=E(X_iX_i')E[-\psi'(\epsilon_i)]$$and$$B=E(X_iX_i')E[\psi(\epsilon_i)^2].$$The asymptotic covariance becomes$$\frac{E[\psi(\epsilon)^2]}{E[-\psi'(\epsilon)]^2}\{E(X_iX_i')\}^{-1}.$$For OLS, $\psi(u)=u$, so this is $\sigma^2\{E(X_iX_i')\}^{-1}$ under homoskedasticity. With heteroskedasticity, the usual Huber-White covariance keeps the full meat$$E[X_iX_i'\epsilon_i^2].$$The antitonic score-matching variance is therefore not generally the OLS Huber-White variance. It uses $\psi(\hat\epsilon_i)^2$ in the meat and the derivative of the learned score in the bread. It coincides with OLS only when the learned score is essentially linear. In heavy-tailed or non-log-concave settings, the point is precisely that the learned score differs from the residual.## Two Small DGPsThe following code uses the same notation as the previous section. The helper `sandwich_for_score` computes the exactly identified sandwich for a chosen residual score $\psi$:$$\hat V=\hat A^{-1}\hat B\hat A^{-1}/n,$$where$$\hat A=-\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi'(\hat\epsilon_i)$$and$$\hat B=\frac{1}{n}\sum_{i=1}^n X_iX_i'\psi(\hat\epsilon_i)^2.$$```{python}#| label: score-helper-code#| code-fold: trueimport numpy as npimport pandas as pdimport matplotlib.pyplot as pltfrom trex import AntitonicScoreMatchingRegressiondef add_intercept(x):return np.column_stack([np.ones(len(x)), np.asarray(x)])def ols_fit(X, y): beta = np.linalg.lstsq(X, y, rcond=None)[0] resid = y - X @ beta n, k = X.shape xtx_inv = np.linalg.inv(X.T @ X) sigma2 = resid @ resid / (n - k) classical = sigma2 * xtx_inv meat = X.T @ (resid[:, None] **2* X) hc1 = (n / (n - k)) * xtx_inv @ meat @ xtx_invreturn beta, resid, classical, hc1def sandwich_for_score(X, resid, psi, psi_deriv, correction=True): n, k = X.shape score_values = psi(resid) deriv_values = psi_deriv(resid) A =-(X.T * deriv_values) @ X / n B = X.T @ (score_values[:, None] **2* X) / n vcov = np.linalg.pinv(A) @ B @ np.linalg.pinv(A) / nif correction: vcov *= n / (n - k)return vcovdef format_estimates(names, betas, vcovs): rows = []for name, beta, vcov inzip(names, betas, vcovs): se = np.sqrt(np.diag(vcov)) rows.append( {"estimator": name,"intercept": beta[0],"slope": beta[1],"se_intercept": se[0],"se_slope": se[1], } )return pd.DataFrame(rows).round(4)```### Linear Score: OLS ExactlyTake a clean homoskedastic Gaussian DGP:$$X_i \sim N(0,1),\qquadY_i = 1 + 2X_i + \epsilon_i,\qquad\epsilon_i \sim N(0,1).$$With the squared-error loss $\ell(u)=u^2/2$, the score convention above gives $\psi(u)=-u$ because $\psi=-\ell'$. Multiplying a just-identified moment by $-1$ does not change its root, so this is OLS. The bread and meat in the generic sandwich also reduce to the usual Huber-White sandwich.```{python}#| label: clean-linear-score-example#| code-fold: truerng = np.random.default_rng(20260525)n =500x = rng.normal(size=n)X = add_intercept(x)y =1.0+2.0* x + rng.normal(size=n)ols_beta, ols_resid, ols_classical, ols_hc1 = ols_fit(X, y)linear_beta = np.linalg.solve(X.T @ X, X.T @ y)linear_resid = y - X @ linear_betalinear_hc1 = sandwich_for_score( X, linear_resid, psi=lambda u: -u, psi_deriv=lambda u: -np.ones_like(u),)format_estimates( ["OLS", "linear-score GMM"], [ols_beta, linear_beta], [ols_hc1, linear_hc1],)```The two rows are the same because they solve the same normal equations. This is not a new estimator. It is OLS written as a residual-score moment condition.```{python}#| label: clean-linear-score-difference#| code-fold: truepd.DataFrame( {"max_abs_beta_difference": [np.max(np.abs(ols_beta - linear_beta))],"max_abs_hc1_se_difference": [ np.max(np.abs(np.sqrt(np.diag(ols_hc1)) - np.sqrt(np.diag(linear_hc1)))) ], })```### Huber Contamination with Leverage OutliersNow use a contamination DGP:$$(X_i,\epsilon_i)\sim(1-\eta)F_0 + \eta G,\qquad\eta = 0.04.$$For the clean component,$$X_i \sim N(0,1),\qquad\epsilon_i \sim 0.7t_3.$$For the contaminated component,$$X_i \sim N(6,0.35^2),\qquad\epsilon_i \sim N(-16,2.5^2).$$The outliers are bad in two ways: the residuals are large and the covariates are high leverage. A residual score alone does not make this a high-breakdown robust regression problem. It does, however, change the estimating equation and the sandwich variance relative to OLS.```{python}#| label: contaminated-dgp-example#| code-fold: truerng = np.random.default_rng(20260525)n =600eta =0.04is_bad = rng.uniform(size=n) < etax = np.empty(n)eps = np.empty(n)x[~is_bad] = rng.normal(size=(~is_bad).sum())eps[~is_bad] =0.7* rng.standard_t(df=3, size=(~is_bad).sum())x[is_bad] = rng.normal(loc=6.0, scale=0.35, size=is_bad.sum())eps[is_bad] = rng.normal(loc=-16.0, scale=2.5, size=is_bad.sum())X = add_intercept(x)y =1.0+2.0* x + epsols_beta, ols_resid, _, ols_hc1 = ols_fit(X, y)```For comparison, fit a fixed Huber-score M-estimator and the learned antitonic-score estimator.```{python}#| label: robust-score-fit-code#| code-fold: truedef huber_fit(X, y, c=1.345, max_iter=200, tol=1e-10): beta = np.linalg.lstsq(X, y, rcond=None)[0]for _ inrange(max_iter): resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) /0.6745ifnot np.isfinite(scale) or scale <1e-8: scale = np.std(resid) if np.std(resid) >0else1.0 u = resid / scale weights = np.ones_like(u) mask = np.abs(u) > c weights[mask] = c / np.abs(u[mask]) beta_next = np.linalg.lstsq( X * np.sqrt(weights)[:, None], y * np.sqrt(weights), rcond=None, )[0]if np.linalg.norm(beta_next - beta) < tol * (1+ np.linalg.norm(beta)): beta = beta_nextbreak beta = beta_next resid = y - X @ beta scale = np.median(np.abs(resid - np.median(resid))) /0.6745ifnot np.isfinite(scale) or scale <1e-8: scale = np.std(resid) if np.std(resid) >0else1.0def psi(u):return-np.clip(u / scale, -c, c) * scaledef psi_deriv(u):return-(np.abs(u / scale) <= c).astype(float) vcov = sandwich_for_score(X, resid, psi=psi, psi_deriv=psi_deriv)return beta, resid, vcovhuber_beta, huber_resid, huber_vcov = huber_fit(X, y)asm = AntitonicScoreMatchingRegression( pilot="lad", alt_iter=2, k=1200, kernel_grid_size=2048, max_iter=200, tol=1e-6,).fit(x[:, None], y)asm_beta = asm.beta_asm_resid = y - X @ asm_betaasm_vcov = sandwich_for_score( X, asm_resid, psi=lambda u: asm.score_(u), psi_deriv=lambda u: asm.score_.derivative(u),)format_estimates( ["OLS", "Huber-score M", "antitonic score matching"], [ols_beta, huber_beta, asm_beta], [ols_hc1, huber_vcov, asm_vcov],)```The robust-score estimators no longer solve the OLS normal equations. Their point estimates and standard errors differ because $\psi(\hat\epsilon_i)$ is not proportional to $\hat\epsilon_i$. In this draw, OLS is pulled strongly by the contaminated high-leverage observations; the bounded and learned scores reduce the influence of large residuals, though they do not eliminate the leverage problem.```{python}#| label: contaminated-dgp-plot#| code-fold: true#| fig-width: 10#| fig-height: 5grid = np.linspace(x.min() -0.5, x.max() +0.5, 200)X_grid = add_intercept(grid)fig, ax = plt.subplots()ax.scatter(x[~is_bad], y[~is_bad], s=16, alpha=0.55, label="clean draws")ax.scatter(x[is_bad], y[is_bad], s=28, alpha=0.9, label="contaminated draws")ax.plot(grid, X_grid @ ols_beta, linewidth=2, label="OLS")ax.plot(grid, X_grid @ huber_beta, linewidth=2, label="Huber-score M")ax.plot(grid, X_grid @ asm_beta, linewidth=2, label="antitonic score matching")ax.plot(grid, 1.0+2.0* grid, color="black", linestyle="--", linewidth=1.5, label="clean target")ax.set_xlabel("x")ax.set_ylabel("y")ax.legend(loc="best")ax.set_title("Leverage contamination changes the residual-score estimating equation")plt.show()```## Convex Losses and Decreasing ScoresThe loss $\ell$ is convex if and only if $\ell'$ is increasing. Since $\psi = -\ell'$, convexity is equivalent to $\psi$ being decreasing. This turns the problem into a constrained score approximation problem:$$\psi_0^*\in\arg\min_{\psi \in \Psi_\downarrow}V_p(\psi),$$where $\Psi_\downarrow$ is the cone of decreasing score-like functions.The paper's key result is that this optimal convex M-estimation score can be obtained by score matching. In population, the score-matching objective is$$D_p(\psi)=E[\psi(\epsilon)^2 + 2\psi'(\epsilon)].$$Restricting $\psi$ to the decreasing cone and minimizing $D_p(\psi)$ yields the same projected score $\psi_0^*$ that minimizes the asymptotic variance among convex M-estimators.In this setting, score matching is a way to choose an estimating equation, not a generative-model training objective.## Antitonic ProjectionThe population construction is elegant. Let $F$ be the CDF of the error distribution and define the density quantile function$$J(u) = p(F^{-1}(u)),\qquadu \in [0,1].$$Let $\hat J$ be the least concave majorant of $J$. The optimal decreasing projected score is the right derivative of this concave majorant, pulled back to the residual scale:$$\psi_0^*(z)=\hat J^{(R)}(F(z)).$$When the true score $\psi_0$ already decreases, the error density is log-concave and the projection gives back the likelihood score. When the true score is not decreasing, the projection produces the closest decreasing score in the Fisher-divergence geometry relevant for the convex M-estimation problem.For Cauchy errors, the likelihood score is nonmonotone enough to make the likelihood loss nonconvex. The projected score is bounded and Huber-like, so the resulting loss is convex and robust. The paper reports that, for standard Cauchy errors, this restriction to convex losses still retains high asymptotic efficiency relative to the oracle Cauchy maximum likelihood estimator.## Sample AlgorithmThe practical algorithm has five parts.1. Fit a pilot regression, typically LAD or OLS.2. Form residuals from that pilot fit.3. Smooth the residual distribution with a kernel density estimator.4. Estimate the residual score and project it onto the decreasing cone using PAVA.5. Fit the final regression by convex M-estimation using the negative antiderivative of the projected score.In notation, with pilot residuals $\hat\epsilon_i$, form a smoothed density estimate $\tilde p_n$. The raw score estimate is$$\tilde\psi_n(z)=\frac{\tilde p_n'(z)}{\tilde p_n(z)}.$$The antitonic score estimate is$$\hat\psi_n=M_\downarrow(\tilde\psi_n \circ \tilde F_n^{-1}) \circ \tilde F_n,$$where $M_\downarrow$ denotes the decreasing projection and $\tilde F_n$ is the CDF associated with the smoothed density. Computationally, this is done by evaluating the density quantile function on a grid and running the pool-adjacent-violators algorithm.For symmetric errors, one can symmetrize the score estimate:$$\hat\psi_n^{\text{sym}}(z)=\frac{\hat\psi_n(z) - \hat\psi_n(-z)}{2}.$$The final estimator solves$$\sum_{i=1}^nX_i\hat\psi_n(Y_i - X_i'\beta)= 0.$$Equivalently, it minimizes a convex empirical risk. In code this can be solved by Newton or damped Newton steps because the learned score supplies both the gradient and a piecewise derivative.## InferenceThe antitonic information is$$i^*(p)=E[\psi_0^*(\epsilon)^2].$$For symmetric errors without a nuisance intercept complication, the asymptotic covariance is$$\frac{1}{n}\left[ i^*(p) E(X_iX_i')\right]^{-1}.$$The natural plug-in estimator is$$\hat i=\frac{1}{n}\sum_{i=1}^n\hat\psi_n(\hat\epsilon_i)^2,$$and$$\hat I=\hat i\frac{X'X}{n}.$$Then a Wald interval for coordinate $j$ is$$\hat\beta_j\pmz_{1-\alpha/2}\sqrt{ \frac{(\hat I^{-1})_{jj}}{n}}.$$The full paper develops cross-fitting and intercept handling to justify this asymptotically. Practical implementations can use the same plug-in logic while keeping the numerical algorithm simpler.## Minimal Software FormA minimal software interface needs two components:```pythonfrom trex import AntitonicScoreMatchingRegression, estimate_antitonic_score```The lower-level function `estimate_antitonic_score` takes residuals and returns a callable piecewise-linear score estimate. It:- computes a Gaussian KDE on a grid;- constructs the smoothed CDF;- evaluates the density quantile function;- projects the discrete derivative onto the decreasing cone with PAVA;- optionally symmetrizes the result.The regression estimator then:- builds an intercept column if requested;- computes an OLS or LAD pilot;- estimates the antitonic residual score;- solves the final convex M-estimation equations by damped Newton steps;- stores coefficients, residuals, fitted values, antitonic information, covariance estimates, and Wald intervals.This is narrower than the full R package `asm`. It omits the paper's cross-fitted theoretical estimator, asymmetric intercept selection, diagnostic plots, and prediction intervals. It implements the central practical mechanism: data-driven convex robust regression from residual score matching.## Score-Based SimulationScore matching also appears in modern generative modeling, including diffusion models. The connection is worth understanding because it is not intrinsically about producing pictures. The useful abstraction is a way to represent, perturb, and sample from complicated distributions through their score functions.In the regression setting above, the score is a one-dimensional residual score used to choose an estimating equation. The output is $\hat\beta$, standard errors, and confidence intervals. The score is a tool for efficient semiparametric estimation.In score-based simulation, the score is usually a high-dimensional vector field$$s_\theta(x,t)\approx\nabla_x \log p_t(x),$$where $p_t$ is a noise-corrupted version of a target distribution. The object $x$ need not be an image. It can be a vector of covariates, a panel history, a vector of latent shocks, a discretized policy function, or simulated microdata satisfying structural restrictions.The basic construction is:1. Start from draws $x_0$ from a target distribution.2. Add noise through a known stochastic path, producing $x_t$.3. Train a model to estimate the score $\nabla_x \log p_t(x)$ of the noised distribution.4. Use the learned score to move draws from high noise back toward the target distribution.A DDPM is one discretized version of this idea. It trains a model to predict the Gaussian noise added to $x_0$. That noise prediction implicitly estimates the score because$$\nabla_{x_t}\log q(x_t \mid x_0)=-\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}.$$The statistical analogy is:- residual score matching estimates a one-dimensional score to choose a robust estimating equation;- diffusion-style score modeling estimates a high-dimensional score to build a simulator;- both avoid normalized likelihoods and work directly with derivatives of log density.The estimand determines whether this is useful. For statistical estimation, the score must improve an estimator or an inferential procedure. For simulation, the score must generate objects whose distributional or structural properties can be checked against a clear target. Without that target, score-based generation is just an impressive sampler.## Structural Econometrics AngleThe antitonic score-matching paper suggests a useful discipline for score-based structural work. The score model should be tied to a statistical target, not only to sample quality.For simulated method of moments, the classical objective is$$Q(\theta)=\left[ m_{\text{data}} - E_\theta m(Y)\right]'W\left[ m_{\text{data}} - E_\theta m(Y)\right].$$A score-based model can enter as an auxiliary simulator or proposal mechanism, but the target remains moment fit and inference on $\theta$. The analogy to antitonic score matching is:- learn a score from data or residuals;- constrain it enough to preserve an estimation target;- use the learned score to improve optimization or simulation;- report uncertainty through the original estimand, not through visual fit alone.For structural models, possible uses include:- robust residual-score estimation in linearized or local GMM steps;- score-guided draws of latent shocks conditional on observed choices;- score-based proposals for dynamic discrete-choice histories;- moment-guided reverse sampling where structural residuals guide the score path;- adaptive convex losses for simulated auxiliary regressions.The useful discipline is to keep the score model attached to an estimand. In regression, that estimand is $\beta_0$. In SMM, it is $\theta$. In a generative application, it may be a distributional simulator, but then the evaluation target should be stated explicitly.For example, in a dynamic discrete-choice model, a score-based simulator could generate latent shock histories conditional on observed state-action paths. The structural model would still impose Bellman inequalities, transition laws, and counterfactual policy restrictions. The learned score would be a computational device for exploring the latent space, not a substitute for the structural model.Similarly, in SMM, a score-based sampler could propose simulated datasets or latent variables that are closer to the empirical support than naive Monte Carlo draws. The SMM criterion would still be evaluated through moments, weights, and sampling error:$$\hat\theta=\arg\min_\theta\left[ \hat m_{\text{data}} - \hat m_{\theta}\right]'W\left[ \hat m_{\text{data}} - \hat m_{\theta}\right].$$This is the serious reason to care about diffusion-style machinery in econometrics: not because the canonical demos generate faces, but because the same score-based machinery can act as an adaptive simulator, proposal distribution, or inverse-problem solver for objects that are hard to sample directly.## References- Oliver Y. Feng, Yu-Chun Kao, Min Xu, and Richard J. Samworth, "Optimal convex M-estimation via score matching," arXiv:2403.16688, 2025 revision.- `asm`: "Optimal Convex M-Estimation for Linear Regression via Antitonic Score Matching," R package version 0.2.4.- Aapo Hyvarinen, "Estimation of Non-Normalized Statistical Models by Score Matching," 2005.- Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations," 2021.