DDPM Diffusion for Tabular Rows and Structural Econometrics

A design note for Trex simulation DGPs and corrupt-and-undo estimators

Purpose

This note spells out the DDPM-style diffusion model currently used for tabular rows in Trex and sketches how the same corrupt-and-undo structure might become useful for structural econometric models. The immediate object is a simulation DGP: learn the distribution of observed rows and draw synthetic rows. The more interesting research question is whether diffusion gives a useful computational pattern for fitting models whose likelihoods, equilibrium mappings, or dynamic programs are awkward.

The key idea is simple:

Take a clean object $x_0$.
Corrupt it through a known stochastic path until it is close to noise.
Train a network to undo one corruption step, or to predict the noise that was added.
Use the trained denoiser as a simulator, sampler, or regularized inverse map.

For tabular simulation, $x_0$ is a row of standardized covariates and outcomes. For structural models, $x_0$ could be a latent utility shock, a vector of actions and states, a policy object, a value-function residual, or a latent type.

This note follows the score-based perspective in Yang Song’s 2021 overview, Generative Modeling by Estimating Gradients of the Data Distribution. That post is useful here because it strips diffusion models down to the central object: learn the score $\nabla_x \log p_t(x)$ of progressively noise-perturbed data distributions, then use that learned vector field to move samples from noise back toward data.

Score-Based Fundamentals

The score of a density $p(x)$ is

\[ s(x) = \nabla_x \log p(x). \]

It points in the local direction of increasing log density. A score model $s_\theta(x)$ tries to approximate this vector field without requiring a normalized likelihood. This is the key computational advantage over a generic likelihood model: the score is insensitive to the unknown normalizing constant.

If a good score model is available, Langevin dynamics can sample from the corresponding distribution using only the score:

\[ x_{k+1} = x_k + \eta s_\theta(x_k) + \sqrt{2\eta} z_k, \qquad z_k \sim \mathcal{N}(0,I). \]
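The Langevin update can be checked on a toy target whose score is known in closed form. The sketch below is illustrative only: the Gaussian target, step size, and function names are assumptions, and the exact score stands in for a trained $s_\theta$.

```python
import numpy as np

def langevin_sample(score, x_init, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + eta * score(x) + sqrt(2 * eta) * z."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + step_size * score(x) + np.sqrt(2.0 * step_size) * z
    return x

# Toy target (assumed for illustration): N(mu, sigma^2), whose score is -(x - mu) / sigma^2.
mu, sigma = 2.0, 0.5
gaussian_score = lambda x: -(x - mu) / sigma**2

rng = np.random.default_rng(0)
x_init = 3.0 * rng.standard_normal(5000)   # deliberately poor initialization
samples = langevin_sample(gaussian_score, x_init, step_size=1e-3, n_steps=2000, rng=rng)
```

With a small step size the chain settles near the target mean and standard deviation; a finite step leaves a small discretization bias.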

In a high-dimensional problem, directly estimating the score of the raw data distribution is hard. Data often live near a lower-dimensional set, and score matching puts most weight where the data density is already high. The score can be poor in low-density regions, exactly where a sampler initialized from noise begins. The score-based solution is to train on many noise-perturbed versions of the data. Large noise levels smooth the distribution and make global directions easier to learn; small noise levels recover local detail.

With finite noise levels $\sigma_1 < \cdots < \sigma_L$, the model estimates

\[ s_\theta(x,\sigma_\ell) \approx \nabla_x \log p_{\sigma_\ell}(x), \]

where $p_{\sigma_\ell}$ is the data distribution convolved with Gaussian noise. Sampling then starts at the highest noise level and anneals down. This is annealed Langevin dynamics:

\[ x_{k+1} = x_k + \eta_\ell s_\theta(x_k,\sigma_\ell) + \sqrt{2\eta_\ell}z_k, \qquad \ell = L,L-1,\ldots,1. \]
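A minimal sketch of the annealing loop, assuming a toy data distribution with two equally weighted point masses at $\pm 4$ so that every perturbed score is available exactly. The step-size rule $\eta_\ell \propto \sigma_\ell^2$ and all numerical values are assumptions made for illustration, not Trex settings.

```python
import numpy as np

def mixture_score(x, sigma, m=4.0):
    """Exact score of 0.5 * N(-m, sigma^2) + 0.5 * N(m, sigma^2)."""
    return (m * np.tanh(m * x / sigma**2) - x) / sigma**2

def annealed_langevin(score, sigmas, n_samples, steps_per_level, eps, rng):
    sigmas = np.sort(np.asarray(sigmas, dtype=float))   # sigma_1 < ... < sigma_L
    x = sigmas[-1] * rng.standard_normal(n_samples)     # start at the widest noise level
    for sigma in sigmas[::-1]:                          # l = L, L-1, ..., 1
        eta = eps * sigma**2 / sigmas[0] ** 2           # step size scaled with the noise level
        for _ in range(steps_per_level):
            z = rng.standard_normal(n_samples)
            x = x + eta * score(x, sigma) + np.sqrt(2.0 * eta) * z
    return x

rng = np.random.default_rng(0)
sigmas = np.geomspace(0.1, 10.0, 10)
samples = annealed_langevin(mixture_score, sigmas, n_samples=2000,
                            steps_per_level=100, eps=1e-3, rng=rng)
```

The high-noise levels move draws across the whole line, and the low-noise levels then tighten them around the two modes.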

Continuous-time score-based models replace the finite noise ladder with an SDE:

\[ dx = f(x,t)dt + g(t)dw. \]

The forward SDE maps data at $t=0$ into a tractable noise distribution at $t=T$. The reverse-time SDE is

\[ dx = \left[ f(x,t) - g(t)^2 \nabla_x \log p_t(x) \right]dt + g(t)d\bar w, \]

solved backward from $T$ to $0$. Once $\nabla_x \log p_t(x)$ is replaced by $s_\theta(x,t)$, this becomes a generative sampler. The same marginals can also be generated by the probability-flow ODE:

\[ dx = \left[ f(x,t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x) \right]dt. \]

The ODE view matters because it connects score models to continuous normalizing flows and likelihood computation. The SDE view matters because it connects directly to stochastic sampling, inverse problems, and constraint-guided generation.
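The probability-flow ODE can be integrated in closed-form settings as a sanity check. The sketch below assumes a variance-exploding SDE with $f = 0$ and $\sigma(t) = \sigma_{\max} t$, Gaussian data with standard deviation `S0`, and a plain Euler solver; these choices are assumptions made for illustration.

```python
import numpy as np

S0, SIGMA_MAX = 0.5, 5.0   # toy data std and terminal noise scale (assumed values)

def g_squared(t):
    """g(t)^2 = d sigma(t)^2 / dt for sigma(t) = SIGMA_MAX * t."""
    return 2.0 * SIGMA_MAX**2 * t

def score(x, t):
    """Exact score of p_t = N(0, S0^2 + sigma(t)^2)."""
    return -x / (S0**2 + (SIGMA_MAX * t) ** 2)

def probability_flow_sample(x_T, n_steps=1000):
    """Euler integration of dx = -0.5 * g(t)^2 * score(x, t) dt from t = 1 down to t = 0."""
    dt = 1.0 / n_steps
    x = np.array(x_T, dtype=float)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * g_squared(t) * score(x, t)   # f(x, t) = 0 in this example
        x = x - dt * drift                          # step backward in time
    return x

rng = np.random.default_rng(0)
x_T = np.sqrt(S0**2 + SIGMA_MAX**2) * rng.standard_normal(10_000)
x_0 = probability_flow_sample(x_T)
```

The map from `x_T` to `x_0` is deterministic, and the standard deviation of `x_0` lands close to `S0`.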

Current Trex Implementation

The relevant Trex file is trex/simdgp.py. The diffusion path is deliberately small:

TabularTransformer: standardizes continuous variables, preserves binary columns, and inverts generated rows back to data scale.
_TimeEmbedding: maps integer diffusion times into sinusoidal features.
_Denoiser: neural network that receives a noisy row, a time embedding, and optional context.
TabularDiffusion: estimator interface with fit, sample, schedule setup, and context handling.
distribution_metrics: diagnostic function for generated versus real samples.

The current implementation is a compact DDPM. It is not a full diffusion framework yet: there are no callbacks, no pluggable schedules, no constraint projections beyond postprocessing, and no structural residual losses. That is the right starting point for a simulation-DGP primitive. The research extensions below describe what a richer version could look like.

Data Representation

Let a tabular row be $x \in \mathbb{R}^d$. Trex uses a preprocessing map $S$:

\[ z = S(x). \]

For continuous columns, $S$ subtracts the empirical mean and divides by the empirical standard deviation. For binary columns, $S$ leaves the value on the $0/1$ scale. This keeps the neural model from being dominated by earnings columns while still making binary treatment and demographic columns interpretable.

Generation works on standardized rows $z$. After sampling, Trex applies $S^{-1}$ and then:

rounds binary columns to $0/1$;
clips nonnegative columns at zero.

This is a practical compromise. A richer tabular diffusion model might use discrete diffusion for binary columns, monotone transforms for skewed earnings, or separate heads by column type. The current version keeps the model simple and makes the benchmark easy to audit.
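The preprocessing map and its postprocessed inverse can be sketched in a few lines of NumPy. The function names and signatures below are hypothetical and are not the TabularTransformer API; the sketch only mirrors the behavior described above (identity scaling for binary columns, rounding and clipping after inversion).

```python
import numpy as np

def fit_scaler(X, binary_cols):
    """Means and stds for continuous columns; identity scaling for binary columns."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    std = np.where(std > 0, std, 1.0)          # guard against constant columns
    mean[binary_cols], std[binary_cols] = 0.0, 1.0
    return mean, std

def transform(X, mean, std):
    return (X - mean) / std

def inverse_transform(Z, mean, std, binary_cols, nonneg_cols):
    X = Z * std + mean
    # Round binary columns, then clip so stray values such as 2 or -1 land on 0/1.
    X[:, binary_cols] = np.clip(np.round(X[:, binary_cols]), 0.0, 1.0)
    X[:, nonneg_cols] = np.maximum(X[:, nonneg_cols], 0.0)
    return X

# Columns: treatment (binary), age, earnings (both nonnegative). Values are made up.
X = np.array([[1.0, 25.0, 0.0], [0.0, 40.0, 1000.0], [1.0, 31.0, 5000.0]])
mean, std = fit_scaler(X, binary_cols=[0])
Z = transform(X, mean, std)
X_back = inverse_transform(Z, mean, std, binary_cols=[0], nonneg_cols=[1, 2])
```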

Forward Corruption Process

DDPM starts with a fixed noising schedule. Choose positive numbers

\[ \beta_1,\ldots,\beta_T, \]

with small $\beta_t$ early and larger $\beta_t$ later. Define

\[ \alpha_t = 1 - \beta_t, \qquad \bar \alpha_t = \prod_{s=1}^{t}\alpha_s. \]

For a clean standardized row $x_0$, the forward process has the closed form

\[ x_t = \sqrt{\bar \alpha_t}x_0 + \sqrt{1-\bar \alpha_t}\epsilon, \qquad \epsilon \sim \mathcal{N}(0,I). \]

This is useful because we do not need to simulate every earlier step during training. We can sample a time $t$, sample noise $\epsilon$, construct $x_t$ directly, and train the model to recover $\epsilon$.
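A small NumPy check of the closed form, assuming a linear schedule with endpoints $10^{-4}$ and $0.02$ (the values used in the DDPM paper; Trex defaults may differ) and the standard one-step kernel $x_s = \sqrt{1-\beta_s}\,x_{s-1} + \sqrt{\beta_s}\,\epsilon_s$. Function names are hypothetical.

```python
import numpy as np

def linear_schedule(n_timesteps, beta_start=1e-4, beta_end=0.02):
    betas = np.linspace(beta_start, beta_end, n_timesteps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Closed-form draw of x_t given x_0 (t is a 0-based index into the schedule)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps, eps

def q_sample_stepwise(x0, t, betas, rng):
    """Same marginal, obtained by applying the one-step kernel t + 1 times."""
    x = x0.copy()
    for s in range(t + 1):
        x = np.sqrt(1.0 - betas[s]) * x + np.sqrt(betas[s]) * rng.standard_normal(x.shape)
    return x

rng = np.random.default_rng(0)
betas, alphas, alpha_bars = linear_schedule(1000)
x0 = np.full(20_000, 2.0)
x_closed, _ = q_sample(x0, 299, alpha_bars, rng)
x_iterated = q_sample_stepwise(x0, 299, betas, rng)
```

The two sets of draws agree in mean and spread up to Monte Carlo error, which is why training can jump straight to a random $t$.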

In Trex this appears in TabularDiffusion.fit:

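# Draw one random timestep per row (0-based index into the schedule).
t = torch.randint(0, self.n_timesteps, (x0.shape[0],), device=self.device)
noise = torch.randn_like(x0)
# Look up sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t), shaped to broadcast over columns.
sqrt_alpha_bar = self.sqrt_alpha_bars[t].unsqueeze(1)
sqrt_one_minus = self.sqrt_one_minus_alpha_bars[t].unsqueeze(1)
# Closed-form forward corruption, then regress the predicted noise on the true noise.
x_t = sqrt_alpha_bar * x0 + sqrt_one_minus * noise
pred_noise = self.denoiser(x_t, t, ctx)
loss = F.mse_loss(pred_noise, noise)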

The learning problem is therefore:

\[ \min_\theta \mathbb{E}_{x_0,t,\epsilon} \left[ \lVert \epsilon - \epsilon_\theta(x_t,t,c) \rVert_2^2 \right], \]

where $c$ is optional context. In an unconditional DGP, $c$ is empty. In a conditional DGP, $c$ might be treatment assignment, market, cohort, state, or any other conditioning variable.

This noise-prediction objective is also a score-estimation objective. Conditional on $x_0$, the Gaussian perturbation density has score

\[ \nabla_{x_t}\log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar \alpha_t}x_0}{1-\bar \alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar \alpha_t}}. \]

If the network predicts $\epsilon_\theta(x_t,t,c)$, it implicitly gives a score estimate

\[ s_\theta(x_t,t,c) \approx -\frac{\epsilon_\theta(x_t,t,c)}{\sqrt{1-\bar \alpha_t}}. \]

That is the direct bridge between the compact DDPM implementation in Trex and the broader score-based framework. The code trains a denoiser, but the denoiser is learning the vector field that points noisy rows back toward high-density regions of the data distribution.
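The bridge can be verified numerically in a case where the marginal score is known. The sketch assumes one-dimensional Gaussian data and a single noise level; because the MSE-optimal noise predictor is then linear in $x_t$, a least-squares slope stands in for the trained network. All names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
s_data, alpha_bar = 2.0, 0.4                  # toy data std and one noise level (assumed)
x0 = s_data * rng.standard_normal(200_000)
eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Regress the noise on the noisy row: the denoising objective restricted to linear predictors.
slope = np.dot(x_t, eps) / np.dot(x_t, x_t)
eps_hat = lambda x: slope * x

# Convert the noise prediction into a score estimate and compare with the exact marginal score.
score_from_eps = lambda x: -eps_hat(x) / np.sqrt(1.0 - alpha_bar)
marginal_var = alpha_bar * s_data**2 + (1.0 - alpha_bar)
true_score = lambda x: -x / marginal_var

grid = np.linspace(-3.0, 3.0, 7)
max_gap = np.max(np.abs(score_from_eps(grid) - true_score(grid)))
```

Although each regression target is the conditional noise given $x_0$, the fitted predictor recovers the score of the marginal $p_t$, up to sampling error.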

Reverse Sampling Process

At sampling time, start from Gaussian noise:

\[ x_T \sim \mathcal{N}(0,I). \]

Then run the denoiser backward:

\[ \mu_\theta(x_t,t,c) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(x_t,t,c) \right). \]

Trex uses the simple DDPM reverse step:

\[ x_{t-1} = \mu_\theta(x_t,t,c) + \sqrt{\beta_t}\eta_t, \qquad \eta_t \sim \mathcal{N}(0,I), \]

except at the final step, where no new noise is added. In code:

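# Walk the schedule backward; `step` is the 0-based index of diffusion time t.
for step in reversed(range(self.n_timesteps)):
    t = torch.full((n,), step, device=self.device, dtype=torch.long)
    beta = self.betas[t].unsqueeze(1)
    alpha = self.alphas[t].unsqueeze(1)
    alpha_bar = self.alpha_bars[t].unsqueeze(1)
    pred_noise = self.denoiser(x, t, c)
    # Reverse-step mean mu_theta(x_t, t, c) computed from the predicted noise.
    mean = (x - beta / torch.sqrt(1 - alpha_bar) * pred_noise) / torch.sqrt(alpha)
    if step > 0:
        # Add fresh Gaussian noise with variance beta_t.
        x = mean + torch.sqrt(beta) * torch.randn_like(x)
    else:
        # Final step: return the mean without extra noise.
        x = mean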

The current sampler is intentionally literal. Possible upgrades are deterministic DDIM sampling, learned variance, predictor-corrector steps, or constraint-aware projections after each step.
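The reverse loop can be exercised without a trained network when the data law is Gaussian, because the optimal noise predictor is then available in closed form. The NumPy sketch below mirrors the loop above under that assumption; the schedule endpoints, the toy law $\mathcal{N}(m, s^2)$, and the function names are illustrative, not Trex code.

```python
import numpy as np

def ddpm_sample(eps_model, betas, n, rng):
    """Ancestral DDPM sampling with reverse-step variance beta_t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(n)                          # x_T ~ N(0, 1)
    for step in reversed(range(len(betas))):
        pred_noise = eps_model(x, step)
        mean = (x - betas[step] / np.sqrt(1.0 - alpha_bars[step]) * pred_noise) / np.sqrt(alphas[step])
        noise = rng.standard_normal(n) if step > 0 else 0.0   # no noise at the final step
        x = mean + np.sqrt(betas[step]) * noise
    return x

# Toy data law N(m, s^2) (assumed), for which E[eps | x_t] is linear in x_t.
m, s = 3.0, 2.0
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bars = np.cumprod(1.0 - betas)

def optimal_eps(x, step):
    ab = alpha_bars[step]
    return np.sqrt(1.0 - ab) * (x - np.sqrt(ab) * m) / (ab * s**2 + 1.0 - ab)

samples = ddpm_sample(optimal_eps, betas, n=20_000, rng=np.random.default_rng(0))
```

With the exact predictor the draws come out close to $\mathcal{N}(3, 2^2)$, so any remaining gap in a real run can be attributed to the learned denoiser rather than to the sampler.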

Structured Objects: Images Versus Tables

Diffusion became famous on images because images have strong structure that neural networks can exploit:

pixels live on a grid;
nearby pixels have local dependence;
patterns repeat under translation;
coarse-to-fine semantic structure matches the high-noise to low-noise sampling path;
U-Net architectures and convolutional residual blocks encode useful inductive bias.

Images are also usually treated as continuous after dequantization. This makes Gaussian corruption mathematically natural even when the original pixel values are discrete integers.

Tabular rows are structured in a different way. A row has columns, not spatial locations. Column order is mostly arbitrary. Some columns are continuous, some are binary, some are counts, and some are categorical. Constraints can be hard: treatment must be $0/1$, earnings may be nonnegative, education is integer-valued, and logically impossible combinations should be excluded. The relevant structure is not locality on a grid, but joint dependence, support constraints, conditional moments, tail behavior, and known economic restrictions.

The present LaLonde application is therefore a modest but useful tabular test. The target object is one standardized row containing treatment, demographics, and earnings outcomes. The DDPM is not trying to create a realistic person in the rich image-generation sense. It is trying to learn a sampler whose generated rows preserve marginal distributions, pairwise dependence, and treatment-outcome relationships better than simpler baselines.

For structural econometrics, the structured object may be even less image-like:

a vector of latent taste shocks;
a panel path of states, actions, and outcomes;
a vector of simulated moments;
a discretized policy or value function;
a sequence of equilibrium residuals;
a joint object $(\theta, u, y)$ containing parameters, latent variables, and observables.

That is why the diffusion abstraction should be kept modular. The useful part is not the image architecture. The useful part is the corruption path, learned reverse map, and ability to inject restrictions during sampling.

Class and Method Map

`TabularTransformer`

Responsibilities:

infer column names from a pandas data frame when available;
identify binary, continuous, and nonnegative columns;
compute means and standard deviations;
transform raw rows into training scale;
inverse-transform generated rows into raw scale;
round binary outputs and clip nonnegative outputs.

Important methods:

fit(data): stores column metadata and scaling statistics.
transform(data): returns standardized numeric rows.
fit_transform(data): convenience method.
inverse_transform(values, sample_binary=False, random_state=None): returns raw-scale rows.
transformed_bounds(): gives transformed lower and upper bounds for constrained generators.

For diffusion, TabularTransformer is not part of the neural model. It is part of the statistical contract: the model sees numerically stable rows, and the user gets valid rows back.

`_TimeEmbedding`

Responsibilities:

map an integer time step $t$ into a vector of sinusoidal features;
give the denoiser a smooth representation of diffusion time.

The embedding is analogous to transformer positional encodings:

\[ e(t) = (\sin(t\omega_1),\ldots,\sin(t\omega_k), \cos(t\omega_1),\ldots,\cos(t\omega_k)). \]

This lets a single denoising network learn different behavior at low-noise and high-noise steps.
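A minimal NumPy version of such an embedding, assuming geometrically spaced frequencies as in transformer positional encodings. The actual frequency grid used by `_TimeEmbedding` is not specified here, so `max_period` and the function name are assumptions.

```python
import numpy as np

def sinusoidal_embedding(t, dim, max_period=10_000.0):
    """Map integer steps t (shape [n]) to [n, dim] features: sines first, then cosines.

    dim is assumed to be even; the k = dim // 2 frequencies decay geometrically from 1.
    """
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)   # omega_1, ..., omega_k
    angles = np.asarray(t, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

emb = sinusoidal_embedding([0, 1, 50], dim=8)
```

Each row has constant squared norm $k$, so the embedding changes direction with $t$ rather than scale.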

`_Denoiser`

Responsibilities:

concatenate noisy row $x_t$, optional context $c$, and time embedding $e(t)$;
predict the Gaussian noise $\epsilon$ added to the row.

Current architecture:

[x_t, context, time_embedding] -> MLP -> predicted_noise

This is deliberately boring. For tabular rows, a small MLP is often enough to test the idea. Richer architectures could add:

separate continuous and discrete heads;
residual blocks;
monotonicity constraints;
cross-feature attention;
group-specific embeddings;
score heads tied to moment restrictions.
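For orientation, the concatenate-then-MLP forward pass can be written out in NumPy. This is a shape-level sketch with random weights and ReLU activations; the activation, initialization, and function names are assumptions and do not describe the `_Denoiser` internals.

```python
import numpy as np

def init_mlp(in_dim, hidden_dims, out_dim, rng):
    """Random weights for a fully connected network (illustrative initialization)."""
    dims = [in_dim, *hidden_dims, out_dim]
    return [(rng.standard_normal((a, b)) / np.sqrt(a), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def denoiser_forward(params, x_t, time_emb, context=None):
    """Concatenate [x_t, context, time_embedding] and apply the MLP."""
    parts = [x_t, time_emb] if context is None else [x_t, context, time_emb]
    h = np.concatenate(parts, axis=1)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)     # ReLU on hidden layers; the output layer stays linear
    return h

rng = np.random.default_rng(0)
n, d, ctx_dim, time_dim = 4, 6, 2, 8
params = init_mlp(d + ctx_dim + time_dim, hidden_dims=(16, 16), out_dim=d, rng=rng)
pred_noise = denoiser_forward(params,
                              x_t=rng.standard_normal((n, d)),
                              time_emb=rng.standard_normal((n, time_dim)),
                              context=rng.standard_normal((n, ctx_dim)))
```

The output has the same shape as the noisy row, since the network predicts one noise value per column.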

`TabularDiffusion`

Responsibilities:

own the diffusion schedule;
own the denoising network;
implement the estimator lifecycle;
expose fit and sample.

Constructor parameters:

hidden_dims: MLP widths;
time_dim: dimension of the time embedding;
n_timesteps: number of diffusion steps;
beta_start, beta_end: linear noise schedule endpoints;
batch_size: mini-batch size;
max_steps: optimizer steps;
lr: AdamW learning rate;
dropout: MLP dropout;
seed: reproducibility;
device: CPU or CUDA target.

Methods:

fit(X, context=None): trains the denoiser on noised rows.
sample(n, context=None): draws synthetic standardized rows.
_setup_schedule(): creates $\beta_t$, $\alpha_t$, $\bar \alpha_t$, and square-root cache tensors.
_context_tensor(context, n): validates or creates conditioning rows.

The API mirrors other Trex estimators: fit on arrays or data-frame values, sample tensors, and let the calling benchmark handle the domain-specific postprocessing.

Callback Design for a Richer Diffusion Framework

The current Trex diffusion class has no callback system. That is fine for the small benchmark, but structural extensions would benefit from hooks. A sensible callback protocol would be:

class DiffusionCallback:
    def on_fit_start(self, state): ...
    def on_batch_start(self, state): ...
    def on_loss_ready(self, state): ...
    def on_step_end(self, state): ...
    def on_sample_start(self, state): ...
    def on_sample_step(self, state): ...
    def on_sample_end(self, state): ...

The state object should expose:

model;
optimizer;
current clean rows x0;
current noisy rows x_t;
current time index t;
context c;
predicted noise;
base denoising loss;
auxiliary losses;
generated sample during reverse diffusion;
diagnostic storage.

Useful callback classes:

MomentRestrictionCallback: adds losses for empirical moments, conditional moments, or score equations.
ConstraintProjectionCallback: projects samples back into feasible sets after each reverse step.
DiscreteColumnCallback: handles binary or categorical variables with logits instead of Gaussian residuals.
MonotonicityCallback: penalizes violations of monotonic structural restrictions.
BellmanResidualCallback: adds dynamic-programming residual penalties.
LikelihoodBridgeCallback: combines denoising loss with tractable pieces of a likelihood.
DiagnosticsCallback: records Wasserstein, KS, covariance, and correlation metrics during training.
EarlyStoppingCallback: stops when validation denoising loss or distribution metrics stop improving.

This kind of callback system would keep the base diffusion sampler clean while letting structural problems inject economics-specific restrictions.
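One way such a protocol could be wired together is sketched below. Nothing here exists in Trex: the state fields are a subset of the list above, and the two callbacks are toy versions of the constraint-projection and moment-restriction ideas, written only to show how hooks would touch the state.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class DiffusionState:
    x: np.ndarray                                   # current sample during reverse diffusion
    t: int = 0                                      # current time index
    base_loss: float = 0.0                          # denoising loss
    aux_losses: dict = field(default_factory=dict)  # penalties added by callbacks

class DiffusionCallback:
    """No-op base class; subclasses override only the hooks they need."""
    def on_loss_ready(self, state): ...
    def on_sample_step(self, state): ...

class ConstraintProjectionCallback(DiffusionCallback):
    """Toy projection: enforce a lower bound on selected columns after a reverse step."""
    def __init__(self, cols, lower):
        self.cols, self.lower = cols, lower
    def on_sample_step(self, state):
        state.x[:, self.cols] = np.maximum(state.x[:, self.cols], self.lower)

class MomentRestrictionCallback(DiffusionCallback):
    """Toy moment penalty: squared gap between the batch mean and a target mean."""
    def __init__(self, target_mean, weight):
        self.target_mean, self.weight = np.asarray(target_mean, dtype=float), weight
    def on_loss_ready(self, state):
        gap = state.x.mean(axis=0) - self.target_mean
        state.aux_losses["mean_moment"] = self.weight * float(gap @ gap)

def run_hook(callbacks, hook, state):
    for cb in callbacks:
        getattr(cb, hook)(state)

def total_loss(state):
    return state.base_loss + sum(state.aux_losses.values())

callbacks = [ConstraintProjectionCallback(cols=[1], lower=0.0),
             MomentRestrictionCallback(target_mean=[0.0, 1.0], weight=2.0)]
state = DiffusionState(x=np.array([[-1.0, 2.0], [0.5, -3.0]]), base_loss=1.0)
run_hook(callbacks, "on_sample_step", state)
run_hook(callbacks, "on_loss_ready", state)
```

Keeping every hook a no-op by default means the base sampler behaves identically when the callback list is empty.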

Diffusion-Assisted Simulated Method of Moments

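Classical simulated method of moments chooses parameters by matching observed moments to simulated moments. From this section on, $\theta$ denotes structural parameters rather than network weights. With data moments $m_{\text{data}}$, simulated moments $\mathbb{E}_\theta[m(Y)]$, and a positive-definite weighting matrix $W$, the criterion is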

\[ Q(\theta) = \left(m_{\text{data}} - \mathbb{E}_\theta[m(Y)]\right)' W \left(m_{\text{data}} - \mathbb{E}_\theta[m(Y)]\right). \]
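The criterion can be made concrete with a toy simulator. The sketch below assumes a location model $Y = \theta + \text{shock}$, two moments (mean and second moment), an identity weighting matrix, and a grid search; each of these is an illustrative assumption, and the expectation $\mathbb{E}_\theta[m(Y)]$ is replaced by a simulation average with common random numbers.

```python
import numpy as np

def moments(y):
    return np.array([y.mean(), (y**2).mean()])

def simulate(theta, shocks):
    """Toy structural simulator (assumed): Y = theta + shock."""
    return theta + shocks

def smm_criterion(theta, m_data, shocks, W):
    gap = m_data - moments(simulate(theta, shocks))
    return float(gap @ W @ gap)

rng = np.random.default_rng(0)
theta_true = 1.5
y_obs = simulate(theta_true, rng.standard_normal(5_000))
m_data = moments(y_obs)

shocks = rng.standard_normal(20_000)      # common random numbers, reused for every theta
W = np.eye(2)
grid = np.linspace(0.0, 3.0, 301)
theta_hat = grid[np.argmin([smm_criterion(th, m_data, shocks, W) for th in grid])]
```

Reusing the same shocks across candidate parameters keeps the criterion smooth in $\theta$, which matters once the grid search is replaced by a gradient-based optimizer.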

The structural model supplies the simulation law $Y \sim P_\theta$. Diffusion should not replace that law if the goal is structural interpretation. The useful role is more modest: diffusion can be an auxiliary sampler, proposal distribution, amortized inverse map, or regularizer inside a simulation-based estimator.

There are several clean variants.

Diffusion as a Proposal Simulator

Train a diffusion model to generate latent variables or paths that are likely under the observed data. Use those draws as proposals for importance sampling, indirect inference, or SMM. The structural simulator still evaluates or reweights the proposals. This is attractive when naive simulation wastes draws in regions that cannot rationalize the observed outcomes.

Score-Guided Moment Matching

Let $R_\theta(x)$ be a structural penalty, such as a moment discrepancy, Bellman residual, revealed-preference violation, or market-clearing residual. During sampling, guide the reverse process by combining the learned data score with a structural gradient:

\[ s_{\text{guided}}(x,t;\theta) = s_\phi(x,t) - \gamma_t \nabla_x R_\theta(x). \]

Here $\phi$ denotes the denoiser's parameters, to keep them distinct from the structural parameters $\theta$. The learned score keeps samples near the empirical support. The structural gradient nudges them toward economically admissible regions. This resembles classifier or inverse-problem guidance in score-based models, but the guidance comes from moments or equilibrium restrictions rather than image labels.
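A one-dimensional sketch of the guidance rule at a single noise level, assuming the data score is that of $\mathcal{N}(0,1)$ and the structural penalty is $R(x) = \tfrac{1}{2}(x-1)^2$. Both are stand-ins chosen so the guided target is known: with $\gamma = 1$ it is $\mathcal{N}(0.5, 0.5)$. In a full sampler the weight $\gamma_t$ would vary along the reverse path.

```python
import numpy as np

def guided_langevin(data_score, penalty_grad, gamma, x_init, step_size, n_steps, rng):
    """Langevin dynamics on the guided score: data score minus gamma times the penalty gradient."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        guided = data_score(x) - gamma * penalty_grad(x)
        x = x + step_size * guided + np.sqrt(2.0 * step_size) * rng.standard_normal(x.shape)
    return x

data_score = lambda x: -x            # stand-in for a learned score of N(0, 1)
penalty_grad = lambda x: x - 1.0     # gradient of R(x) = 0.5 * (x - 1)^2

rng = np.random.default_rng(0)
draws = guided_langevin(data_score, penalty_grad, gamma=1.0,
                        x_init=rng.standard_normal(10_000),
                        step_size=0.01, n_steps=1000, rng=rng)
```

The draws shift from the data mean of 0 toward the penalty minimum at 1 and tighten, which is the compromise the guidance weight controls.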

Joint Denoising and Moment Losses

Training can add a moment penalty to the denoising objective:

\[ \mathcal{L}(\phi;\theta) = \mathbb{E} \left[ \lVert \epsilon - \epsilon_\phi(x_t,t,c) \rVert_2^2 \right] + \lambda \left\lVert \hat m_{\phi,\theta} - m_{\text{data}} \right\rVert_W^2. \]

This is a hybrid estimator. It is no longer a pure generative model, so it needs careful validation. But it gives a direct way to train a sampler whose reverse path respects both observed support and target moments.

Amortized Structural Inversion

Many structural estimators repeatedly solve inverse problems: find shocks that rationalize choices, infer latent types from panels, or recover residual paths consistent with observed actions. A conditional diffusion model can learn

\[ p(u \mid y,x,\theta), \]

where $u$ are latent shocks or states. Once trained, the model gives fast conditional draws for each candidate $\theta$. This could reduce the cost of SMM or likelihood-free estimation, especially in dynamic models where the expensive step is repeatedly generating admissible latent histories.

The important discipline is to keep the structural moments explicit. A diffusion-based SMM workflow should report the usual moment fit, weighting matrix, Monte Carlo error, and counterfactual checks. The denoiser is a computational device, not an identification argument.

Cross-Sectional Structural Models

Consider a static discrete choice model. A simple utility specification is

\[ u_{ij} = x_{ij}'\beta + \epsilon_{ij}, \]

and observed choice is

\[ y_i = \arg\max_j u_{ij}. \]

The standard likelihood integrates over an assumed distribution for $\epsilon_i$, such as type-1 extreme value for logit or Gaussian for probit. A diffusion-style approach could instead define a corruption process over latent utilities, shocks, or choice-probability logits.

Possible objects to diffuse:

Latent utilities. Corrupt $u_i$ and train a denoiser conditional on observed $x_i$ and $y_i$.
Shocks. Corrupt $\epsilon_i$ and learn a conditional shock distribution given choices.
Choice logits. Corrupt logits on the simplex, with a projection or softmax map back to probabilities.
Parameters plus latent variables. Treat $(\beta, \epsilon_{1:n})$ as a joint object and train a denoiser conditional on observed choices.

A structural callback could add the revealed-preference inequality:

\[ u_{i,y_i} \ge u_{ij} \qquad \text{for all } j. \]

The denoising model would then learn to undo corruption while respecting the observed choice event. This resembles simulation-based inference, but with a learned reverse transition rather than hand-designed importance sampling.
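The revealed-preference restriction can be turned into a differentiable-style penalty that a structural callback might add. The sketch below simulates logit choices and evaluates a squared hinge penalty; the penalty form, the coefficient values, and the function names are assumptions for illustration.

```python
import numpy as np

def choice_violation(u, y):
    """Squared hinge penalty for violations of u[i, y_i] >= u[i, j] for all j."""
    chosen = u[np.arange(len(y)), y][:, None]
    return np.sum(np.maximum(u - chosen, 0.0) ** 2, axis=1)

def logit_probs(v):
    """Row-wise softmax of deterministic utilities."""
    e = np.exp(v - v.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, J = 50_000, 3
beta = np.array([1.0, -0.5])                  # utility coefficients (assumed)
x = rng.standard_normal((n, J, 2))
eps = rng.gumbel(size=(n, J))                 # type-1 extreme value shocks (logit case)
u = x @ beta + eps                            # latent utilities, shape [n, J]
y = u.argmax(axis=1)                          # observed choices

clean_penalty = choice_violation(u, y)
noisy_penalty = choice_violation(u + rng.standard_normal(u.shape), y)
shares = np.bincount(y, minlength=J) / n
model_shares = logit_probs(x @ beta).mean(axis=0)
```

The penalty is exactly zero on utilities that rationalize the observed choices and positive once corruption breaks the ordering, which is the signal a denoiser conditioned on $y_i$ would need to undo.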

Dynamic Discrete Choice

Dynamic discrete choice adds states, actions, and continuation values. A canonical structure is:

\[ v(s) = \mathbb{E}_\epsilon \left[ \max_a \left\{ u(s,a;\theta) + \epsilon_a + \beta \mathbb{E}[v(s') \mid s,a] \right\} \right]. \]

Here $\beta$ is the discount factor, not the utility coefficient from the static model above. Observed data contain transitions $(s_t,a_t,s_{t+1})$. The hard part is that likelihood evaluation often requires solving or approximating the fixed point for $v$.

Diffusion suggests several alternatives.

Diffuse Value Residuals

Let $r_v(s)$ be a Bellman residual:

\[ r_v(s) = v(s) - \mathcal{T}_\theta v(s). \]

One could corrupt candidate value functions or residual fields and train a denoiser whose reverse steps move toward low Bellman residual. A callback would add:

\[ \lambda_v \lVert v - \mathcal{T}_\theta v \rVert^2. \]

This does not remove the economics. It turns the Bellman equation into a structural regularizer on a generative trajectory.
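A small worked case shows what the residual penalty measures. The sketch assumes type-1 extreme value shocks, so the Bellman operator is a log-sum-exp plus Euler's constant, and a five-state replacement problem whose payoffs and transitions are made up for illustration.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def bellman_operator(v, u, P, discount):
    """T v under type-1 extreme value shocks: Euler constant + log-sum-exp over actions."""
    q = u + discount * np.einsum("ast,t->sa", P, v)      # choice-specific values, shape [S, A]
    q_max = q.max(axis=1, keepdims=True)
    return EULER_GAMMA + q_max[:, 0] + np.log(np.exp(q - q_max).sum(axis=1))

def bellman_penalty(v, u, P, discount, weight=1.0):
    """weight * || v - T v ||^2, the residual penalty a callback could add."""
    r = v - bellman_operator(v, u, P, discount)
    return weight * float(r @ r)

# Toy replacement problem (all numbers assumed): state = machine age, actions = keep / replace.
S, discount = 5, 0.9
ages = np.arange(S)
u = np.column_stack([-0.3 * ages, -2.0 * np.ones(S)])   # u[s, a]
P = np.zeros((2, S, S))                                  # P[a, s, s']
P[0, ages, np.minimum(ages + 1, S - 1)] = 1.0            # keep: age advances by one
P[1, :, 0] = 1.0                                         # replace: age resets to zero

v = np.zeros(S)
for _ in range(500):                                     # value iteration to the fixed point
    v = bellman_operator(v, u, P, discount)

v_corrupted = v + 0.5 * np.random.default_rng(0).standard_normal(S)
penalty_fixed = bellman_penalty(v, u, P, discount)
penalty_corrupted = bellman_penalty(v_corrupted, u, P, discount)
```

The penalty vanishes at the fixed point and is strictly positive for the corrupted value function, so it gives the reverse process a direction even when no observed data pin down $v$ directly.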

Diffuse Latent Shocks Conditional on Actions

For observed action $a_t$, the latent shocks must rationalize:

\[ u(s_t,a_t;\theta) + \epsilon_{a_t} + \beta \mathbb{E}[v(s') \mid s_t,a_t] \ge u(s_t,a;\theta) + \epsilon_a + \beta \mathbb{E}[v(s') \mid s_t,a] \]

for each alternative $a$. A diffusion model over shocks conditional on $(s_t,a_t)$ could generate shock draws consistent with observed choices. This gives a simulation device for counterfactual policies and likelihood-free estimation.

Diffuse Policy Functions

Instead of modeling shocks, define a policy object $\pi(a \mid s)$. Corrupt logits for $\pi$ and train a denoiser that also satisfies:

observed action likelihood;
Bellman optimality or entropy-regularized optimality;
transition consistency;
shape restrictions or exclusion restrictions.

This turns policy estimation into a constrained generative problem over functions.

Why This Might Be Useful

The attraction is not that diffusion magically solves identification. It does not. The attraction is computational:

it gives a stable supervised loss even when the target simulator is complex;
it can mix observed-data losses with structural residual losses;
it can generate latent variables conditional on observed events;
it can amortize expensive inverse problems;
it exposes a natural path for constraints through callbacks and projections;
it may provide proposal distributions for simulation-based estimation.

For structural econometrics, the most plausible use is as an auxiliary sampler or amortized solver, not as a replacement for the model. The structural model should still define the restrictions. Diffusion supplies a flexible corrupt-and-undo computational skeleton.

The analogy to Song’s score-based inverse-problem framing is close. In imaging, the unconditional score keeps generated objects on the learned image manifold while a measurement model enforces consistency with observed pixels, MRI measurements, or other data. In structural applications, the unconditional or conditional tabular score keeps objects on the learned empirical support while moment equations, choice inequalities, Bellman equations, or equilibrium restrictions play the role of the measurement model.

Failure Modes

Important risks:

Validity drift. Gaussian diffusion over raw table columns can generate invalid discrete states or impossible choices unless constrained.
Identification confusion. A denoiser can match observed distributions without identifying structural primitives.
Over-regularized economics. Bellman or moment penalties can dominate the denoising loss and collapse diversity.
Simulation bias. A learned reverse process can introduce bias into downstream estimators if treated as exact.
Evaluation difficulty. Good marginal fit is not enough; structural counterfactuals need policy-relevant validation.

The safe research posture is to use diffusion as a simulator or proposal mechanism and then check structural moments, counterfactual invariances, and estimator performance directly.

Suggested Trex Roadmap

Near-term:

Keep TabularDiffusion simple and stable.
Add optional callbacks only after a second real use case appears.
Add schedule choices: linear, cosine, and possibly variance-preserving continuous time.
Add deterministic DDIM sampling for faster repeated Monte Carlo.
Add mixed-type heads for binary and categorical columns.

Structural experiments:

Cross-sectional logit/probit: diffuse latent utilities or shocks conditional on observed choices.
Static games: diffuse payoff shocks or equilibrium-selection residuals under inequality constraints.
Dynamic discrete choice: diffuse shock vectors, policy logits, or Bellman residual fields.
Compare against likelihood, simulated method of moments, and neural conditional moment baselines.

The first serious experiment should be small: a synthetic Rust-style replacement model or a two-action dynamic choice model where the true policy and value function are known. That gives a direct way to test whether the denoiser learns structural objects or merely fits observed paths.

Minimal API Sketch

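# Assumes both classes are importable from trex/simdgp.py and that
# TabularTransformer can be constructed with default arguments.
from trex.simdgp import TabularDiffusion, TabularTransformer

transformer = TabularTransformer()
diffusion = TabularDiffusion(
    hidden_dims=(128, 128),
    n_timesteps=100,
    batch_size=256,
    max_steps=2_000,
    lr=1e-3,
    device="cuda",  # use "cpu" when no GPU is available
)

z = transformer.fit_transform(df)                # df: observed rows as a pandas data frame
diffusion.fit(z, context=None)                   # unconditional DGP
z_fake = diffusion.sample(len(df))               # synthetic rows on the standardized scale
df_fake = transformer.inverse_transform(z_fake)  # back to raw scale, with rounding and clipping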

A future structural version could look like:

model = StructuralDiffusion(
    base_diffusion=TabularDiffusion(...),
    callbacks=[
        ChoiceInequalityCallback(utility=utility_fn),
        MomentRestrictionCallback(moments=g),
        DiagnosticsCallback(metrics=["ks", "wasserstein", "bellman_residual"]),
    ],
)

model.fit(observed_paths, context=states)
counterfactual_paths = model.sample_paths(policy=counterfactual_policy)

The design principle should be strict: the diffusion engine handles corruption and denoising; structural callbacks define economic restrictions.

References

Yang Song, “Generative Modeling by Estimating Gradients of the Data Distribution,” blog post, 2021. https://yang-song.net/blog/2021/score/
Jonathan Ho, Ajay Jain, and Pieter Abbeel, “Denoising Diffusion Probabilistic Models,” 2020.
Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-Based Generative Modeling through Stochastic Differential Equations,” 2021.

--- title: "DDPM Diffusion for Tabular Rows and Structural Econometrics" subtitle: "A design note for Trex simulation DGPs and corrupt-and-undo estimators" format: html: embed-resources: true page-layout: full toc: true toc-depth: 3 code-fold: true code-tools: true execute: echo: true warning: false message: false jupyter: python3 --- ## Purpose This note spells out the DDPM-style diffusion model currently used for tabular rows in Trex and sketches how the same corrupt-and-undo structure might become useful for structural econometric models. The immediate object is a simulation DGP: learn the distribution of observed rows and draw synthetic rows. The more interesting research question is whether diffusion gives a useful computational pattern for fitting models whose likelihoods, equilibrium mappings, or dynamic programs are awkward. The key idea is simple: 1. Take a clean object $x_0$. 2. Corrupt it through a known stochastic path until it is close to noise. 3. Train a network to undo one corruption step, or to predict the noise that was added. 4. Use the trained denoiser as a simulator, sampler, or regularized inverse map. For tabular simulation, $x_0$ is a row of standardized covariates and outcomes. For structural models, $x_0$ could be a latent utility shock, a vector of actions and states, a policy object, a value-function residual, or a latent type. This note follows the score-based perspective in Yang Song's 2021 overview, [Generative Modeling by Estimating Gradients of the Data Distribution](https://yang-song.net/blog/2021/score/). That post is useful here because it strips diffusion models down to the central object: learn the score $\nabla_x \log p_t(x)$ of progressively noise-perturbed data distributions, then use that learned vector field to move samples from noise back toward data. ## Score-Based Fundamentals The score of a density $p(x)$ is $$ s(x) = \nabla_x \log p(x). $$ It points in the local direction of increasing log density. 
A score model $s_\theta(x)$ tries to approximate this vector field without requiring a normalized likelihood. This is the key computational advantage over a generic likelihood model: the score is insensitive to the unknown normalizing constant. If a good score model is available, Langevin dynamics can sample from the corresponding distribution using only the score: $$ x_{k+1} = x_k + \eta s_\theta(x_k) + \sqrt{2\eta} z_k, \qquad z_k \sim \mathcal{N}(0,I). $$ In a high-dimensional problem, directly estimating the score of the raw data distribution is hard. Data often live near a lower-dimensional set, and score matching puts most weight where the data density is already high. The score can be poor in low-density regions, exactly where a sampler initialized from noise begins. The score-based solution is to train on many noise-perturbed versions of the data. Large noise levels smooth the distribution and make global directions easier to learn; small noise levels recover local detail. With finite noise levels $\sigma_1 < \cdots < \sigma_L$, the model estimates $$ s_\theta(x,\sigma_\ell) \approx \nabla_x \log p_{\sigma_\ell}(x), $$ where $p_{\sigma_\ell}$ is the data distribution convolved with Gaussian noise. Sampling then starts at the highest noise level and anneals down. This is annealed Langevin dynamics: $$ x_{k+1} = x_k + \eta_\ell s_\theta(x_k,\sigma_\ell) + \sqrt{2\eta_\ell}z_k, \qquad \ell = L,L-1,\ldots,1. $$ Continuous-time score-based models replace the finite noise ladder with an SDE: $$ dx = f(x,t)dt + g(t)dw. $$ The forward SDE maps data at $t=0$ into a tractable noise distribution at $t=T$. The reverse-time SDE is $$ dx = \left[ f(x,t) - g(t)^2 \nabla_x \log p_t(x) \right]dt + g(t)d\bar w, $$ solved backward from $T$ to $0$. Once $\nabla_x \log p_t(x)$ is replaced by $s_\theta(x,t)$, this becomes a generative sampler. 
The same marginals can also be generated by the probability-flow ODE: $$ dx = \left[ f(x,t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x) \right]dt. $$ The ODE view matters because it connects score models to continuous normalizing flows and likelihood computation. The SDE view matters because it connects directly to stochastic sampling, inverse problems, and constraint-guided generation. ## Current Trex Implementation The relevant Trex file is `trex/simdgp.py`. The diffusion path is deliberately small: - `TabularTransformer`: standardizes continuous variables, preserves binary columns, and inverts generated rows back to data scale. - `_TimeEmbedding`: maps integer diffusion times into sinusoidal features. - `_Denoiser`: neural network that receives a noisy row, a time embedding, and optional context. - `TabularDiffusion`: estimator interface with `fit`, `sample`, schedule setup, and context handling. - `distribution_metrics`: diagnostic function for generated versus real samples. The current implementation is a compact DDPM. It is not a full diffusion framework yet: there are no callbacks, no pluggable schedules, no constraint projections beyond postprocessing, and no structural residual losses. That is the right starting point for a simulation-DGP primitive. The research extensions below describe what a richer version could look like. ## Data Representation Let a tabular row be $x \in \mathbb{R}^d$. Trex uses a preprocessing map $S$: $$ z = S(x). $$ For continuous columns, $S$ subtracts the empirical mean and divides by the empirical standard deviation. For binary columns, $S$ leaves the value on the $0/1$ scale. This keeps the neural model from being dominated by earnings columns while still making binary treatment and demographic columns interpretable. Generation works on standardized rows $z$. After sampling, Trex applies $S^{-1}$ and then: - rounds binary columns to $0/1$; - clips nonnegative columns at zero. This is a practical compromise. 
A richer tabular diffusion model might use discrete diffusion for binary columns, monotone transforms for skewed earnings, or separate heads by column type. The current version keeps the model simple and makes the benchmark easy to audit.

## Forward Corruption Process

DDPM starts with a fixed noising schedule. Choose numbers in $(0,1)$

$$
\beta_1,\ldots,\beta_T,
$$

with small $\beta_t$ early and larger $\beta_t$ later. Define

$$
\alpha_t = 1 - \beta_t, \qquad \bar \alpha_t = \prod_{s=1}^{t}\alpha_s.
$$

For a clean standardized row $x_0$, the forward process has the closed form

$$
x_t = \sqrt{\bar \alpha_t}x_0 + \sqrt{1-\bar \alpha_t}\epsilon, \qquad \epsilon \sim \mathcal{N}(0,I).
$$

This is useful because we do not need to simulate every earlier step during training. We can sample a time $t$, sample noise $\epsilon$, construct $x_t$ directly, and train the model to recover $\epsilon$.

In Trex this appears in `TabularDiffusion.fit`:

```python
# One random diffusion step per row (0-indexed in code; t = 1..T in the math).
t = torch.randint(0, self.n_timesteps, (x0.shape[0],), device=self.device)
noise = torch.randn_like(x0)
# Cached sqrt(alpha_bar_t) and sqrt(1 - alpha_bar_t), unsqueezed to broadcast over columns.
sqrt_alpha_bar = self.sqrt_alpha_bars[t].unsqueeze(1)
sqrt_one_minus = self.sqrt_one_minus_alpha_bars[t].unsqueeze(1)
# Closed-form forward corruption of the clean rows.
x_t = sqrt_alpha_bar * x0 + sqrt_one_minus * noise
# Noise-prediction loss.
pred_noise = self.denoiser(x_t, t, ctx)
loss = F.mse_loss(pred_noise, noise)
```

The learning problem is therefore:

$$
\min_\theta \mathbb{E}_{x_0,t,\epsilon} \left[ \lVert \epsilon - \epsilon_\theta(x_t,t,c) \rVert_2^2 \right],
$$

where $c$ is optional context. In an unconditional DGP, $c$ is empty. In a conditional DGP, $c$ might be treatment assignment, market, cohort, state, or any other conditioning variable.

This noise-prediction objective is also a score-estimation objective. Conditional on $x_0$, the Gaussian perturbation density has score

$$
\nabla_{x_t}\log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar \alpha_t}x_0}{1-\bar \alpha_t} = -\frac{\epsilon}{\sqrt{1-\bar \alpha_t}}.
$$

If the network predicts $\epsilon_\theta(x_t,t,c)$, it implicitly gives a score estimate

$$
s_\theta(x_t,t,c) \approx -\frac{\epsilon_\theta(x_t,t,c)}{\sqrt{1-\bar \alpha_t}}.
$$

That is the direct bridge between the compact DDPM implementation in Trex and the broader score-based framework. The code trains a denoiser, but the denoiser is learning the vector field that points noisy rows back toward high-density regions of the data distribution.

## Reverse Sampling Process

At sampling time, start from Gaussian noise:

$$
x_T \sim \mathcal{N}(0,I).
$$

Then run the denoiser backward. At each step, the predicted noise gives the reverse-step mean:

$$
\mu_\theta(x_t,t,c) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar \alpha_t}} \epsilon_\theta(x_t,t,c) \right).
$$

Trex uses the simple DDPM reverse step:

$$
x_{t-1} = \mu_\theta(x_t,t,c) + \sqrt{\beta_t}\eta_t, \qquad \eta_t \sim \mathcal{N}(0,I),
$$

except at the final step, where no new noise is added.

In code:

```python
for step in reversed(range(self.n_timesteps)):
    t = torch.full((n,), step, device=self.device, dtype=torch.long)
    beta = self.betas[t].unsqueeze(1)
    alpha = self.alphas[t].unsqueeze(1)
    alpha_bar = self.alpha_bars[t].unsqueeze(1)
    pred_noise = self.denoiser(x, t, c)
    # Reverse-step mean mu_theta(x_t, t, c).
    mean = (x - beta / torch.sqrt(1 - alpha_bar) * pred_noise) / torch.sqrt(alpha)
    if step > 0:
        # Add fresh noise with variance beta_t on every step but the last.
        x = mean + torch.sqrt(beta) * torch.randn_like(x)
    else:
        x = mean
```

The current sampler is intentionally literal. Possible upgrades are deterministic DDIM sampling, learned variance, predictor-corrector steps, or constraint-aware projections after each step.

## Structured Objects: Images Versus Tables

Diffusion became famous on images because images have strong structure that neural networks can exploit:

- pixels live on a grid;
- nearby pixels have local dependence;
- patterns repeat under translation;
- coarse-to-fine semantic structure matches the high-noise to low-noise sampling path;
- U-Net architectures and convolutional residual blocks encode useful inductive bias.

Images are also usually treated as continuous after dequantization. This makes Gaussian corruption mathematically natural even when the original pixel values are discrete integers.

Tabular rows are structured in a different way. A row has columns, not spatial locations. Column order is mostly arbitrary. Some columns are continuous, some are binary, some are counts, and some are categorical. Constraints can be hard: treatment must be $0/1$, earnings must be nonnegative, education is integer-valued, and logically impossible combinations should be excluded. The relevant structure is not locality on a grid, but joint dependence, support constraints, conditional moments, tail behavior, and known economic restrictions.

The present LaLonde application is therefore a modest but useful tabular test. The target object is one standardized row containing treatment, demographics, and earnings outcomes. The DDPM is not trying to create a realistic person in the rich image-generation sense. It is trying to learn a sampler whose generated rows preserve marginal distributions, pairwise dependence, and treatment-outcome relationships better than simpler baselines.

For structural econometrics, the structured object may be even less image-like:

- a vector of latent taste shocks;
- a panel path of states, actions, and outcomes;
- a vector of simulated moments;
- a discretized policy or value function;
- a sequence of equilibrium residuals;
- a joint object $(\theta, u, y)$ containing parameters, latent variables, and observables.

That is why the diffusion abstraction should be kept modular. The useful part is not the image architecture. The useful part is the corruption path, learned reverse map, and ability to inject restrictions during sampling.
## Class and Method Map

### `TabularTransformer`

Responsibilities:

- infer column names from a pandas data frame when available;
- identify binary, continuous, and nonnegative columns;
- compute means and standard deviations;
- transform raw rows into training scale;
- inverse-transform generated rows into raw scale;
- round binary outputs and clip nonnegative outputs.

Important methods:

- `fit(data)`: stores column metadata and scaling statistics.
- `transform(data)`: returns standardized numeric rows.
- `fit_transform(data)`: convenience method.
- `inverse_transform(values, sample_binary=False, random_state=None)`: returns raw-scale rows.
- `transformed_bounds()`: gives transformed lower and upper bounds for constrained generators.

For diffusion, `TabularTransformer` is not part of the neural model. It is part of the statistical contract: the model sees numerically stable rows, and the user gets valid rows back.

### `_TimeEmbedding`

Responsibilities:

- map an integer time step $t$ into a vector of sinusoidal features;
- give the denoiser a smooth representation of diffusion time.

The embedding is analogous to transformer positional encodings:

$$
e(t) = (\sin(t\omega_1),\ldots,\sin(t\omega_k), \cos(t\omega_1),\ldots,\cos(t\omega_k)).
$$

This lets a single denoising network learn different behavior at low-noise and high-noise steps.

### `_Denoiser`

Responsibilities:

- concatenate noisy row $x_t$, optional context $c$, and time embedding $e(t)$;
- predict the Gaussian noise $\epsilon$ added to the row.

Current architecture:

```text
[x_t, context, time_embedding] -> MLP -> predicted_noise
```

This is deliberately boring. For tabular rows, a small MLP is often enough to test the idea. Richer architectures could add:

- separate continuous and discrete heads;
- residual blocks;
- monotonicity constraints;
- cross-feature attention;
- group-specific embeddings;
- score heads tied to moment restrictions.
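
The embedding $e(t)$ and the input concatenation described above can be sketched in a few lines of pure Python. This is an illustration, not the Trex code: the function names `time_embedding` and `denoiser_input` are hypothetical, and the geometric frequency ladder (with `max_period=10000`) is an assumed transformer-style choice, since the note does not specify the $\omega_i$.

```python
import math


def time_embedding(t, dim, max_period=10000.0):
    """Return [sin(t*w_1), ..., sin(t*w_k), cos(t*w_1), ..., cos(t*w_k)], k = dim // 2.

    The frequencies w_i = max_period ** (-i / k) are an assumed choice;
    the actual Trex frequencies may differ.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    k = dim // 2
    freqs = [max_period ** (-i / k) for i in range(k)]
    return [math.sin(t * w) for w in freqs] + [math.cos(t * w) for w in freqs]


def denoiser_input(x_t, context, t, time_dim):
    """Concatenate the noisy row, optional context, and the time embedding."""
    return list(x_t) + list(context or []) + time_embedding(t, time_dim)


features = denoiser_input(x_t=[0.3, -1.2, 1.0], context=[1.0], t=17, time_dim=8)
print(len(features))  # 3 row values + 1 context value + 8 time features
```

Because the features vary smoothly in $t$, adjacent diffusion steps receive similar inputs, which is what lets one MLP share parameters across noise levels.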
### `TabularDiffusion`

Responsibilities:

- own the diffusion schedule;
- own the denoising network;
- implement the estimator lifecycle;
- expose `fit` and `sample`.

Constructor parameters:

- `hidden_dims`: MLP widths;
- `time_dim`: dimension of the time embedding;
- `n_timesteps`: number of diffusion steps;
- `beta_start`, `beta_end`: linear noise schedule endpoints;
- `batch_size`: mini-batch size;
- `max_steps`: optimizer steps;
- `lr`: AdamW learning rate;
- `dropout`: MLP dropout;
- `seed`: reproducibility;
- `device`: CPU or CUDA target.

Methods:

- `fit(X, context=None)`: trains the denoiser on noised rows.
- `sample(n, context=None)`: draws synthetic standardized rows.
- `_setup_schedule()`: creates $\beta_t$, $\alpha_t$, $\bar \alpha_t$, and square-root cache tensors.
- `_context_tensor(context, n)`: validates or creates conditioning rows.

The API mirrors other Trex estimators: fit on arrays or data-frame values, sample tensors, and let the calling benchmark handle the domain-specific postprocessing.

## Callback Design for a Richer Diffusion Framework

The current Trex diffusion class has no callback system. That is fine for the small benchmark, but structural extensions would benefit from hooks.

A sensible callback protocol would be:

```python
class DiffusionCallback:
    def on_fit_start(self, state): ...
    def on_batch_start(self, state): ...
    def on_loss_ready(self, state): ...
    def on_step_end(self, state): ...
    def on_sample_start(self, state): ...
    def on_sample_step(self, state): ...
    def on_sample_end(self, state): ...
```

The `state` object should expose:

- model;
- optimizer;
- current clean rows `x0`;
- current noisy rows `x_t`;
- current time index `t`;
- context `c`;
- predicted noise;
- base denoising loss;
- auxiliary losses;
- generated sample during reverse diffusion;
- diagnostic storage.

Useful callback classes:

- `MomentRestrictionCallback`: adds losses for empirical moments, conditional moments, or score equations.
- `ConstraintProjectionCallback`: projects samples back into feasible sets after each reverse step.
- `DiscreteColumnCallback`: handles binary or categorical variables with logits instead of Gaussian residuals.
- `MonotonicityCallback`: penalizes violations of monotonic structural restrictions.
- `BellmanResidualCallback`: adds dynamic-programming residual penalties.
- `LikelihoodBridgeCallback`: combines denoising loss with tractable pieces of a likelihood.
- `DiagnosticsCallback`: records Wasserstein, KS, covariance, and correlation metrics during training.
- `EarlyStoppingCallback`: stops when validation denoising loss or distribution metrics stop improving.

This kind of callback system would keep the base diffusion sampler clean while letting structural problems inject economics-specific restrictions.

## Diffusion-Assisted Simulated Method of Moments

Classical simulated method of moments chooses parameters by matching observed moments to simulated moments. With data moments $m_{\text{data}}$ and simulated moments $m_\theta = \mathbb{E}_\theta[m(Y)]$, the criterion is

$$
Q(\theta) = \left(m_{\text{data}} - \mathbb{E}_\theta[m(Y)]\right)' W \left(m_{\text{data}} - \mathbb{E}_\theta[m(Y)]\right),
$$

where $W$ is a weighting matrix.

The structural model supplies the simulation law $Y \sim P_\theta$. Diffusion should not replace that law if the goal is structural interpretation. The useful role is more modest: diffusion can be an auxiliary sampler, proposal distribution, amortized inverse map, or regularizer inside a simulation-based estimator.

There are several clean variants.

### Diffusion as a Proposal Simulator

Train a diffusion model to generate latent variables or paths that are likely under the observed data. Use those draws as proposals for importance sampling, indirect inference, or SMM. The structural simulator still evaluates or reweights the proposals.

This is attractive when naive simulation wastes draws in regions that cannot rationalize the observed outcomes.
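
A minimal importance-sampling sketch of this idea, in pure Python. Everything in it is a hypothetical stand-in: the structural shock law is $\mathcal{N}(0,1)$, the observed outcome is assumed to require $u > 2$, and a Gaussian $\mathcal{N}(2.5,1)$ plays the role of the learned proposal. A real diffusion proposal would also need a tractable density, for example through the probability-flow ODE, before weights like these could be computed.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def tail_prob_importance(threshold=2.0, proposal_mu=2.5, n_draws=20000, seed=0):
    """Estimate P(u > threshold) for u ~ N(0, 1) using draws from N(proposal_mu, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        u = rng.gauss(proposal_mu, 1.0)  # proposal draw (stand-in for a diffusion sampler)
        if u > threshold:  # the draw rationalizes the observed outcome
            # Reweight by structural density over proposal density.
            total += normal_pdf(u) / normal_pdf(u, mu=proposal_mu)
    return total / n_draws


def tail_prob_naive(threshold=2.0, n_draws=20000, seed=0):
    """Same target, estimated by simulating directly from the structural law."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0.0, 1.0) > threshold for _ in range(n_draws))
    return hits / n_draws


exact = 0.5 * math.erfc(2.0 / math.sqrt(2.0))  # closed-form N(0, 1) tail probability
is_estimate = tail_prob_importance()
naive_estimate = tail_prob_naive()
print(f"exact={exact:.5f}, importance={is_estimate:.5f}, naive={naive_estimate:.5f}")
```

Only about 2% of naive draws land in the relevant region, while most proposal draws do. The weights correct for the mismatch, so the estimate still targets the structural law and not the proposal.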
### Score-Guided Moment Matching

Let $R_\theta(x)$ be a structural penalty, such as a moment discrepancy, Bellman residual, revealed-preference violation, or market-clearing residual. During sampling, guide the reverse process by combining the learned data score with a structural gradient:

$$
s_{\text{guided}}(x,t;\theta) = s_\phi(x,t) - \gamma_t \nabla_x R_\theta(x).
$$

Here $s_\phi$ is the learned score, written with denoiser parameters $\phi$ to keep them distinct from the structural parameters $\theta$, and $\gamma_t$ is a guidance weight.

The denoiser keeps samples near the empirical support. The structural gradient nudges them toward economically admissible regions. This resembles classifier or inverse-problem guidance in score-based models, but the guidance comes from moments or equilibrium restrictions rather than image labels.

### Joint Denoising and Moment Losses

Training can add a moment penalty to the denoising objective:

$$
\mathcal{L}(\phi;\theta) = \mathbb{E} \left[ \lVert \epsilon - \epsilon_\phi(x_t,t,c) \rVert_2^2 \right] + \lambda \left\lVert \hat m_{\phi,\theta} - m_{\text{data}} \right\rVert_W^2.
$$

This is a hybrid estimator. It is no longer a pure generative model, so it needs careful validation. But it gives a direct way to train a sampler whose reverse path respects both observed support and target moments.

### Amortized Structural Inversion

Many structural estimators repeatedly solve inverse problems: find shocks that rationalize choices, infer latent types from panels, or recover residual paths consistent with observed actions. A conditional diffusion model can learn

$$
p(u \mid y,x,\theta),
$$

where $u$ are latent shocks or states. Once trained, the model gives fast conditional draws for each candidate $\theta$. This could reduce the cost of SMM or likelihood-free estimation, especially in dynamic models where the expensive step is repeatedly generating admissible latent histories.

The important discipline is to keep the structural moments explicit. A diffusion-based SMM workflow should report the usual moment fit, weighting matrix, Monte Carlo error, and counterfactual checks.
The denoiser is a computational device, not an identification argument.

## Cross-Sectional Structural Models

Consider a static discrete choice model. A simple utility specification is

$$
u_{ij} = x_{ij}'\beta + \epsilon_{ij},
$$

and observed choice is

$$
y_i = \arg\max_j u_{ij}.
$$

The standard likelihood assumes a distribution for $\epsilon_i$ and integrates over it. A diffusion-style approach could instead define a corruption process over latent utilities, shocks, or choice-probability logits.

Possible objects to diffuse:

1. **Latent utilities.** Corrupt $u_i$ and train a denoiser conditional on observed $x_i$ and $y_i$.
2. **Shocks.** Corrupt $\epsilon_i$ and learn a conditional shock distribution given choices.
3. **Choice logits.** Corrupt unconstrained logits, with a softmax map back to probabilities on the simplex.
4. **Parameters plus latent variables.** Treat $(\beta, \epsilon_{1:n})$ as a joint object and train a denoiser conditional on observed choices.

A structural callback could add the revealed-preference inequality:

$$
u_{i,y_i} \ge u_{ij} \qquad \text{for all } j.
$$

The denoising model would then learn to undo corruption while respecting the observed choice event. This resembles simulation-based inference, but with a learned reverse transition rather than hand-designed importance sampling.

## Dynamic Discrete Choice

Dynamic discrete choice adds states, actions, and continuation values. A canonical structure is:

$$
v(s) = \mathbb{E}_\epsilon \left[ \max_a \left\{ u(s,a;\theta) + \epsilon_a + \beta \mathbb{E}[v(s') \mid s,a] \right\} \right].
$$

In this section $\beta$ is the discount factor, not the utility coefficient of the previous section or the noise schedule $\beta_t$. Observed data contain transitions $(s_t,a_t,s_{t+1})$. The hard part is that likelihood evaluation often requires solving or approximating the fixed point for $v$.

Diffusion suggests several alternatives.

### Diffuse Value Residuals

Let $\mathcal{T}_\theta$ denote the Bellman operator given by the right-hand side of the fixed-point equation above, and let $r_v(s)$ be the Bellman residual:

$$
r_v(s) = v(s) - (\mathcal{T}_\theta v)(s).
$$

One could corrupt candidate value functions or residual fields and train a denoiser whose reverse steps move toward low Bellman residual. A callback would add:

$$
\lambda_v \lVert v - \mathcal{T}_\theta v \rVert^2.
$$

This does not remove the economics. It turns the Bellman equation into a structural regularizer on a generative trajectory.

### Diffuse Latent Shocks Conditional on Actions

For observed action $a_t$, the latent shocks must rationalize:

$$
u(s_t,a_t;\theta) + \epsilon_{a_t} + \beta \mathbb{E}[v(s') \mid s_t,a_t] \ge u(s_t,a;\theta) + \epsilon_a + \beta \mathbb{E}[v(s') \mid s_t,a]
$$

for each alternative $a$.

A diffusion model over shocks conditional on $(s_t,a_t)$ could generate shock draws consistent with observed choices. This gives a simulation device for counterfactual policies and likelihood-free estimation.

### Diffuse Policy Functions

Instead of modeling shocks, define a policy object $\pi(a \mid s)$. Corrupt logits for $\pi$ and train a denoiser that also satisfies:

- observed action likelihood;
- Bellman optimality or entropy-regularized optimality;
- transition consistency;
- shape restrictions or exclusion restrictions.

This turns policy estimation into a constrained generative problem over functions.

## Why This Might Be Useful

The attraction is not that diffusion magically solves identification. It does not. The attraction is computational:

- it gives a stable supervised loss even when the target simulator is complex;
- it can mix observed-data losses with structural residual losses;
- it can generate latent variables conditional on observed events;
- it can amortize expensive inverse problems;
- it exposes a natural path for constraints through callbacks and projections;
- it may provide proposal distributions for simulation-based estimation.

For structural econometrics, the most plausible use is as an auxiliary sampler or amortized solver, not as a replacement for the model.
The structural model should still define the restrictions. Diffusion supplies a flexible corrupt-and-undo computational skeleton.

The analogy to Song's score-based inverse-problem framing is close. In imaging, the unconditional score keeps generated objects on the learned image manifold while a measurement model enforces consistency with observed pixels, MRI measurements, or other data. In structural applications, the unconditional or conditional tabular score keeps objects on the learned empirical support while moment equations, choice inequalities, Bellman equations, or equilibrium restrictions play the role of the measurement model.

## Failure Modes

Important risks:

- **Validity drift.** Gaussian diffusion over raw table columns can generate invalid discrete states or impossible choices unless constrained.
- **Identification confusion.** A denoiser can match observed distributions without identifying structural primitives.
- **Over-regularized economics.** Bellman or moment penalties can dominate the denoising loss and collapse diversity.
- **Simulation bias.** A learned reverse process can introduce bias into downstream estimators if treated as exact.
- **Evaluation difficulty.** Good marginal fit is not enough; structural counterfactuals need policy-relevant validation.

The safe research posture is to use diffusion as a simulator or proposal mechanism and then check structural moments, counterfactual invariances, and estimator performance directly.

## Suggested Trex Roadmap

Near-term:

1. Keep `TabularDiffusion` simple and stable.
2. Add optional callbacks only after a second real use case appears.
3. Add schedule choices: linear, cosine, and possibly variance-preserving continuous time.
4. Add deterministic DDIM sampling for faster repeated Monte Carlo.
5. Add mixed-type heads for binary and categorical columns.

Structural experiments:

1. Cross-sectional logit/probit: diffuse latent utilities or shocks conditional on observed choices.
2. Static games: diffuse payoff shocks or equilibrium-selection residuals under inequality constraints.
3. Dynamic discrete choice: diffuse shock vectors, policy logits, or Bellman residual fields.
4. Compare against likelihood, simulated method of moments, and neural conditional moment baselines.

The first serious experiment should be small: a synthetic Rust-style replacement model or a two-action dynamic choice model where the true policy and value function are known. That gives a direct way to test whether the denoiser learns structural objects or merely fits observed paths.

## Minimal API Sketch

```python
# Import path assumed from the file location named above (trex/simdgp.py).
from trex.simdgp import TabularDiffusion, TabularTransformer

# df: pandas DataFrame of observed rows, loaded elsewhere.
transformer = TabularTransformer()  # assumes the default constructor arguments suffice

diffusion = TabularDiffusion(
    hidden_dims=(128, 128),
    n_timesteps=100,
    batch_size=256,
    max_steps=2_000,
    lr=1e-3,
    device="cuda",
)
z = transformer.fit_transform(df)
diffusion.fit(z, context=None)
z_fake = diffusion.sample(len(df))
df_fake = transformer.inverse_transform(z_fake)
```

A future structural version could look like:

```python
# Hypothetical API: StructuralDiffusion and these callbacks do not exist yet.
model = StructuralDiffusion(
    base_diffusion=TabularDiffusion(...),
    callbacks=[
        ChoiceInequalityCallback(utility=utility_fn),
        MomentRestrictionCallback(moments=g),
        DiagnosticsCallback(metrics=["ks", "wasserstein", "bellman_residual"]),
    ],
)
model.fit(observed_paths, context=states)
counterfactual_paths = model.sample_paths(policy=counterfactual_policy)
```

The design principle should be strict: the diffusion engine handles corruption and denoising; structural callbacks define economic restrictions.

## References

- Yang Song, [Generative Modeling by Estimating Gradients of the Data Distribution](https://yang-song.net/blog/2021/score/), 2021.
- Jonathan Ho, Ajay Jain, and Pieter Abbeel, "Denoising Diffusion Probabilistic Models," 2020.
- Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, "Score-Based Generative Modeling through Stochastic Differential Equations," 2021.