---
title: "Diffusion Models for Realistic Tabular Simulation"
author: "Apoorva Lal"
date: today
bibliography: references.bib
format:
html:
embed-resources: true
page-layout: full
toc: true
toc-depth: 3
code-fold: true
code-tools: true
execute:
echo: true
warning: false
message: false
cache: true
jupyter: torch
---
## Abstract
Monte Carlo studies in econometrics are often analytically transparent but empirically thin: functional form, covariate distribution, error distribution, treatment assignment, and support restrictions are chosen for tractability rather than for fidelity to the empirical problems that make estimators fail. Athey, Imbens, Metzger, and Munro (2024) propose learning simulation designs from real datasets with Wasserstein GANs and then perturbing the fitted environment [@athey2024wgan]. This memo reruns that idea on local Lalonde/CPS data and compares a conditional WGAN-GP with a conditional DDPM under the same preprocessing, treatment vector, diagnostics, and regression-adjustment estimand. The useful conclusion is not that DDPM uniformly dominates WGAN. In these runs, WGAN is closer on the regression-adjustment ATT, while DDPM is much closer on the joint row-distribution metrics. For simulation design, that split is the point: a learned DGP should be judged by whether it preserves the empirical geometry that makes estimators succeed or fail, not only by whether one downstream estimand happens to line up in one draw.
## Evidence snapshot
The rendered results below are based on executed Quarto/Python chunks, not just prose. The source data are the local `ds-wgan` Lalonde/CPS files: `cps.feather` and `experimental.feather`. The convenience `cps_merged.feather` file is reconstructed from those sources when needed; it is not a separate data requirement.
Key outputs from this render:
- **Small sklearn held-out checks:** DDPM beats WGAN on sliced Wasserstein for iris, wine, breast-cancer, and diabetes feature tables; simple train-resampling remains competitive and is often better than both neural generators.
- **Lalonde/CPS single draw:** observed regression-adjustment ATT is **$690**. WGAN synthetic ATT is **$325**; DDPM synthetic ATT is **$1,888**. On the same draw, DDPM is much closer on row-distribution metrics, with sliced Wasserstein **0.928** versus WGAN **3.774**.
- **Repeated Lalonde/CPS draws (`B = 12`):** WGAN has lower mean absolute ATT error (**$356** vs **$773**), while DDPM has much lower mean sliced Wasserstein (**0.882** vs **3.684**) and lower marginal/correlation/covariance distances.
- **Treatment-share ablation:** changing the treatment vector from the observed 1.14% treated share to a balanced 500/500 design moves the DDPM ATT from **$1,888** to **$196**, illustrating how learned simulators can be used for controlled design perturbations rather than only synthetic-data mimicry.
Unsupported baselines are kept in their lane: copulas and flows are discussed as next baselines to implement, not as results; ICL is only a qualitative stress-test foil unless it is quantitatively scored.
## Introduction
Suppose an applied study observes rows
$$
X_i = (Y_i, D_i, W_i), \qquad i=1,\ldots,n,
$$
where $Y_i$ is an outcome, $D_i$ is a treatment or policy exposure, and $W_i$ collects pre-treatment covariates. A traditional Monte Carlo design specifies a distribution $P_0$ for $X_i$ and then evaluates estimators under repeated draws from $P_0$. The credibility of the exercise rests on whether $P_0$ captures the parts of the empirical problem that matter for the estimators being compared: overlap, tail behavior, discreteness, collinearity, heteroskedasticity, treatment imbalance, and nonlinear dependence.
The AIMM paper reframes the problem as a learned simulation design. Let $P_n$ be the empirical distribution of a real dataset. Fit a generator $\hat G$ so that samples from $\hat P_G$ look like draws from $P_n$. Then use $\hat P_G$ as the baseline simulation environment. This does not estimate structural primitives; it reduces arbitrary choices in simulation design while preserving the empirical geometry of the application.
This note uses the same objective and asks how the fitting engine changes when the adversarial WGAN is replaced by a denoising diffusion model.
## What Is the Simulation Object?
Before comparing generators, it helps to separate three distinct simulation objects.
**Design A: joint row simulation.** Learn
$$
P(Y, D, W).
$$
This is the most literal generative benchmark. It tries to reproduce the observed joint table and is useful when treatment prevalence and selection are part of the object of interest.
**Design B: conditional row simulation with fixed treatment.** Learn
$$
P(Y, W \mid D),
$$
then hold the observed treatment vector fixed when generating synthetic datasets. This is the design used in the Lalonde exercise below. It isolates the fitted conditional row distribution from changes in treatment prevalence.
**Design C: outcome simulation with a separate assignment model.** Learn
$$
P(Y \mid D, W)
\quad\text{and}\quad
P(D \mid W)
$$
separately, then intervene on the assignment mechanism or on the outcome surface. This is the right object when the simulation is about overlap deterioration, targeting rules, or policy counterfactuals rather than about reproducing one realized assignment vector.
The point of this distinction is that different baselines are good at different objects. Simple resampling works well when Design B is enough. Copulas or flows are natural when likelihood structure matters. WGANs and DDPMs become most useful when the target is a complicated joint row distribution with realistic dependence and when one wants structured ablations after fitting.
## Learned Simulation Designs
A GAN has two networks. A generator maps noise $Z$ into synthetic rows:
$$
\tilde X = G_\theta(Z), \qquad Z \sim P_Z.
$$
A discriminator or critic scores whether rows look real. In a Wasserstein GAN, the population objective is
$$
\min_\theta \max_{f \in \mathcal{F}_1}
\left\{
\mathbb{E}_{X \sim P_n}[f(X)]
-
\mathbb{E}_{Z \sim P_Z}[f(G_\theta(Z))]
\right\},
$$
where $\mathcal{F}_1$ is the class of 1-Lipschitz functions [@arjovsky2017wgan]. In practice, the Lipschitz constraint is enforced by weight clipping or a gradient penalty [@gulrajani2017wgangp].
For econometric simulation this has several attractions:
- the generator is fast once trained;
- it represents complicated row distributions without a parametric likelihood;
- conditional generation can be implemented by feeding treatment, cohort, or market labels into the generator;
- the Wasserstein critic gives a training signal with a distributional interpretation.
But the adversarial game is also the main cost. The generator improves only through the critic's current gradients. If the critic is weak, the generator learns the wrong directions. If the critic is too strong or poorly regularized, training can become unstable. Mode collapse is a particularly bad failure for simulation studies because the generated sample can look locally plausible while missing subpopulations that determine estimator risk.
The optimizer choice matters because the problem is a saddle-point problem, not ordinary risk minimization. Simultaneous Adam is convenient, but the game can rotate around equilibria rather than converge. Optimistic Adam, mirror descent, and other game-aware updates are natural modifications because they use information about recent gradients to damp these rotations. That is a repair to the WGAN fitting problem, not a change in the underlying object being learned.
## Denoising Diffusion Models
A denoising diffusion probabilistic model starts from the opposite direction. Instead of asking a generator to fool a critic, it defines a known corruption process that gradually turns data into noise. The model then learns to reverse that corruption.
Let $x_0 \in \mathbb{R}^d$ denote a standardized data row. Choose a noise schedule
$$
0 < \beta_1 < \cdots < \beta_T < 1,
$$
and define
$$
\alpha_t = 1-\beta_t,
\qquad
\bar\alpha_t = \prod_{s=1}^t \alpha_s.
$$
The forward corruption process is
$$
q(x_t \mid x_{t-1})
=
\mathcal{N}\left(
\sqrt{\alpha_t}x_{t-1},
\beta_t I
\right).
$$
By Gaussian algebra, one can sample $x_t$ directly from $x_0$:
$$
x_t
=
\sqrt{\bar\alpha_t}x_0
+
\sqrt{1-\bar\alpha_t}\epsilon,
\qquad
\epsilon \sim \mathcal{N}(0,I).
$$
The model is a denoiser $\epsilon_\theta(x_t,t,c)$, where $c$ is optional context. In tabular simulations, $c$ might be treatment status, market, cohort, site, panel time, or a policy environment. The simple DDPM training objective is
$$
\min_\theta
\mathbb{E}_{x_0,t,\epsilon}
\left[
\left\|
\epsilon - \epsilon_\theta(x_t,t,c)
\right\|_2^2
\right].
$$
This is ordinary supervised learning. The label is the known noise used to corrupt the clean row. Ho, Jain, and Abbeel show how this objective arises from a variational bound and connects to denoising score matching [@ho2020ddpm].
Sampling reverses the chain. Start with
$$
x_T \sim \mathcal{N}(0,I),
$$
and recursively apply
$$
\mu_\theta(x_t,t,c)
=
\frac{1}{\sqrt{\alpha_t}}
\left(
x_t
-
\frac{\beta_t}{\sqrt{1-\bar\alpha_t}}
\epsilon_\theta(x_t,t,c)
\right).
$$
Then draw
$$
x_{t-1}
=
\mu_\theta(x_t,t,c) + \sigma_t z_t,
\qquad
z_t \sim \mathcal{N}(0,I),
$$
with $\sigma_t$ set by the chosen reverse variance. After reaching $x_0$, invert the tabular preprocessing map and apply any column constraints: binary rounding, nonnegativity, bounded support, or category decoding.
### Density Modeling Interpretation
The DDPM is a density model built from many noisy versions of the data. The forward process defines a family of smoothed densities $p_t(x)$, where $p_0$ is the data distribution and $p_T$ is close to a standard Gaussian. The denoiser learns a vector field that points noisy observations back toward regions of high probability under $p_t$.
The score of a density is
$$
s_t(x) = \nabla_x \log p_t(x).
$$
For the Gaussian corruption step,
$$
\nabla_{x_t}\log q(x_t \mid x_0)
=
-
\frac{x_t-\sqrt{\bar\alpha_t}x_0}{1-\bar\alpha_t}
=
-
\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}.
$$
Therefore a noise predictor implies a score estimate:
$$
s_\theta(x_t,t,c)
\approx
-
\frac{\epsilon_\theta(x_t,t,c)}
{\sqrt{1-\bar\alpha_t}}.
$$
This is the link to score-based generative modeling. Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole formulate the same idea in continuous time: add noise through a stochastic differential equation and sample backward using the estimated score field [@song2021sde]. DDPMs are one discrete-time implementation of that broader score-based idea.
The econometric analogy is useful. A likelihood model specifies $p_\theta(x)$ and maximizes $\sum_i \log p_\theta(X_i)$. A score-based model learns the local derivative $\nabla_x \log p_t(x)$ along many noisy versions of the distribution. It does not need a normalized likelihood. For simulation, this is enough: if the learned reverse chain approximately transports Gaussian noise into the empirical row distribution, then it supplies draws for a Monte Carlo design.
### Tabular Rows
The original DDPM success story is image generation, but the row-distribution problem is different. In image data, all coordinates are intensities with spatial structure. In an econometric row, coordinates may be continuous, binary, categorical, censored, top-coded, or logically constrained.
A practical tabular DDPM therefore needs a representation map
$$
z = S(x),
$$
where $S$ standardizes continuous variables, encodes categorical variables, and records constraints needed to invert the transformation. Continuous variables can be diffused with Gaussian noise. Binary and categorical variables can be handled in several ways:
- keep them as relaxed numeric variables during diffusion and round after sampling;
- use separate output heads and postprocessing rules;
- use discrete diffusion kernels designed for categorical state spaces.
The last option is conceptually cleaner. Multinomial diffusion and argmax flows explicitly model categorical variables [@hoogeboom2021argmax]. Structured denoising diffusion models in discrete state spaces extend the corruption process beyond uniform replacement kernels and provide a general D3PM framework [@austin2021d3pm]. TabDDPM combines Gaussian diffusion for numerical features with multinomial diffusion for categorical features and shows that diffusion can be competitive for general tabular data [@kotelnikov2023tabddpm].
For econometric simulation, the choice should be governed by the role of each variable. Rounding a binary demographic indicator may be acceptable in a first benchmark. Rounding treatment status is usually not acceptable if the estimand conditions on the realized assignment mechanism. In that case, treatment should be fixed as context or generated by a separate assignment model.
## Baseline Search: WGANs, Flows, Resampling, and ICL
The AIMM objective is to build realistic simulation designs from empirical datasets. Both WGANs and DDPMs are plausible tools. They differ mainly in what is hard.
| Feature | WGAN simulation | DDPM simulation |
|---|---|---|
| Training problem | Minimax game between generator and critic | Supervised denoising objective |
| Main tuning risk | Critic strength, Lipschitz penalty, optimizer dynamics, mode collapse | Noise schedule, denoiser capacity, sampling steps, preprocessing |
| Sampling speed | Fast single generator pass | Slower iterative denoising |
| Diagnostics | Critic loss plus downstream distribution checks | Denoising loss plus downstream distribution checks |
| Mode coverage | Can be fragile when generator collapses | Usually strong because training sees noisy neighborhoods at all scales |
| Conditional designs | Feed context to generator and critic | Feed context to denoiser at every reverse step |
| Ablations | Possible but can destabilize the adversarial game | Natural: intervene on context, corruption path, or reverse constraints |
This comparison explains why DDPMs are attractive for econometric practice. The fitting problem is closer to nonparametric regression than to adversarial equilibrium computation. Every training example supplies a known target $\epsilon$. The model is trained across many local perturbations of each row, which tends to preserve support and neighborhood structure. For a simulation study where the cost of a missed subpopulation can be high, that stability is valuable.
The tradeoff is sampling cost. A WGAN can generate $100{,}000$ rows with one forward pass per row. A DDPM may need tens or hundreds of denoising steps. Improved DDPMs learn reverse variances and improve the likelihood/sample-quality tradeoff [@nichol2021improved]. DDIMs reduce the number of sampling steps by using a deterministic or partially deterministic non-Markovian sampler with the same training objective [@song2021ddim]. In practice, a tabular simulation design can usually tolerate slower sampling because generating the synthetic benchmark is not the bottleneck relative to fitting many downstream estimators.
### Resampling and Classical Tabular Baselines
A useful benchmark set should start below deep generative models. If the goal is Design B above, then a stratified bootstrap or nearest-neighbor resampling baseline is hard to beat on transparency. It preserves exact support, needs almost no tuning, and often reproduces the empirical treatment mix and sparse cells better than a badly tuned neural generator.
A second baseline family is low-dimensional parametric or semiparametric approximation: Gaussian copulas, rank-Gaussian transforms, or factor models fit to the continuous margin and then combined with empirical or logistic models for discrete columns. These baselines are cheap, interpretable, and often surprisingly competitive on moderate-dimensional tabular problems. They are weak when the empirical design depends on multimodality, sharp support boundaries, or treatment-specific dependence that is not well summarized by a copula.
So the practical search order for a new simulation study should usually be:
1. resampling / nearest-neighbor baselines;
2. copula-style likelihood baselines;
3. neural tabular generators such as WGANs, flows, and DDPMs;
4. prompt-only in-context generation as a stress test rather than a default production baseline.
That ordering keeps the learned-design agenda honest. A deep generator should clear simple baselines on the dimensions that matter for the estimator, not just on the dimensions that look impressive in a generative-model table.
### Normalizing Flows
Normalizing flows are another natural density-modeling candidate. A flow constructs an invertible map $f_\theta$ such that
$$
z = f_\theta(x),
\qquad
z \sim p_Z,
$$
and evaluates the likelihood by the change-of-variables formula:
$$
\log p_\theta(x)
=
\log p_Z(f_\theta(x))
+
\log \left|
\det \frac{\partial f_\theta(x)}{\partial x}
\right|.
$$
This gives exact likelihoods and direct sampling, which is attractive for statistical work [@papamakarios2021flows]. The cost is architectural: $f_\theta$ must be invertible and have a tractable Jacobian determinant. Those constraints can make flows awkward for mixed discrete-continuous tabular rows, although many useful constructions exist.
Diffusion models relax invertibility. They do not give exact likelihood as directly as a flow, although likelihood bounds and probability-flow ODEs are available in the broader score-based framework. For simulation design, exact likelihood is less important than faithful reproduction of the empirical row distribution and stable conditional sampling. That tilts the practical comparison toward DDPMs unless likelihood evaluation itself is part of the research design.
### In-Context Learning as a Baseline
A different comparison class is prompt-only generation from a frontier language model. The attraction is obvious: no training loop, no architecture search, and very flexible conditioning. Give the model a few rows, ask for more rows in the same format, and it can often imitate local patterns surprisingly well.
That makes in-context learning (ICL) interesting as a diagnostic baseline, especially for tiny tabular designs or for structured what-if prompts. But it also has obvious weaknesses for econometric simulation. The model has no explicit likelihood, no stable sampling semantics, and no guarantee that repeated generations trace a coherent joint distribution. Its errors are often subtle: support drift, over-smoothing, and synthetic rows that look individually plausible but do not add up to the right treatment geometry or tail mass.
To make that concrete, I prompted `gpt-5.4` and `gpt-5.4-mini` with eight observed Lalonde/CPS-style rows and asked for eight additional synthetic rows in CSV only. The models returned:
**`gpt-5.4`**
```text
t,age,education,black,hispanic,married,nodegree,re74,re75,re78
1.0,24.0,10.0,1.0,0.0,0.0,1.0,0.0,0.0,4210.337
1.0,31.0,12.0,0.0,1.0,1.0,0.0,0.0,1567.442,10854.229
1.0,20.0,9.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1.0,27.0,11.0,0.0,0.0,0.0,1.0,1843.556,0.0,7321.884
0.0,39.0,12.0,0.0,0.0,1.0,0.0,18654.22,19487.63,22105.48
0.0,23.0,13.0,0.0,1.0,0.0,0.0,6421.318,7135.904,12844.67
0.0,41.0,10.0,1.0,0.0,1.0,1.0,11254.77,9873.441,15432.98
0.0,29.0,15.0,0.0,0.0,0.0,0.0,14387.56,16842.11,19765.34
```
**`gpt-5.4-mini`**
```text
t,age,education,black,hispanic,married,nodegree,re74,re75,re78
1.0,31.0,10.0,1.0,0.0,0.0,1.0,0.0,0.0,7421.563
1.0,24.0,12.0,0.0,1.0,0.0,0.0,1250.0,0.0,4688.214
1.0,40.0,8.0,1.0,0.0,1.0,1.0,0.0,0.0,10345.77
1.0,27.0,13.0,0.0,0.0,0.0,0.0,540.25,1120.5,6890.332
0.0,38.0,12.0,1.0,0.0,1.0,0.0,18420.75,20110.34,21987.6
0.0,29.0,9.0,0.0,1.0,0.0,1.0,4320.18,3895.62,5120.44
0.0,50.0,10.0,1.0,0.0,1.0,1.0,16750.0,15890.25,17120.89
0.0,23.0,15.0,0.0,0.0,0.0,0.0,6200.0,8450.5,12675.31
```
These completions are not ridiculous. They respect basic support constraints, preserve the visual distinction between treated and control earnings histories, and produce rows that a human reader might accept on a skim. But that is exactly why ICL is a useful foil. The outputs are plausible at the row level while remaining uncalibrated at the distribution level. There is no guarantee that they preserve the dependence structure that makes an estimator difficult, and no clean way to map prompt engineering choices into a simulation design argument.
For now, ICL is best treated as a lightweight stress test or idea generator. It is not yet a replacement for a fitted simulator with explicit diagnostics, reproducible sampling, and a well-defined conditioning object.
### Trex API Surface
Trex exposes WGAN and DDPM simulators through the same minimal interface: fit on a matrix of transformed tabular rows, optionally condition on a context matrix, and sample synthetic transformed rows. The fitted rows are transformed back to data scale with `TabularTransformer`.
For the WGAN benchmark, the class is `trex.TabularWGAN`:
```python
TabularWGAN(
hidden_dims=(128, 128, 128),
critic_hidden_dims=None,
noise_dim=None,
batch_size=128,
max_steps=1000,
critic_steps=5,
lr=1e-4,
optimizer="adam", # or "optimistic_adam"
gp_weight=5.0,
generator_dropout=0.1,
critic_dropout=0.0,
binary_dims=(),
lower_bounds=None,
upper_bounds=None,
seed=None,
device=None,
)
wgan.fit(X, context=None)
fake_z = wgan.sample(n, context=None)
```
For the DDPM benchmark, the class is `trex.TabularDiffusion`:
```python
TabularDiffusion(
hidden_dims=(128, 128, 128),
time_dim=32,
n_timesteps=100,
beta_start=1e-4,
beta_end=0.02,
batch_size=128,
max_steps=1000,
lr=1e-3,
dropout=0.0,
seed=None,
device=None,
)
ddpm.fit(X, context=None)
fake_z = ddpm.sample(n, context=None)
```
A typical tabular call is:
```python
transformer = TabularTransformer(
column_names=columns,
binary_columns=binary_columns,
nonnegative_columns=nonnegative_columns,
).fit(df[columns])
X = transformer.transform(df[columns])
context = df[["t"]].to_numpy()
ddpm = TabularDiffusion(max_steps=1000, seed=123, device="cuda")
ddpm.fit(X, context=context)
fake = transformer.inverse_transform(
ddpm.sample(len(df), context=context).numpy()
)
```
This branch does not currently expose a `TabularFlow` class. A normalizing-flow simulator would fit the same interface naturally:
```python
flow.fit(X, context=None)
fake_z = flow.sample(n, context=None)
logp = flow.log_prob(X, context=None)
```
The extra method is `log_prob`, since exact or tractable likelihood evaluation is the main reason to add a flow baseline. The implementation choice would then be a conditional spline flow, masked autoregressive flow, or coupling flow depending on whether tabular likelihood or fast sampling is the priority.
## Distance Metrics
Let $X_1,\ldots,X_n \in \mathbb{R}^d$ denote observed rows and $\tilde X_1,\ldots,\tilde X_m \in \mathbb{R}^d$ denote synthetic rows. All diagnostics below are computed after the same preprocessing convention used for the simulation exercise; in the Lalonde application, earnings are measured in thousands of dollars. Lower values indicate closer agreement.
These are still generator-native diagnostics, so they need interpretation. In a causal simulation study, the object is not generic sample quality. The object is whether the synthetic design preserves overlap, treatment imbalance, leverage points, tail behavior, and the nuisance-function geometry that drives finite-sample estimator risk. The metrics below are therefore proxies, not sufficient statistics. A good simulation design should be able to explain which feature each metric is meant to guard.
For coordinate $j$, the one-dimensional Wasserstein distance is
$$
W_{1j}
=
\int_0^1
\left|
\hat F_{j}^{-1}(u)
-
\hat G_{j}^{-1}(u)
\right|du,
$$
where $\hat F_j$ and $\hat G_j$ are the empirical CDFs of observed and synthetic column $j$. The reported `marginal_w1_mean` and `marginal_w1_max` are
$$
\frac{1}{d}\sum_{j=1}^d W_{1j},
\qquad
\max_{1 \leq j \leq d} W_{1j}.
$$
The Kolmogorov-Smirnov distance for coordinate $j$ is
$$
K_j
=
\sup_x
\left|
\hat F_j(x) - \hat G_j(x)
\right|.
$$
The reported `marginal_ks_mean` and `marginal_ks_max` average and maximize $K_j$ across columns. The KS distance is scale-free, while Wasserstein distance preserves the units of each variable.
The mean-distance diagnostic is
$$
\text{mean\_l2}
=
\left\|
\bar X - \bar{\tilde X}
\right\|_2.
$$
It is a location check. It can be small even when tails or dependence are wrong.
Let $\hat\Sigma_X$ and $\hat\Sigma_{\tilde X}$ be empirical covariance matrices. The covariance diagnostic is
$$
\text{cov\_frobenius}
=
\left\|
\hat\Sigma_X - \hat\Sigma_{\tilde X}
\right\|_F.
$$
Because covariance is measured in original units, high-variance earnings columns can dominate it. The correlation diagnostic repeats the same calculation after converting covariance matrices into correlation matrices:
$$
\text{corr\_frobenius}
=
\left\|
\hat R_X - \hat R_{\tilde X}
\right\|_F.
$$
This is the cleaner summary of linear dependence across mixed-scale columns.
Finally, sliced Wasserstein averages one-dimensional Wasserstein distances over random projections. For a unit vector $u \in \mathbb{S}^{d-1}$, project rows to $u'X_i$ and $u'\tilde X_i$. The population target is
$$
\text{SW}_1(X,\tilde X)
=
\mathbb{E}_{u \sim \operatorname{Unif}(\mathbb{S}^{d-1})}
\left[
W_1(u'X, u'\tilde X)
\right].
$$
The reported value approximates this expectation with random directions. This is useful in tabular simulation because it can detect joint-distribution failures that are invisible in marginal diagnostics, while remaining much cheaper than exact high-dimensional optimal transport.
## Simulation Protocol
A useful DDPM simulator for econometrics should make the target distribution explicit. The following workflow is close to the AIMM proposal but swaps the fitting engine:
1. Choose the empirical target: unconditional rows, rows conditional on treatment, rows within market, or a panel transition distribution.
2. Define a preprocessing map $S$ and record all inverse transformations.
3. Decide which variables are generated and which are fixed as context.
4. Fit a denoiser with a transparent noise schedule and held-out diagnostics.
5. Generate synthetic samples with the same sample size, treatment shares, panel length, or market composition as the empirical design.
6. Report distances between real and generated data: marginal Wasserstein, KS distances, mean distance, covariance and correlation Frobenius norms, sliced Wasserstein, and downstream estimand replication.
7. Run estimator comparisons on many generated datasets.
8. Add ablations by changing one empirical feature at a time: overlap, tails, treatment imbalance, covariance, policy rules, or outcome noise.
The ablation step is where diffusion can be especially useful. Because generation is a gradual reverse process, one can condition, guide, or project intermediate draws. Examples include:
- holding treatment fixed and generating only $(Y,W)$ conditional on $D$;
- matching marginal treatment shares exactly while allowing covariates and outcomes to vary;
- projecting generated rows back into known support restrictions;
- increasing or decreasing tail noise to stress-test robust estimators;
- conditioning on policy environments to compare counterfactual regimes.
The relevant standard is practical: does the DDPM improve the credibility of finite-sample estimator comparisons?
## Empirical Illustration
### Iris Sanity Check
The Iris dataset is too small and too clean to be an econometric benchmark. It is useful here because it lets us see the mechanics without hiding behind a large application. The row is
$$
x_i =
(\text{sepal length}, \text{sepal width}, \text{petal length}, \text{petal width}, \text{species}).
$$
The exercise fixes the species counts and learns the conditional distribution of the four continuous measurements given species. This is the same design pattern one would use when treatment status or site membership is part of the simulation design rather than a variable the generator is allowed to change.
```{python}
#| label: setup
#| echo: true
#| output: false
from __future__ import annotations
import sys
import time
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from scipy.stats import wasserstein_distance
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine
from sklearn.model_selection import train_test_split
TREX_ROOT = Path("/Users/alal/Desktop/code/trex")
if str(TREX_ROOT) not in sys.path:
sys.path.insert(0, str(TREX_ROOT))
from trex import TabularDiffusion, TabularTransformer, TabularWGAN, distribution_metrics
SEED = 20260526
rng = np.random.default_rng(SEED)
torch.manual_seed(SEED)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)
def one_hot(values: pd.Series | np.ndarray) -> np.ndarray:
values = np.asarray(values, dtype=int)
levels = np.sort(np.unique(values))
return np.column_stack([(values == level).astype(float) for level in levels])
def tidy_metrics(real: pd.DataFrame, fake: pd.DataFrame, columns: list[str], label: str) -> pd.DataFrame:
out = distribution_metrics(real[columns].to_numpy(), fake[columns].to_numpy(), seed=SEED)
return pd.DataFrame([{"sample": label, **out}])
```
```{python}
#| label: iris-fit
iris = load_iris(as_frame=True)
iris_df = iris.frame.copy()
iris_df.columns = [
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
"species",
]
iris_df["species"] = iris_df["species"].astype(int)
iris_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_context = one_hot(iris_df["species"])
iris_transformer = TabularTransformer(
column_names=iris_features,
nonnegative_columns=iris_features,
).fit(iris_df[iris_features])
iris_z = iris_transformer.transform(iris_df[iris_features])
iris_lower, iris_upper = iris_transformer.transformed_bounds()
iris_wgan = TabularWGAN(
hidden_dims=(64, 64),
critic_hidden_dims=(64, 64),
batch_size=64,
max_steps=500,
critic_steps=2,
lr=1e-4,
optimizer="optimistic_adam",
lower_bounds=iris_lower,
upper_bounds=iris_upper,
seed=SEED,
device=DEVICE,
)
start = time.perf_counter()
iris_wgan.fit(iris_z, context=iris_context)
iris_wgan_seconds = time.perf_counter() - start
iris_wgan_z = iris_wgan.sample(len(iris_df), context=iris_context).numpy()
iris_wgan_fake = pd.DataFrame(
iris_transformer.inverse_transform(iris_wgan_z),
columns=iris_features,
)
iris_wgan_fake["species"] = iris_df["species"].to_numpy()
iris_ddpm = TabularDiffusion(
hidden_dims=(64, 64),
n_timesteps=50,
max_steps=700,
batch_size=64,
lr=1e-3,
seed=SEED,
device=DEVICE,
)
start = time.perf_counter()
iris_ddpm.fit(iris_z, context=iris_context)
iris_seconds = time.perf_counter() - start
iris_fake_z = iris_ddpm.sample(len(iris_df), context=iris_context).numpy()
iris_fake = pd.DataFrame(
iris_transformer.inverse_transform(iris_fake_z),
columns=iris_features,
)
iris_fake["species"] = iris_df["species"].to_numpy()
iris_metrics = pd.concat(
[
tidy_metrics(
iris_df,
iris_wgan_fake,
[*iris_features, "species"],
f"WGAN synthetic, {iris_wgan_seconds:.1f}s",
),
tidy_metrics(
iris_df,
iris_fake,
[*iris_features, "species"],
f"DDPM synthetic, {iris_seconds:.1f}s",
),
],
ignore_index=True,
)
iris_metrics.to_csv(RESULTS_DIR / "iris_metrics.csv", index=False)
iris_metrics[[
"sample",
"marginal_w1_mean",
"marginal_ks_mean",
"mean_l2",
"cov_frobenius",
"corr_frobenius",
"sliced_wasserstein",
]].round(4).style.hide(axis="index")
```
```{python}
#| label: iris-plot
#| fig-width: 9
#| fig-height: 3.4
species_names = dict(enumerate(iris.target_names))
fig, axes = plt.subplots(1, 3, figsize=(9, 3.4), sharex=True, sharey=True, constrained_layout=True)
for ax, data, title in [
(axes[0], iris_df, "Observed Iris"),
(axes[1], iris_wgan_fake, "WGAN synthetic"),
(axes[2], iris_fake, "DDPM synthetic"),
]:
for species, group in data.groupby("species"):
ax.scatter(
group["sepal_length"],
group["petal_length"],
s=28,
alpha=0.75,
label=species_names[int(species)],
)
ax.set_title(title)
ax.set_xlabel("sepal length")
axes[0].set_ylabel("petal length")
axes[2].legend(frameon=False, loc="lower right")
plt.show()
```
The Iris example is deliberately modest. Its diagnostic role is to check whether small conditional generators preserve species-conditional clusters and produce plausible continuous measurements. The WGAN panel shows the adversarial baseline; the DDPM panel shows the same task fit through supervised denoising.
### Held-Out Distance Benchmarks
For reporting generator quality, a held-out comparison is usually preferable to comparing synthetic rows against the same empirical rows used for fitting. The in-sample distance answers whether the generator can reproduce the training distribution. The held-out distance asks a more relevant question: after fitting on one sample, how close are generated rows to new rows from the same empirical source?
The following benchmark uses small numeric datasets bundled with scikit-learn. For each dataset, the models fit only the training rows. They then generate the same number of rows as the test split, and distances are computed against the held-out test rows. A train-resample baseline draws rows with replacement from the training set; it is not a learned generator, but it gives a useful finite-sample reference.
```{python}
#| label: sklearn-heldout
def sklearn_frame(loader, name: str) -> pd.DataFrame:
data = loader(as_frame=True)
frame = data.frame.copy()
if "target" in frame.columns:
frame = frame.drop(columns=["target"])
frame.columns = [f"x{j}" for j in range(frame.shape[1])]
frame = frame.astype(float)
frame.attrs["dataset_name"] = name
return frame
sklearn_datasets = [
sklearn_frame(load_iris, "iris features"),
sklearn_frame(load_wine, "wine features"),
sklearn_frame(load_breast_cancer, "breast cancer features"),
sklearn_frame(load_diabetes, "diabetes features"),
]
def fit_small_generators(train: pd.DataFrame, test: pd.DataFrame, seed: int) -> pd.DataFrame:
columns = list(train.columns)
transformer = TabularTransformer(
column_names=columns,
nonnegative_columns=columns,
).fit(train)
z_train = transformer.transform(train)
lower, upper = transformer.transformed_bounds()
n_test = len(test)
wgan_model = TabularWGAN(
hidden_dims=(48, 48),
critic_hidden_dims=(48, 48),
batch_size=min(128, len(train)),
max_steps=250,
critic_steps=2,
lr=1e-4,
optimizer="optimistic_adam",
lower_bounds=lower,
upper_bounds=upper,
seed=seed,
device=DEVICE,
)
ddpm_model = TabularDiffusion(
hidden_dims=(48, 48),
n_timesteps=50,
batch_size=min(128, len(train)),
max_steps=300,
lr=1e-3,
seed=seed,
device=DEVICE,
)
wgan_model.fit(z_train)
ddpm_model.fit(z_train)
train_resample = train.sample(n_test, replace=True, random_state=seed).reset_index(drop=True)
wgan_fake = pd.DataFrame(
transformer.inverse_transform(wgan_model.sample(n_test).numpy()),
columns=columns,
)
ddpm_fake = pd.DataFrame(
transformer.inverse_transform(ddpm_model.sample(n_test).numpy()),
columns=columns,
)
rows = []
for label, fake in [
("train resample", train_resample),
("WGAN", wgan_fake),
("DDPM", ddpm_fake),
]:
metrics = distribution_metrics(
test[columns].to_numpy(),
fake[columns].to_numpy(),
n_projections=64,
seed=seed,
)
rows.append({"model": label, **metrics})
return pd.DataFrame(rows)
heldout_rows = []
for k, frame in enumerate(sklearn_datasets):
train, test = train_test_split(frame, test_size=0.30, random_state=SEED + k)
result = fit_small_generators(
train.reset_index(drop=True),
test.reset_index(drop=True),
SEED + 100 * k,
)
result.insert(0, "dataset", frame.attrs["dataset_name"])
result.insert(1, "n_train", len(train))
result.insert(2, "n_test", len(test))
heldout_rows.append(result)
heldout_table = pd.concat(heldout_rows, ignore_index=True)
heldout_table.to_csv(RESULTS_DIR / "sklearn_heldout_metrics.csv", index=False)
heldout_table[[
"dataset",
"n_train",
"n_test",
"model",
"marginal_w1_mean",
"marginal_ks_mean",
"mean_l2",
"corr_frobenius",
"sliced_wasserstein",
]].round(4).style.hide(axis="index")
```
```{python}
#| label: sklearn-heldout-plot
#| fig-width: 8
#| fig-height: 3.6
fig, ax = plt.subplots(figsize=(8, 3.6), constrained_layout=True)
plot_data = heldout_table.pivot(index="dataset", columns="model", values="sliced_wasserstein")
plot_data = plot_data[["train resample", "WGAN", "DDPM"]]
x = np.arange(len(plot_data))
width = 0.24
for j, model in enumerate(plot_data.columns):
ax.bar(x + (j - 1) * width, plot_data[model], width=width, label=model)
ax.set_xticks(x)
ax.set_xticklabels(plot_data.index, rotation=20, ha="right")
ax.set_ylabel("sliced Wasserstein to held-out test rows")
ax.set_title("Held-out row-distribution distance")
ax.legend(frameon=False, ncols=3)
plt.show()
```
Held-out distances are noisy on these small datasets, but they are conceptually cleaner than in-sample distances. A generator should be judged against rows it did not see during fitting, and the train-resample baseline helps distinguish model failure from the ordinary sampling variability of a small empirical distribution.
### Lalonde/CPS Design
The AIMM paper uses WGANs to build simulation designs for treatment-effect estimators. The canonical version of that problem starts from the Lalonde/Dehejia-Wahba treated sample and a large observational control pool, then asks whether nonexperimental estimators recover an experimental benchmark. The row distribution is difficult for simple parametric simulation because treatment is rare, earnings are heavy-tailed and truncated, and the relationship between demographics and earnings is not well described by a small Gaussian model.
The exercise below loads the local `ds-wgan` Lalonde/CPS data, scales earnings to thousands of dollars, fits a conditional WGAN and a conditional DDPM, and evaluates both as simulation DGPs. Treatment status is held fixed as context, so each generator learns the conditional distribution of demographics and earnings given treatment. If the convenience `cps_merged.feather` file is absent, the code reconstructs it from `ds-wgan/data/original_data/cps.feather` and `experimental.feather`; these two files are the actual source data.
```{python}
#| label: lalonde-load
#| output: false
LDW_COLUMNS = [
"t", "age", "education", "black", "hispanic",
"married", "nodegree", "re74", "re75", "re78",
]
EARNINGS_COLUMNS = ["re74", "re75", "re78"]
BINARY_COLUMNS = ["black", "hispanic", "married", "nodegree"]
NONNEGATIVE_COLUMNS = ["age", "education", "re74", "re75", "re78"]
COVARIATES = ["age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75"]
OUTCOME = "re78"
TREATMENT = "t"
code_root = TREX_ROOT.parent
merged_path = code_root / "dswgan-paper" / "data" / "original_data" / "cps_merged.feather"
cps_path = code_root / "ds-wgan" / "data" / "original_data" / "cps.feather"
experimental_path = code_root / "ds-wgan" / "data" / "original_data" / "experimental.feather"
if cps_path.exists() and experimental_path.exists():
cps = pd.read_feather(cps_path)
experimental = pd.read_feather(experimental_path)
if list(cps.columns) != list(experimental.columns):
raise ValueError("cps.feather and experimental.feather column mismatch")
cps_t_values = set(pd.Series(cps[TREATMENT]).dropna().astype(float).unique().tolist())
if cps_t_values == {0.0, 1.0}:
lalonde_raw = cps.copy()
data_source = "ds-wgan cps.feather, already containing treated and control rows"
else:
treated = experimental.loc[experimental[TREATMENT].astype(float) == 1.0]
controls = cps.loc[cps[TREATMENT].astype(float) == 0.0]
lalonde_raw = pd.concat([treated, controls], ignore_index=True)
data_source = "ds-wgan experimental treated rows stacked with ds-wgan cps controls"
merged_path.parent.mkdir(parents=True, exist_ok=True)
lalonde_raw.reset_index(drop=True).to_feather(merged_path)
data_path = merged_path
elif merged_path.exists():
data_path = merged_path
lalonde_raw = pd.read_feather(data_path)
data_source = "fallback existing cps_merged.feather convenience file"
else:
raise FileNotFoundError(
"Expected local ds-wgan CPS sources at "
f"{cps_path} and {experimental_path}; merged convenience path is {merged_path}"
)
lalonde = lalonde_raw[LDW_COLUMNS].astype(float)
lalonde[EARNINGS_COLUMNS] = lalonde[EARNINGS_COLUMNS] / 1000.0
lalonde_summary = (
lalonde[TREATMENT]
.value_counts()
.rename_axis("treatment")
.reset_index(name="n")
.assign(share=lambda d: d["n"] / d["n"].sum())
)
treated_n = int(lalonde[TREATMENT].sum())
control_n = int(len(lalonde) - treated_n)
treated_share = treated_n / len(lalonde)
```
The Lalonde/CPS sample has 16,177 rows, with 185 treated observations and 15,992 controls; the treated share is 1.14 percent. Earnings are divided by 1,000 before fitting so that the neural models are not dominated by raw dollar scales. In this render, the loaded source is recorded by `data_source`; the source files are the local `ds-wgan` CPS/experimental feathers, with `cps_merged.feather` only a reproducible convenience copy.
```{python}
#| label: lalonde-provenance
provenance_table = pd.DataFrame([
{
"loaded_source": data_source,
"rows": len(lalonde),
"treated": treated_n,
"controls": control_n,
"treated_share": treated_share,
"cps_source_exists": cps_path.exists(),
"experimental_source_exists": experimental_path.exists(),
"merged_convenience_path": str(merged_path),
}
])
provenance_table.to_csv(RESULTS_DIR / "lalonde_data_provenance.csv", index=False)
provenance_table.style.hide(axis="index")
```
```{python}
#| label: lalonde-estimand
#| output: false
def design_matrix(df: pd.DataFrame) -> np.ndarray:
return np.column_stack([np.ones(len(df)), df[COVARIATES].to_numpy(dtype=float)])
def regression_adjustment_att(df: pd.DataFrame) -> dict[str, float | int | str]:
data = df[LDW_COLUMNS].astype(float).copy()
t = data[TREATMENT].to_numpy(dtype=int)
y = data[OUTCOME].to_numpy(dtype=float)
X = design_matrix(data)
treated = t == 1
control = t == 0
if treated.sum() < 5 or control.sum() <= X.shape[1] + 1:
return {
"att_thousand": np.nan,
"att_dollars": np.nan,
"se_thousand": np.nan,
"se_dollars": np.nan,
"n": len(data),
"n_treated": int(treated.sum()),
"n_control": int(control.sum()),
"status": "insufficient support",
}
beta0 = np.linalg.lstsq(X[control], y[control], rcond=None)[0]
treated_effects = y[treated] - X[treated] @ beta0
att = float(treated_effects.mean())
se = float(treated_effects.std(ddof=1) / np.sqrt(treated.sum()))
return {
"att_thousand": att,
"att_dollars": 1000.0 * att,
"se_thousand": se,
"se_dollars": 1000.0 * se,
"n": len(data),
"n_treated": int(treated.sum()),
"n_control": int(control.sum()),
"status": "ok",
}
def result_row(name: str, df: pd.DataFrame) -> dict[str, float | int | str]:
row = {"sample": name, **regression_adjustment_att(df)}
row["treated_share"] = row["n_treated"] / row["n"]
return row
```
```{python}
#| label: lalonde-fit
#| output: false
feature_columns = [column for column in LDW_COLUMNS if column != TREATMENT]
feature_binary = [column for column in BINARY_COLUMNS if column in feature_columns]
feature_nonnegative = [column for column in NONNEGATIVE_COLUMNS if column in feature_columns]
features = lalonde[feature_columns].copy()
context = lalonde[[TREATMENT]].to_numpy(dtype=np.float32)
lalonde_transformer = TabularTransformer(
column_names=feature_columns,
binary_columns=feature_binary,
nonnegative_columns=feature_nonnegative,
).fit(features)
train_z = lalonde_transformer.transform(features)
lower, upper = lalonde_transformer.transformed_bounds()
def rebuild_lalonde(generated_features: np.ndarray, generated_t: np.ndarray) -> pd.DataFrame:
out = pd.DataFrame(generated_features, columns=feature_columns)
out.insert(0, TREATMENT, generated_t.reshape(-1).astype(float))
out = out[LDW_COLUMNS].astype(float)
for col in [TREATMENT, *BINARY_COLUMNS]:
out[col] = (np.clip(out[col], 0.0, 1.0) >= 0.5).astype(float)
for col in NONNEGATIVE_COLUMNS:
out[col] = np.maximum(out[col], 0.0)
return out
wgan = TabularWGAN(
hidden_dims=(64, 64),
critic_hidden_dims=(64, 64),
batch_size=512,
max_steps=450,
critic_steps=2,
lr=1e-4,
optimizer="optimistic_adam",
gp_weight=5.0,
binary_dims=[feature_columns.index(col) for col in feature_binary],
lower_bounds=lower,
upper_bounds=upper,
seed=SEED,
device=DEVICE,
)
ddpm = TabularDiffusion(
hidden_dims=(64, 64),
n_timesteps=80,
batch_size=512,
max_steps=650,
lr=1e-3,
seed=SEED,
device=DEVICE,
)
start = time.perf_counter()
wgan.fit(train_z, context=context)
wgan_seconds = time.perf_counter() - start
start = time.perf_counter()
ddpm.fit(train_z, context=context)
ddpm_seconds = time.perf_counter() - start
wgan_fake = rebuild_lalonde(
lalonde_transformer.inverse_transform(wgan.sample(len(lalonde), context=context).numpy()),
context,
)
ddpm_fake = rebuild_lalonde(
lalonde_transformer.inverse_transform(ddpm.sample(len(lalonde), context=context).numpy()),
context,
)
fit_times = pd.DataFrame([
{"model": "WGAN-GP", "seconds": wgan_seconds},
{"model": "DDPM", "seconds": ddpm_seconds},
])
```
The timing comparison is included only to show the practical cost of the two fitting routines in this small benchmark. It is not the main estimand of the exercise.
```{python}
#| label: fit-time-plot
#| fig-width: 5
#| fig-height: 2.4
fig, ax = plt.subplots(figsize=(5, 2.4), constrained_layout=True)
ax.bar(fit_times["model"], fit_times["seconds"])
ax.set_ylabel("seconds")
ax.set_title("Generator fit time")
plt.show()
```
```{python}
#| label: lalonde-results
att_table = pd.DataFrame([
result_row("Observed CPS", lalonde),
result_row("WGAN-GP synthetic", wgan_fake),
result_row("DDPM synthetic", ddpm_fake),
])
metric_table = pd.concat([
tidy_metrics(lalonde, wgan_fake, LDW_COLUMNS, "WGAN-GP synthetic"),
tidy_metrics(lalonde, ddpm_fake, LDW_COLUMNS, "DDPM synthetic"),
], ignore_index=True)
att_table.to_csv(RESULTS_DIR / "lalonde_att_table.csv", index=False)
metric_table.to_csv(RESULTS_DIR / "lalonde_metric_table.csv", index=False)
att_table.round(4).style.hide(axis="index")
```
```{python}
#| label: lalonde-metrics
metric_table[[
"sample",
"marginal_w1_mean",
"marginal_ks_mean",
"mean_l2",
"cov_frobenius",
"corr_frobenius",
"sliced_wasserstein",
]].round(4).style.hide(axis="index")
```
The ATT and distance tables separate two questions. The ATT table asks whether the downstream regression-adjustment estimand is similar. The distance table asks whether the generated rows resemble the empirical CPS rows before any estimator is applied. In the rendered run, the DDPM is closer on every reported distributional distance: marginal Wasserstein, marginal KS, mean distance, covariance distance, correlation distance, and sliced Wasserstein. This is the main evidence that the DDPM is giving a more faithful simulation DGP than the WGAN in this application.
The sliced Wasserstein distance is especially useful here because it is multivariate. It projects the full row vector onto many random directions and averages the one-dimensional Wasserstein distances. A generator can match each column separately while getting the joint geometry wrong; sliced Wasserstein is meant to catch that failure without solving a full high-dimensional optimal-transport problem. The Frobenius distances for covariance and correlation matrices provide a more classical second-moment version of the same check.
```{python}
#| label: lalonde-plot
#| fig-width: 9
#| fig-height: 3.4
plot_items = [
("Observed CPS", lalonde),
("WGAN-GP synthetic", wgan_fake),
("DDPM synthetic", ddpm_fake),
]
fig, axes = plt.subplots(1, 3, figsize=(9, 3.4), sharex=True, sharey=True, constrained_layout=True)
for ax, (title, df) in zip(axes, plot_items):
sample = df.sample(min(4000, len(df)), random_state=SEED)
ax.scatter(sample["re75"], sample["re78"], s=5, alpha=0.16)
ax.set_title(title)
ax.set_xlabel("1975 earnings, thousands")
axes[0].set_ylabel("1978 earnings, thousands")
plt.show()
```
One synthetic dataset is only a first pass. A generator can match the ATT and still miss the row distribution. The distance table and scatter plot are therefore part of the simulation design, not decoration. The DDPM is attractive if it gives a better joint distribution while preserving estimator-relevant features such as treatment imbalance, heavy earnings tails, and the dependence between pre-treatment and post-treatment earnings.
### Repeated Simulation Results
The single synthetic draw above answers whether one fitted generator can produce a plausible CPS-like table. A simulation study also needs repeated draws from the fitted generator. The estimand and sample geometry are held fixed: every synthetic dataset has the observed sample size and the observed treatment vector. This isolates the fitted row distribution from changes in treatment prevalence.
The repeated design is:
$$
\tilde X_b^{(m)}
\sim
\hat P_m(\cdot \mid D = d_i),
\qquad
i=1,\ldots,n,\quad
b=1,\ldots,B,
$$
where $m \in \{\text{WGAN}, \text{DDPM}\}$ indexes the generator and $d_i$ is the observed treatment assignment for row $i$. Each replicate is evaluated by the same regression-adjustment ATT and by the same distribution distances used above.
```{python}
#| label: repeated-simulation
def synthetic_from_wgan(context_array: np.ndarray) -> pd.DataFrame:
generated = wgan.sample(len(context_array), context=context_array).numpy()
return rebuild_lalonde(
lalonde_transformer.inverse_transform(generated),
context_array,
)
def synthetic_from_ddpm(context_array: np.ndarray) -> pd.DataFrame:
generated = ddpm.sample(len(context_array), context=context_array).numpy()
return rebuild_lalonde(
lalonde_transformer.inverse_transform(generated),
context_array,
)
def simulation_row(model_name: str, replicate: int, df: pd.DataFrame) -> dict[str, float | int | str]:
att = result_row(model_name, df)
metrics = distribution_metrics(
lalonde[LDW_COLUMNS].to_numpy(),
df[LDW_COLUMNS].to_numpy(),
n_projections=64,
seed=SEED + replicate,
)
return {
"model": model_name,
"replicate": replicate,
"att_dollars": att["att_dollars"],
"se_dollars": att["se_dollars"],
"att_error_dollars": att["att_dollars"] - regression_adjustment_att(lalonde)["att_dollars"],
"sliced_wasserstein": metrics["sliced_wasserstein"],
"marginal_ks_mean": metrics["marginal_ks_mean"],
"corr_frobenius": metrics["corr_frobenius"],
"cov_frobenius": metrics["cov_frobenius"],
}
N_REPLICATES = 12
simulation_rows = []
for b in range(N_REPLICATES):
simulation_rows.append(simulation_row("WGAN-GP", b, synthetic_from_wgan(context)))
simulation_rows.append(simulation_row("DDPM", b, synthetic_from_ddpm(context)))
simulation_table = pd.DataFrame(simulation_rows)
simulation_summary = (
simulation_table
.groupby("model")
.agg(
att_mean_dollars=("att_dollars", "mean"),
att_sd_dollars=("att_dollars", "std"),
abs_att_error_mean=("att_error_dollars", lambda x: np.mean(np.abs(x))),
sliced_wasserstein_mean=("sliced_wasserstein", "mean"),
marginal_ks_mean=("marginal_ks_mean", "mean"),
corr_frobenius_mean=("corr_frobenius", "mean"),
cov_frobenius_mean=("cov_frobenius", "mean"),
)
.reset_index()
)
simulation_table.to_csv(RESULTS_DIR / "lalonde_repeated_draws.csv", index=False)
simulation_summary.to_csv(RESULTS_DIR / "lalonde_repeated_summary.csv", index=False)
simulation_summary.round(4).style.hide(axis="index")
```
The repeated draws turn the generator comparison into a Monte Carlo object. The ATT columns ask whether the downstream estimator sees a similar treatment-effect problem. The distance columns ask whether the synthetic tables reproduce the empirical distribution that makes the treatment-effect problem hard. In this run, the WGAN has a lower average absolute ATT error, while the DDPM is much closer on the joint distribution metrics. That split is informative: an estimator-specific target can look acceptable even when the generated table is too far from the empirical row distribution. For simulation design, the DDPM is the more credible row-distribution baseline in this run because it preserves more of the distribution that future estimators will actually face. For estimator-specific calibration, the WGAN result is a warning: a generator can be less faithful to the table and still hit one downstream target more closely.
```{python}
#| label: repeated-simulation-plot
#| fig-width: 8
#| fig-height: 3.6
fig, axes = plt.subplots(1, 2, figsize=(8, 3.6), constrained_layout=True)
for model, group in simulation_table.groupby("model"):
axes[0].scatter(
np.full(len(group), 0 if model == "WGAN-GP" else 1) + rng.normal(0, 0.035, len(group)),
group["att_dollars"],
alpha=0.75,
s=26,
label=model,
)
axes[1].scatter(
np.full(len(group), 0 if model == "WGAN-GP" else 1) + rng.normal(0, 0.035, len(group)),
group["sliced_wasserstein"],
alpha=0.75,
s=26,
label=model,
)
axes[0].axhline(regression_adjustment_att(lalonde)["att_dollars"], color="black", lw=1, ls="--")
axes[0].set_title("ATT across synthetic datasets")
axes[0].set_ylabel("ATT, dollars")
axes[1].set_title("Sliced Wasserstein")
axes[1].set_ylabel("distance")
for ax in axes:
ax.set_xticks([0, 1])
ax.set_xticklabels(["WGAN", "DDPM"])
plt.show()
```
### Treatment-Share Ablation
The CPS target has only 185 treated observations out of 16,177 rows. That imbalance is part of the empirical problem. A learned conditional generator lets us change this feature while keeping the fitted conditional row distributions fixed.
The ablation below draws a balanced sample with 500 treated and 500 controls from each fitted generator. This does not estimate a new population parameter. It asks how the regression-adjustment estimand behaves when the rare-treatment geometry is relaxed while the conditional outcome and covariate distributions are generated by the same fitted model.
```{python}
#| label: treatment-share-ablation
balanced_context = np.r_[np.ones((500, 1)), np.zeros((500, 1))].astype(np.float32)
wgan_balanced = synthetic_from_wgan(balanced_context)
ddpm_balanced = synthetic_from_ddpm(balanced_context)
ablation_table = pd.DataFrame([
result_row("WGAN-GP observed share", wgan_fake),
result_row("DDPM observed share", ddpm_fake),
result_row("WGAN-GP balanced 500/500", wgan_balanced),
result_row("DDPM balanced 500/500", ddpm_balanced),
])
ablation_table.to_csv(RESULTS_DIR / "lalonde_treatment_share_ablation.csv", index=False)
ablation_table.round(4).style.hide(axis="index")
```
```{python}
#| label: treatment-share-plot
#| fig-width: 8
#| fig-height: 3.6
ablation_plot = ablation_table.query("sample != 'Observed CPS'").copy()
fig, ax = plt.subplots(figsize=(8, 3.6), constrained_layout=True)
ax.barh(ablation_plot["sample"], ablation_plot["att_dollars"])
ax.axvline(regression_adjustment_att(lalonde)["att_dollars"], color="black", lw=1, ls="--")
ax.set_xlabel("regression-adjustment ATT, dollars")
ax.set_title("ATT under treatment-share ablation")
plt.show()
```
This is the kind of ablation that learned simulation designs should make routine. The empirical target is used to fit the conditional row distribution; the analyst then changes one design feature and watches the downstream estimator respond. More substantive ablations would alter overlap, tail thickness, panel persistence, market concentration, or the outcome equation. The same implementation pattern applies: hold the fitted conditional generator fixed, change the design object deliberately, and report both distributional diagnostics and estimator consequences.
The balanced-sample exercise is a controlled perturbation of the simulation environment rather than an estimate of the original CPS estimand. The observed-share synthetic rows keep the empirical 1.14 percent treated share, while the balanced ablation has a treated share of 50 percent. The negative DDPM value in the ablation table is an estimated ATT in dollars, not a treatment share. It flags an important sensitivity: the conditional treated and control distributions learned from a rare-treatment sample should not be extrapolated casually to a balanced population without additional diagnostics.
## Limitations
Diffusion models do not remove judgment from simulation design. They move the judgment to preprocessing, conditioning, diagnostics, and ablation design. That is still progress, because those choices are visible and auditable.
There are also limits:
- a DDPM trained on one empirical dataset cannot invent support that is absent from that dataset;
- synthetic draws inherit sampling noise and measurement artifacts from the source data;
- postprocessing can hide model failures if it is too aggressive;
- small treatment cells remain hard because the conditional distribution is weakly learned;
- privacy is not automatic and should not be claimed without a separate privacy analysis.
Finally, DDPMs are not always better than WGANs. If the table is small, smooth, and continuous, a WGAN or normalizing flow may be enough. If fast generation is essential, WGANs have a real advantage. If exact likelihood is required, flows are more natural. The case for DDPMs is strongest when the simulation target has complicated support, mixed variable types, and estimator-relevant dependence patterns that adversarial training struggles to preserve.
## Conclusion
The AIMM paper made an important methodological point: Monte Carlo studies can be made more credible by learning simulation designs from real applications. The next step is to broaden both the search over simulation objects and the search over baselines. Sometimes a resampling design will be enough. Sometimes a copula or flow will be the cleanest fit. Sometimes a WGAN will match a target estimand well even if it misses the broader row geometry. And sometimes a DDPM will be preferable because it preserves more of the dependence structure needed for later ablations.
That is why the right evaluation target is not simply whether one model wins a horse race on one downstream estimand. The central question is whether the learned design preserves the empirical geometry that makes the estimator succeed or fail, and whether it supports controlled perturbations of that geometry. On that criterion, diffusion is promising, but the more durable contribution is the design framework rather than the claim that one generator uniformly dominates all others.
## References