Diffusion Models for Realistic Tabular Simulation

Author

Apoorva Lal

Published

May 26, 2026

Abstract

Monte Carlo studies in econometrics are often analytically transparent but empirically thin: functional form, covariate distribution, error distribution, treatment assignment, and support restrictions are chosen for tractability rather than for fidelity to the empirical problems that make estimators fail. Athey, Imbens, Metzger, and Munro (2024) propose learning simulation designs from real datasets with Wasserstein GANs and then perturbing the fitted environment (Athey et al. 2024). This memo reruns that idea on local Lalonde/CPS data and compares a conditional WGAN-GP with a conditional DDPM under the same preprocessing, treatment vector, diagnostics, and regression-adjustment estimand. The useful conclusion is not that DDPM uniformly dominates WGAN. In these runs, WGAN is closer on the regression-adjustment ATT, while DDPM is much closer on the joint row-distribution metrics. For simulation design, that split is the point: a learned DGP should be judged by whether it preserves the empirical geometry that makes estimators succeed or fail, not only by whether one downstream estimand happens to line up in one draw.

Evidence snapshot

The rendered results below are based on executed Quarto/Python chunks, not just prose. The source data are the local ds-wgan Lalonde/CPS files: cps.feather and experimental.feather. The convenience cps_merged.feather file is reconstructed from those sources when needed; it is not a separate data requirement.

Key outputs from this render:

Small sklearn held-out checks: DDPM beats WGAN on sliced Wasserstein for iris, wine, breast-cancer, and diabetes feature tables; simple train-resampling remains competitive and is often better than both neural generators.
Lalonde/CPS single draw: observed regression-adjustment ATT is $690. WGAN synthetic ATT is $325; DDPM synthetic ATT is $1,888. On the same draw, DDPM is much closer on row-distribution metrics, with sliced Wasserstein 0.928 versus WGAN 3.774.
Repeated Lalonde/CPS draws (B = 12): WGAN has lower mean absolute ATT error ($356 vs $773), while DDPM has much lower mean sliced Wasserstein (0.882 vs 3.684) and lower marginal/correlation/covariance distances.
Treatment-share ablation: changing the treatment vector from the observed 1.14% treated share to a balanced 500/500 design moves the DDPM ATT from $1,888 to $196, illustrating how learned simulators can be used for controlled design perturbations rather than only synthetic-data mimicry.

Unsupported baselines are kept in their lane: copulas and flows are discussed as next baselines to implement, not as results; ICL is only a qualitative stress-test foil unless it is quantitatively scored.

Introduction

Suppose an applied study observes rows

\[ X_i = (Y_i, D_i, W_i), \qquad i=1,\ldots,n, \]

where $Y_i$ is an outcome, $D_i$ is a treatment or policy exposure, and $W_i$ collects pre-treatment covariates. A traditional Monte Carlo design specifies a distribution $P_0$ for $X_i$ and then evaluates estimators under repeated draws from $P_0$. The credibility of the exercise rests on whether $P_0$ captures the parts of the empirical problem that matter for the estimators being compared: overlap, tail behavior, discreteness, collinearity, heteroskedasticity, treatment imbalance, and nonlinear dependence.

The AIMM paper reframes the problem as a learned simulation design. Let $P_n$ be the empirical distribution of a real dataset. Fit a generator $\hat G$ so that samples from $\hat P_G$ look like draws from $P_n$. Then use $\hat P_G$ as the baseline simulation environment. This does not estimate structural primitives; it reduces arbitrary choices in simulation design while preserving the empirical geometry of the application.

This note uses the same objective and asks how the fitting engine changes when the adversarial WGAN is replaced by a denoising diffusion model.

What Is the Simulation Object?

Before comparing generators, it helps to separate three distinct simulation objects.

Design A: joint row simulation. Learn

\[ P(Y, D, W). \]

This is the most literal generative benchmark. It tries to reproduce the observed joint table and is useful when treatment prevalence and selection are part of the object of interest.

Design B: conditional row simulation with fixed treatment. Learn

\[ P(Y, W \mid D), \]

then hold the observed treatment vector fixed when generating synthetic datasets. This is the design used in the Lalonde exercise below. It isolates the fitted conditional row distribution from changes in treatment prevalence.

Design C: outcome simulation with a separate assignment model. Learn

\[ P(Y \mid D, W) \quad\text{and}\quad P(D \mid W) \]

separately, then intervene on the assignment mechanism or on the outcome surface. This is the right object when the simulation is about overlap deterioration, targeting rules, or policy counterfactuals rather than about reproducing one realized assignment vector.

The point of this distinction is that different baselines are good at different objects. Simple resampling works well when Design B is enough. Copulas or flows are natural when likelihood structure matters. WGANs and DDPMs become most useful when the target is a complicated joint row distribution with realistic dependence and when one wants structured ablations after fitting.

Learned Simulation Designs

A GAN has two networks. A generator maps noise $Z$ into synthetic rows:

\[ \tilde X = G_\theta(Z), \qquad Z \sim P_Z. \]

A discriminator or critic scores whether rows look real. In a Wasserstein GAN, the population objective is

\[ \min_\theta \max_{f \in \mathcal{F}_1} \left\{ \mathbb{E}_{X \sim P_n}[f(X)] - \mathbb{E}_{Z \sim P_Z}[f(G_\theta(Z))] \right\}, \]

where $\mathcal{F}_1$ is the class of 1-Lipschitz functions (Arjovsky, Chintala, and Bottou 2017). In practice, the Lipschitz constraint is enforced by weight clipping or a gradient penalty (Gulrajani et al. 2017).

For econometric simulation this has several attractions:

the generator is fast once trained;
it represents complicated row distributions without a parametric likelihood;
conditional generation can be implemented by feeding treatment, cohort, or market labels into the generator;
the Wasserstein critic gives a training signal with a distributional interpretation.

But the adversarial game is also the main cost. The generator improves only through the critic’s current gradients. If the critic is weak, the generator learns the wrong directions. If the critic is too strong or poorly regularized, training can become unstable. Mode collapse is a particularly bad failure for simulation studies because the generated sample can look locally plausible while missing subpopulations that determine estimator risk.

The optimizer choice matters because the problem is a saddle-point problem, not ordinary risk minimization. Simultaneous Adam is convenient, but the game can rotate around equilibria rather than converge. Optimistic Adam, mirror descent, and other game-aware updates are natural modifications because they use information about recent gradients to damp these rotations. That is a repair to the WGAN fitting problem, not a change in the underlying object being learned.

Denoising Diffusion Models

A denoising diffusion probabilistic model starts from the opposite direction. Instead of asking a generator to fool a critic, it defines a known corruption process that gradually turns data into noise. The model then learns to reverse that corruption.

Let $x_0 \in \mathbb{R}^d$ denote a standardized data row. Choose a noise schedule

\[ 0 < \beta_1 < \cdots < \beta_T < 1, \]

and define

\[ \alpha_t = 1-\beta_t, \qquad \bar\alpha_t = \prod_{s=1}^t \alpha_s. \]

The forward corruption process is

\[ q(x_t \mid x_{t-1}) = \mathcal{N}\left( \sqrt{\alpha_t}x_{t-1}, \beta_t I \right). \]

By Gaussian algebra, one can sample $x_t$ directly from $x_0$:

\[ x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\epsilon, \qquad \epsilon \sim \mathcal{N}(0,I). \]

The model is a denoiser $\epsilon_\theta(x_t,t,c)$, where $c$ is optional context. In tabular simulations, $c$ might be treatment status, market, cohort, site, panel time, or a policy environment. The simple DDPM training objective is

\[ \min_\theta \mathbb{E}_{x_0,t,\epsilon} \left[ \left\| \epsilon - \epsilon_\theta(x_t,t,c) \right\|_2^2 \right]. \]

This is ordinary supervised learning. The label is the known noise used to corrupt the clean row. Ho, Jain, and Abbeel show how this objective arises from a variational bound and connects to denoising score matching (Ho, Jain, and Abbeel 2020).

Sampling reverses the chain. Start with

\[ x_T \sim \mathcal{N}(0,I), \]

and recursively apply

\[ \mu_\theta(x_t,t,c) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}} \epsilon_\theta(x_t,t,c) \right). \]

Then draw

\[ x_{t-1} = \mu_\theta(x_t,t,c) + \sigma_t z_t, \qquad z_t \sim \mathcal{N}(0,I), \]

with $\sigma_t$ set by the chosen reverse variance. After reaching $x_0$, invert the tabular preprocessing map and apply any column constraints: binary rounding, nonnegativity, bounded support, or category decoding.

Density Modeling Interpretation

The DDPM is a density model built from many noisy versions of the data. The forward process defines a family of smoothed densities $p_t(x)$, where $p_0$ is the data distribution and $p_T$ is close to a standard Gaussian. The denoiser learns a vector field that points noisy observations back toward regions of high probability under $p_t$.

The score of a density is

\[ s_t(x) = \nabla_x \log p_t(x). \]

For the Gaussian corruption step,

\[ \nabla_{x_t}\log q(x_t \mid x_0) = - \frac{x_t-\sqrt{\bar\alpha_t}x_0}{1-\bar\alpha_t} = - \frac{\epsilon}{\sqrt{1-\bar\alpha_t}}. \]

Therefore a noise predictor implies a score estimate:

\[ s_\theta(x_t,t,c) \approx - \frac{\epsilon_\theta(x_t,t,c)} {\sqrt{1-\bar\alpha_t}}. \]

This is the link to score-based generative modeling. Song, Sohl-Dickstein, Kingma, Kumar, Ermon, and Poole formulate the same idea in continuous time: add noise through a stochastic differential equation and sample backward using the estimated score field (Y. Song et al. 2021). DDPMs are one discrete-time implementation of that broader score-based idea.

The econometric analogy is useful. A likelihood model specifies $p_\theta(x)$ and maximizes $\sum_i \log p_\theta(X_i)$. A score-based model learns the local derivative $\nabla_x \log p_t(x)$ along many noisy versions of the distribution. It does not need a normalized likelihood. For simulation, this is enough: if the learned reverse chain approximately transports Gaussian noise into the empirical row distribution, then it supplies draws for a Monte Carlo design.

Tabular Rows

The original DDPM success story is image generation, but the row-distribution problem is different. In image data, all coordinates are intensities with spatial structure. In an econometric row, coordinates may be continuous, binary, categorical, censored, top-coded, or logically constrained.

A practical tabular DDPM therefore needs a representation map

\[ z = S(x), \]

where $S$ standardizes continuous variables, encodes categorical variables, and records constraints needed to invert the transformation. Continuous variables can be diffused with Gaussian noise. Binary and categorical variables can be handled in several ways:

keep them as relaxed numeric variables during diffusion and round after sampling;
use separate output heads and postprocessing rules;
use discrete diffusion kernels designed for categorical state spaces.

The last option is conceptually cleaner. Multinomial diffusion and argmax flows explicitly model categorical variables (Hoogeboom et al. 2021). Structured denoising diffusion models in discrete state spaces extend the corruption process beyond uniform replacement kernels and provide a general D3PM framework (Austin et al. 2021). TabDDPM combines Gaussian diffusion for numerical features with multinomial diffusion for categorical features and shows that diffusion can be competitive for general tabular data (Kotelnikov et al. 2023).

For econometric simulation, the choice should be governed by the role of each variable. Rounding a binary demographic indicator may be acceptable in a first benchmark. Rounding treatment status is usually not acceptable if the estimand conditions on the realized assignment mechanism. In that case, treatment should be fixed as context or generated by a separate assignment model.

Baseline Search: WGANs, Flows, Resampling, and ICL

The AIMM objective is to build realistic simulation designs from empirical datasets. Both WGANs and DDPMs are plausible tools. They differ mainly in what is hard.

Feature	WGAN simulation	DDPM simulation
Training problem	Minimax game between generator and critic	Supervised denoising objective
Main tuning risk	Critic strength, Lipschitz penalty, optimizer dynamics, mode collapse	Noise schedule, denoiser capacity, sampling steps, preprocessing
Sampling speed	Fast single generator pass	Slower iterative denoising
Diagnostics	Critic loss plus downstream distribution checks	Denoising loss plus downstream distribution checks
Mode coverage	Can be fragile when generator collapses	Usually strong because training sees noisy neighborhoods at all scales
Conditional designs	Feed context to generator and critic	Feed context to denoiser at every reverse step
Ablations	Possible but can destabilize the adversarial game	Natural: intervene on context, corruption path, or reverse constraints

This comparison explains why DDPMs are attractive for econometric practice. The fitting problem is closer to nonparametric regression than to adversarial equilibrium computation. Every training example supplies a known target $\epsilon$. The model is trained across many local perturbations of each row, which tends to preserve support and neighborhood structure. For a simulation study where the cost of a missed subpopulation can be high, that stability is valuable.

The tradeoff is sampling cost. A WGAN can generate $100{,}000$ rows with one forward pass per row. A DDPM may need tens or hundreds of denoising steps. Improved DDPMs learn reverse variances and improve the likelihood/sample-quality tradeoff (Nichol and Dhariwal 2021). DDIMs reduce the number of sampling steps by using a deterministic or partially deterministic non-Markovian sampler with the same training objective (J. Song, Meng, and Ermon 2021). In practice, a tabular simulation design can usually tolerate slower sampling because generating the synthetic benchmark is not the bottleneck relative to fitting many downstream estimators.

Resampling and Classical Tabular Baselines

A useful benchmark set should start below deep generative models. If the goal is Design B above, then a stratified bootstrap or nearest-neighbor resampling baseline is hard to beat on transparency. It preserves exact support, needs almost no tuning, and often reproduces the empirical treatment mix and sparse cells better than a badly tuned neural generator.

A second baseline family is low-dimensional parametric or semiparametric approximation: Gaussian copulas, rank-Gaussian transforms, or factor models fit to the continuous margin and then combined with empirical or logistic models for discrete columns. These baselines are cheap, interpretable, and often surprisingly competitive on moderate-dimensional tabular problems. They are weak when the empirical design depends on multimodality, sharp support boundaries, or treatment-specific dependence that is not well summarized by a copula.

So the practical search order for a new simulation study should usually be:

resampling / nearest-neighbor baselines;
copula-style likelihood baselines;
neural tabular generators such as WGANs, flows, and DDPMs;
prompt-only in-context generation as a stress test rather than a default production baseline.

That ordering keeps the learned-design agenda honest. A deep generator should clear simple baselines on the dimensions that matter for the estimator, not just on the dimensions that look impressive in a generative-model table.

Normalizing Flows

Normalizing flows are another natural density-modeling candidate. A flow constructs an invertible map $f_\theta$ such that

\[ z = f_\theta(x), \qquad z \sim p_Z, \]

and evaluates the likelihood by the change-of-variables formula:

\[ \log p_\theta(x) = \log p_Z(f_\theta(x)) + \log \left| \det \frac{\partial f_\theta(x)}{\partial x} \right|. \]

This gives exact likelihoods and direct sampling, which is attractive for statistical work (Papamakarios et al. 2021). The cost is architectural: $f_\theta$ must be invertible and have a tractable Jacobian determinant. Those constraints can make flows awkward for mixed discrete-continuous tabular rows, although many useful constructions exist.

Diffusion models relax invertibility. They do not give exact likelihood as directly as a flow, although likelihood bounds and probability-flow ODEs are available in the broader score-based framework. For simulation design, exact likelihood is less important than faithful reproduction of the empirical row distribution and stable conditional sampling. That tilts the practical comparison toward DDPMs unless likelihood evaluation itself is part of the research design.

In-Context Learning as a Baseline

A different comparison class is prompt-only generation from a frontier language model. The attraction is obvious: no training loop, no architecture search, and very flexible conditioning. Give the model a few rows, ask for more rows in the same format, and it can often imitate local patterns surprisingly well.

That makes in-context learning (ICL) interesting as a diagnostic baseline, especially for tiny tabular designs or for structured what-if prompts. But it also has obvious weaknesses for econometric simulation. The model has no explicit likelihood, no stable sampling semantics, and no guarantee that repeated generations trace a coherent joint distribution. Its errors are often subtle: support drift, over-smoothing, and synthetic rows that look individually plausible but do not add up to the right treatment geometry or tail mass.

To make that concrete, I prompted gpt-5.4 and gpt-5.4-mini with eight observed Lalonde/CPS-style rows and asked for eight additional synthetic rows in CSV only. The models returned:

gpt-5.4

t,age,education,black,hispanic,married,nodegree,re74,re75,re78
1.0,24.0,10.0,1.0,0.0,0.0,1.0,0.0,0.0,4210.337
1.0,31.0,12.0,0.0,1.0,1.0,0.0,0.0,1567.442,10854.229
1.0,20.0,9.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1.0,27.0,11.0,0.0,0.0,0.0,1.0,1843.556,0.0,7321.884
0.0,39.0,12.0,0.0,0.0,1.0,0.0,18654.22,19487.63,22105.48
0.0,23.0,13.0,0.0,1.0,0.0,0.0,6421.318,7135.904,12844.67
0.0,41.0,10.0,1.0,0.0,1.0,1.0,11254.77,9873.441,15432.98
0.0,29.0,15.0,0.0,0.0,0.0,0.0,14387.56,16842.11,19765.34

gpt-5.4-mini

t,age,education,black,hispanic,married,nodegree,re74,re75,re78
1.0,31.0,10.0,1.0,0.0,0.0,1.0,0.0,0.0,7421.563
1.0,24.0,12.0,0.0,1.0,0.0,0.0,1250.0,0.0,4688.214
1.0,40.0,8.0,1.0,0.0,1.0,1.0,0.0,0.0,10345.77
1.0,27.0,13.0,0.0,0.0,0.0,0.0,540.25,1120.5,6890.332
0.0,38.0,12.0,1.0,0.0,1.0,0.0,18420.75,20110.34,21987.6
0.0,29.0,9.0,0.0,1.0,0.0,1.0,4320.18,3895.62,5120.44
0.0,50.0,10.0,1.0,0.0,1.0,1.0,16750.0,15890.25,17120.89
0.0,23.0,15.0,0.0,0.0,0.0,0.0,6200.0,8450.5,12675.31

These completions are not ridiculous. They respect basic support constraints, preserve the visual distinction between treated and control earnings histories, and produce rows that a human reader might accept on a skim. But that is exactly why ICL is a useful foil. The outputs are plausible at the row level while remaining uncalibrated at the distribution level. There is no guarantee that they preserve the dependence structure that makes an estimator difficult, and no clean way to map prompt engineering choices into a simulation design argument.

For now, ICL is best treated as a lightweight stress test or idea generator. It is not yet a replacement for a fitted simulator with explicit diagnostics, reproducible sampling, and a well-defined conditioning object.

Trex API Surface

Trex exposes WGAN and DDPM simulators through the same minimal interface: fit on a matrix of transformed tabular rows, optionally condition on a context matrix, and sample synthetic transformed rows. The fitted rows are transformed back to data scale with TabularTransformer.

For the WGAN benchmark, the class is trex.TabularWGAN:

TabularWGAN(
    hidden_dims=(128, 128, 128),
    critic_hidden_dims=None,
    noise_dim=None,
    batch_size=128,
    max_steps=1000,
    critic_steps=5,
    lr=1e-4,
    optimizer="adam",        # or "optimistic_adam"
    gp_weight=5.0,
    generator_dropout=0.1,
    critic_dropout=0.0,
    binary_dims=(),
    lower_bounds=None,
    upper_bounds=None,
    seed=None,
    device=None,
)

wgan.fit(X, context=None)
fake_z = wgan.sample(n, context=None)

For the DDPM benchmark, the class is trex.TabularDiffusion:

TabularDiffusion(
    hidden_dims=(128, 128, 128),
    time_dim=32,
    n_timesteps=100,
    beta_start=1e-4,
    beta_end=0.02,
    batch_size=128,
    max_steps=1000,
    lr=1e-3,
    dropout=0.0,
    seed=None,
    device=None,
)

ddpm.fit(X, context=None)
fake_z = ddpm.sample(n, context=None)

A typical tabular call is:

transformer = TabularTransformer(
    column_names=columns,
    binary_columns=binary_columns,
    nonnegative_columns=nonnegative_columns,
).fit(df[columns])

X = transformer.transform(df[columns])
context = df[["t"]].to_numpy()

ddpm = TabularDiffusion(max_steps=1000, seed=123, device="cuda")
ddpm.fit(X, context=context)

fake = transformer.inverse_transform(
    ddpm.sample(len(df), context=context).numpy()
)

This branch does not currently expose a TabularFlow class. A normalizing-flow simulator would fit the same interface naturally:

flow.fit(X, context=None)
fake_z = flow.sample(n, context=None)
logp = flow.log_prob(X, context=None)

The extra method is log_prob, since exact or tractable likelihood evaluation is the main reason to add a flow baseline. The implementation choice would then be a conditional spline flow, masked autoregressive flow, or coupling flow depending on whether tabular likelihood or fast sampling is the priority.

Distance Metrics

Let $X_1,\ldots,X_n \in \mathbb{R}^d$ denote observed rows and $\tilde X_1,\ldots,\tilde X_m \in \mathbb{R}^d$ denote synthetic rows. All diagnostics below are computed after the same preprocessing convention used for the simulation exercise; in the Lalonde application, earnings are measured in thousands of dollars. Lower values indicate closer agreement.

These are still generator-native diagnostics, so they need interpretation. In a causal simulation study, the object is not generic sample quality. The object is whether the synthetic design preserves overlap, treatment imbalance, leverage points, tail behavior, and the nuisance-function geometry that drives finite-sample estimator risk. The metrics below are therefore proxies, not sufficient statistics. A good simulation design should be able to explain which feature each metric is meant to guard.

For coordinate $j$, the one-dimensional Wasserstein distance is

\[ W_{1j} = \int_0^1 \left| \hat F_{j}^{-1}(u) - \hat G_{j}^{-1}(u) \right|du, \]

where $\hat F_j$ and $\hat G_j$ are the empirical CDFs of observed and synthetic column $j$. The reported marginal_w1_mean and marginal_w1_max are

\[ \frac{1}{d}\sum_{j=1}^d W_{1j}, \qquad \max_{1 \leq j \leq d} W_{1j}. \]

The Kolmogorov-Smirnov distance for coordinate $j$ is

\[ K_j = \sup_x \left| \hat F_j(x) - \hat G_j(x) \right|. \]

The reported marginal_ks_mean and marginal_ks_max average and maximize $K_j$ across columns. The KS distance is scale-free, while Wasserstein distance preserves the units of each variable.

The mean-distance diagnostic is

\[ \text{mean\_l2} = \left\| \bar X - \bar{\tilde X} \right\|_2. \]

It is a location check. It can be small even when tails or dependence are wrong.

Let $\hat\Sigma_X$ and $\hat\Sigma_{\tilde X}$ be empirical covariance matrices. The covariance diagnostic is

\[ \text{cov\_frobenius} = \left\| \hat\Sigma_X - \hat\Sigma_{\tilde X} \right\|_F. \]

Because covariance is measured in original units, high-variance earnings columns can dominate it. The correlation diagnostic repeats the same calculation after converting covariance matrices into correlation matrices:

\[ \text{corr\_frobenius} = \left\| \hat R_X - \hat R_{\tilde X} \right\|_F. \]

This is the cleaner summary of linear dependence across mixed-scale columns.

Finally, sliced Wasserstein averages one-dimensional Wasserstein distances over random projections. For a unit vector $u \in \mathbb{S}^{d-1}$, project rows to $u'X_i$ and $u'\tilde X_i$. The population target is

\[ \text{SW}_1(X,\tilde X) = \mathbb{E}_{u \sim \operatorname{Unif}(\mathbb{S}^{d-1})} \left[ W_1(u'X, u'\tilde X) \right]. \]

The reported value approximates this expectation with random directions. This is useful in tabular simulation because it can detect joint-distribution failures that are invisible in marginal diagnostics, while remaining much cheaper than exact high-dimensional optimal transport.

Simulation Protocol

A useful DDPM simulator for econometrics should make the target distribution explicit. The following workflow is close to the AIMM proposal but swaps the fitting engine:

Choose the empirical target: unconditional rows, rows conditional on treatment, rows within market, or a panel transition distribution.
Define a preprocessing map $S$ and record all inverse transformations.
Decide which variables are generated and which are fixed as context.
Fit a denoiser with a transparent noise schedule and held-out diagnostics.
Generate synthetic samples with the same sample size, treatment shares, panel length, or market composition as the empirical design.
Report distances between real and generated data: marginal Wasserstein, KS distances, mean distance, covariance and correlation Frobenius norms, sliced Wasserstein, and downstream estimand replication.
Run estimator comparisons on many generated datasets.
Add ablations by changing one empirical feature at a time: overlap, tails, treatment imbalance, covariance, policy rules, or outcome noise.

The ablation step is where diffusion can be especially useful. Because generation is a gradual reverse process, one can condition, guide, or project intermediate draws. Examples include:

holding treatment fixed and generating only $(Y,W)$ conditional on $D$;
matching marginal treatment shares exactly while allowing covariates and outcomes to vary;
projecting generated rows back into known support restrictions;
increasing or decreasing tail noise to stress-test robust estimators;
conditioning on policy environments to compare counterfactual regimes.

The relevant standard is practical: does the DDPM improve the credibility of finite-sample estimator comparisons?

Empirical Illustration

Iris Sanity Check

The Iris dataset is too small and too clean to be an econometric benchmark. It is useful here because it lets us see the mechanics without hiding behind a large application. The row is

\[ x_i = (\text{sepal length}, \text{sepal width}, \text{petal length}, \text{petal width}, \text{species}). \]

The exercise fixes the species counts and learns the conditional distribution of the four continuous measurements given species. This is the same design pattern one would use when treatment status or site membership is part of the simulation design rather than a variable the generator is allowed to change.

Code

from __future__ import annotations

import sys
import time
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import torch
from scipy.stats import wasserstein_distance
from sklearn.datasets import load_breast_cancer, load_diabetes, load_iris, load_wine
from sklearn.model_selection import train_test_split

TREX_ROOT = Path("/Users/alal/Desktop/code/trex")
if str(TREX_ROOT) not in sys.path:
    sys.path.insert(0, str(TREX_ROOT))

from trex import TabularDiffusion, TabularTransformer, TabularWGAN, distribution_metrics

SEED = 20260526
rng = np.random.default_rng(SEED)
torch.manual_seed(SEED)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
RESULTS_DIR = Path("results")
RESULTS_DIR.mkdir(exist_ok=True)

def one_hot(values: pd.Series | np.ndarray) -> np.ndarray:
    values = np.asarray(values, dtype=int)
    levels = np.sort(np.unique(values))
    return np.column_stack([(values == level).astype(float) for level in levels])

def tidy_metrics(real: pd.DataFrame, fake: pd.DataFrame, columns: list[str], label: str) -> pd.DataFrame:
    out = distribution_metrics(real[columns].to_numpy(), fake[columns].to_numpy(), seed=SEED)
    return pd.DataFrame([{"sample": label, **out}])

Code

iris = load_iris(as_frame=True)
iris_df = iris.frame.copy()
iris_df.columns = [
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "species",
]
iris_df["species"] = iris_df["species"].astype(int)

iris_features = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
iris_context = one_hot(iris_df["species"])

iris_transformer = TabularTransformer(
    column_names=iris_features,
    nonnegative_columns=iris_features,
).fit(iris_df[iris_features])
iris_z = iris_transformer.transform(iris_df[iris_features])
iris_lower, iris_upper = iris_transformer.transformed_bounds()

iris_wgan = TabularWGAN(
    hidden_dims=(64, 64),
    critic_hidden_dims=(64, 64),
    batch_size=64,
    max_steps=500,
    critic_steps=2,
    lr=1e-4,
    optimizer="optimistic_adam",
    lower_bounds=iris_lower,
    upper_bounds=iris_upper,
    seed=SEED,
    device=DEVICE,
)
start = time.perf_counter()
iris_wgan.fit(iris_z, context=iris_context)
iris_wgan_seconds = time.perf_counter() - start

iris_wgan_z = iris_wgan.sample(len(iris_df), context=iris_context).numpy()
iris_wgan_fake = pd.DataFrame(
    iris_transformer.inverse_transform(iris_wgan_z),
    columns=iris_features,
)
iris_wgan_fake["species"] = iris_df["species"].to_numpy()

iris_ddpm = TabularDiffusion(
    hidden_dims=(64, 64),
    n_timesteps=50,
    max_steps=700,
    batch_size=64,
    lr=1e-3,
    seed=SEED,
    device=DEVICE,
)
start = time.perf_counter()
iris_ddpm.fit(iris_z, context=iris_context)
iris_seconds = time.perf_counter() - start

iris_fake_z = iris_ddpm.sample(len(iris_df), context=iris_context).numpy()
iris_fake = pd.DataFrame(
    iris_transformer.inverse_transform(iris_fake_z),
    columns=iris_features,
)
iris_fake["species"] = iris_df["species"].to_numpy()

iris_metrics = pd.concat(
    [
        tidy_metrics(
            iris_df,
            iris_wgan_fake,
            [*iris_features, "species"],
            f"WGAN synthetic, {iris_wgan_seconds:.1f}s",
        ),
        tidy_metrics(
            iris_df,
            iris_fake,
            [*iris_features, "species"],
            f"DDPM synthetic, {iris_seconds:.1f}s",
        ),
    ],
    ignore_index=True,
)
iris_metrics.to_csv(RESULTS_DIR / "iris_metrics.csv", index=False)
iris_metrics[[
    "sample",
    "marginal_w1_mean",
    "marginal_ks_mean",
    "mean_l2",
    "cov_frobenius",
    "corr_frobenius",
    "sliced_wasserstein",
]].round(4).style.hide(axis="index")

sample	marginal_w1_mean	marginal_ks_mean	mean_l2	cov_frobenius	corr_frobenius	sliced_wasserstein
WGAN synthetic, 1.2s	0.524800	0.349300	0.848500	4.568700	2.685100	0.533200
DDPM synthetic, 0.2s	0.116400	0.126700	0.123300	1.367900	0.613000	0.131300

Code

species_names = dict(enumerate(iris.target_names))
fig, axes = plt.subplots(1, 3, figsize=(9, 3.4), sharex=True, sharey=True, constrained_layout=True)

for ax, data, title in [
    (axes[0], iris_df, "Observed Iris"),
    (axes[1], iris_wgan_fake, "WGAN synthetic"),
    (axes[2], iris_fake, "DDPM synthetic"),
]:
    for species, group in data.groupby("species"):
        ax.scatter(
            group["sepal_length"],
            group["petal_length"],
            s=28,
            alpha=0.75,
            label=species_names[int(species)],
        )
    ax.set_title(title)
    ax.set_xlabel("sepal length")
axes[0].set_ylabel("petal length")
axes[2].legend(frameon=False, loc="lower right")
plt.show()

The Iris example is deliberately modest. Its diagnostic role is to check whether small conditional generators preserve species-conditional clusters and produce plausible continuous measurements. The WGAN panel shows the adversarial baseline; the DDPM panel shows the same task fit through supervised denoising.

Held-Out Distance Benchmarks

For reporting generator quality, a held-out comparison is usually preferable to comparing synthetic rows against the same empirical rows used for fitting. The in-sample distance answers whether the generator can reproduce the training distribution. The held-out distance asks a more relevant question: after fitting on one sample, how close are generated rows to new rows from the same empirical source?

The following benchmark uses small numeric datasets bundled with scikit-learn. For each dataset, the models fit only the training rows. They then generate the same number of rows as the test split, and distances are computed against the held-out test rows. A train-resample baseline draws rows with replacement from the training set; it is not a learned generator, but it gives a useful finite-sample reference.

Code

def sklearn_frame(loader, name: str) -> pd.DataFrame:
    data = loader(as_frame=True)
    frame = data.frame.copy()
    if "target" in frame.columns:
        frame = frame.drop(columns=["target"])
    frame.columns = [f"x{j}" for j in range(frame.shape[1])]
    frame = frame.astype(float)
    frame.attrs["dataset_name"] = name
    return frame

sklearn_datasets = [
    sklearn_frame(load_iris, "iris features"),
    sklearn_frame(load_wine, "wine features"),
    sklearn_frame(load_breast_cancer, "breast cancer features"),
    sklearn_frame(load_diabetes, "diabetes features"),
]

def fit_small_generators(train: pd.DataFrame, test: pd.DataFrame, seed: int) -> pd.DataFrame:
    columns = list(train.columns)
    transformer = TabularTransformer(
        column_names=columns,
        nonnegative_columns=columns,
    ).fit(train)
    z_train = transformer.transform(train)
    lower, upper = transformer.transformed_bounds()
    n_test = len(test)

    wgan_model = TabularWGAN(
        hidden_dims=(48, 48),
        critic_hidden_dims=(48, 48),
        batch_size=min(128, len(train)),
        max_steps=250,
        critic_steps=2,
        lr=1e-4,
        optimizer="optimistic_adam",
        lower_bounds=lower,
        upper_bounds=upper,
        seed=seed,
        device=DEVICE,
    )
    ddpm_model = TabularDiffusion(
        hidden_dims=(48, 48),
        n_timesteps=50,
        batch_size=min(128, len(train)),
        max_steps=300,
        lr=1e-3,
        seed=seed,
        device=DEVICE,
    )

    wgan_model.fit(z_train)
    ddpm_model.fit(z_train)

    train_resample = train.sample(n_test, replace=True, random_state=seed).reset_index(drop=True)
    wgan_fake = pd.DataFrame(
        transformer.inverse_transform(wgan_model.sample(n_test).numpy()),
        columns=columns,
    )
    ddpm_fake = pd.DataFrame(
        transformer.inverse_transform(ddpm_model.sample(n_test).numpy()),
        columns=columns,
    )

    rows = []
    for label, fake in [
        ("train resample", train_resample),
        ("WGAN", wgan_fake),
        ("DDPM", ddpm_fake),
    ]:
        metrics = distribution_metrics(
            test[columns].to_numpy(),
            fake[columns].to_numpy(),
            n_projections=64,
            seed=seed,
        )
        rows.append({"model": label, **metrics})
    return pd.DataFrame(rows)

heldout_rows = []
for k, frame in enumerate(sklearn_datasets):
    train, test = train_test_split(frame, test_size=0.30, random_state=SEED + k)
    result = fit_small_generators(
        train.reset_index(drop=True),
        test.reset_index(drop=True),
        SEED + 100 * k,
    )
    result.insert(0, "dataset", frame.attrs["dataset_name"])
    result.insert(1, "n_train", len(train))
    result.insert(2, "n_test", len(test))
    heldout_rows.append(result)

heldout_table = pd.concat(heldout_rows, ignore_index=True)
heldout_table.to_csv(RESULTS_DIR / "sklearn_heldout_metrics.csv", index=False)
heldout_table[[
    "dataset",
    "n_train",
    "n_test",
    "model",
    "marginal_w1_mean",
    "marginal_ks_mean",
    "mean_l2",
    "corr_frobenius",
    "sliced_wasserstein",
]].round(4).style.hide(axis="index")

dataset	n_train	n_test	model	marginal_w1_mean	marginal_ks_mean	mean_l2	corr_frobenius	sliced_wasserstein
iris features	105	45	train resample	0.185600	0.144400	0.092700	0.526400	0.180800
iris features	105	45	WGAN	0.753200	0.594400	1.368400	1.447100	0.695200
iris features	105	45	DDPM	0.306700	0.238900	0.119100	0.469100	0.294000
wine features	124	54	train resample	7.238800	0.170900	85.463200	2.266000	19.293800
wine features	124	54	WGAN	18.941500	0.539900	113.674900	6.517400	50.155600
wine features	124	54	DDPM	4.880900	0.215100	9.460400	3.011400	12.390200
breast cancer features	398	171	train resample	6.070300	0.092400	105.952600	3.840500	16.305200
breast cancer features	398	171	WGAN	20.341700	0.484400	171.726800	12.778800	55.228900
breast cancer features	398	171	DDPM	9.368900	0.158100	69.841300	12.734900	33.138300
diabetes features	309	133	train resample	0.006400	0.096200	0.013100	0.971300	0.006200
diabetes features	309	133	WGAN	0.038700	0.564700	0.041000	3.948300	0.037600
diabetes features	309	133	DDPM	0.026900	0.557100	0.059800	2.294300	0.025000

Code

fig, ax = plt.subplots(figsize=(8, 3.6), constrained_layout=True)
plot_data = heldout_table.pivot(index="dataset", columns="model", values="sliced_wasserstein")
plot_data = plot_data[["train resample", "WGAN", "DDPM"]]
x = np.arange(len(plot_data))
width = 0.24
for j, model in enumerate(plot_data.columns):
    ax.bar(x + (j - 1) * width, plot_data[model], width=width, label=model)
ax.set_xticks(x)
ax.set_xticklabels(plot_data.index, rotation=20, ha="right")
ax.set_ylabel("sliced Wasserstein to held-out test rows")
ax.set_title("Held-out row-distribution distance")
ax.legend(frameon=False, ncols=3)
plt.show()

Held-out distances are noisy on these small datasets, but they are conceptually cleaner than in-sample distances. A generator should be judged against rows it did not see during fitting, and the train-resample baseline helps distinguish model failure from the ordinary sampling variability of a small empirical distribution.

Lalonde/CPS Design

The AIMM paper uses WGANs to build simulation designs for treatment-effect estimators. The canonical version of that problem starts from the Lalonde/Dehejia-Wahba treated sample and a large observational control pool, then asks whether nonexperimental estimators recover an experimental benchmark. The row distribution is difficult for simple parametric simulation because treatment is rare, earnings are heavy-tailed and truncated, and the relationship between demographics and earnings is not well described by a small Gaussian model.

The exercise below loads the local ds-wgan Lalonde/CPS data, scales earnings to thousands of dollars, fits a conditional WGAN and a conditional DDPM, and evaluates both as simulation DGPs. Treatment status is held fixed as context, so each generator learns the conditional distribution of demographics and earnings given treatment. If the convenience cps_merged.feather file is absent, the code reconstructs it from ds-wgan/data/original_data/cps.feather and experimental.feather; these two files are the actual source data.

Code

LDW_COLUMNS = [
    "t", "age", "education", "black", "hispanic",
    "married", "nodegree", "re74", "re75", "re78",
]
EARNINGS_COLUMNS = ["re74", "re75", "re78"]
BINARY_COLUMNS = ["black", "hispanic", "married", "nodegree"]
NONNEGATIVE_COLUMNS = ["age", "education", "re74", "re75", "re78"]
COVARIATES = ["age", "education", "black", "hispanic", "married", "nodegree", "re74", "re75"]
OUTCOME = "re78"
TREATMENT = "t"

code_root = TREX_ROOT.parent
merged_path = code_root / "dswgan-paper" / "data" / "original_data" / "cps_merged.feather"
cps_path = code_root / "ds-wgan" / "data" / "original_data" / "cps.feather"
experimental_path = code_root / "ds-wgan" / "data" / "original_data" / "experimental.feather"

if cps_path.exists() and experimental_path.exists():
    cps = pd.read_feather(cps_path)
    experimental = pd.read_feather(experimental_path)
    if list(cps.columns) != list(experimental.columns):
        raise ValueError("cps.feather and experimental.feather column mismatch")
    cps_t_values = set(pd.Series(cps[TREATMENT]).dropna().astype(float).unique().tolist())
    if cps_t_values == {0.0, 1.0}:
        lalonde_raw = cps.copy()
        data_source = "ds-wgan cps.feather, already containing treated and control rows"
    else:
        treated = experimental.loc[experimental[TREATMENT].astype(float) == 1.0]
        controls = cps.loc[cps[TREATMENT].astype(float) == 0.0]
        lalonde_raw = pd.concat([treated, controls], ignore_index=True)
        data_source = "ds-wgan experimental treated rows stacked with ds-wgan cps controls"
    merged_path.parent.mkdir(parents=True, exist_ok=True)
    lalonde_raw.reset_index(drop=True).to_feather(merged_path)
    data_path = merged_path
elif merged_path.exists():
    data_path = merged_path
    lalonde_raw = pd.read_feather(data_path)
    data_source = "fallback existing cps_merged.feather convenience file"
else:
    raise FileNotFoundError(
        "Expected local ds-wgan CPS sources at "
        f"{cps_path} and {experimental_path}; merged convenience path is {merged_path}"
    )

lalonde = lalonde_raw[LDW_COLUMNS].astype(float)
lalonde[EARNINGS_COLUMNS] = lalonde[EARNINGS_COLUMNS] / 1000.0

lalonde_summary = (
    lalonde[TREATMENT]
    .value_counts()
    .rename_axis("treatment")
    .reset_index(name="n")
    .assign(share=lambda d: d["n"] / d["n"].sum())
)
treated_n = int(lalonde[TREATMENT].sum())
control_n = int(len(lalonde) - treated_n)
treated_share = treated_n / len(lalonde)

The Lalonde/CPS sample has 16,177 rows, with 185 treated observations and 15,992 controls; the treated share is 1.14 percent. Earnings are divided by 1,000 before fitting so that the neural models are not dominated by raw dollar scales. In this render, the loaded source is recorded by data_source; the source files are the local ds-wgan CPS/experimental feathers, with cps_merged.feather only a reproducible convenience copy.

Code

provenance_table = pd.DataFrame([
    {
        "loaded_source": data_source,
        "rows": len(lalonde),
        "treated": treated_n,
        "controls": control_n,
        "treated_share": treated_share,
        "cps_source_exists": cps_path.exists(),
        "experimental_source_exists": experimental_path.exists(),
        "merged_convenience_path": str(merged_path),
    }
])
provenance_table.to_csv(RESULTS_DIR / "lalonde_data_provenance.csv", index=False)
provenance_table.style.hide(axis="index")

loaded_source	rows	treated	controls	treated_share	cps_source_exists	experimental_source_exists	merged_convenience_path
ds-wgan cps.feather, already containing treated and control rows	16177	185	15992	0.011436	True	True	/Users/alal/Desktop/code/dswgan-paper/data/original_data/cps_merged.feather

Code

def design_matrix(df: pd.DataFrame) -> np.ndarray:
    return np.column_stack([np.ones(len(df)), df[COVARIATES].to_numpy(dtype=float)])

def regression_adjustment_att(df: pd.DataFrame) -> dict[str, float | int | str]:
    data = df[LDW_COLUMNS].astype(float).copy()
    t = data[TREATMENT].to_numpy(dtype=int)
    y = data[OUTCOME].to_numpy(dtype=float)
    X = design_matrix(data)
    treated = t == 1
    control = t == 0
    if treated.sum() < 5 or control.sum() <= X.shape[1] + 1:
        return {
            "att_thousand": np.nan,
            "att_dollars": np.nan,
            "se_thousand": np.nan,
            "se_dollars": np.nan,
            "n": len(data),
            "n_treated": int(treated.sum()),
            "n_control": int(control.sum()),
            "status": "insufficient support",
        }
    beta0 = np.linalg.lstsq(X[control], y[control], rcond=None)[0]
    treated_effects = y[treated] - X[treated] @ beta0
    att = float(treated_effects.mean())
    se = float(treated_effects.std(ddof=1) / np.sqrt(treated.sum()))
    return {
        "att_thousand": att,
        "att_dollars": 1000.0 * att,
        "se_thousand": se,
        "se_dollars": 1000.0 * se,
        "n": len(data),
        "n_treated": int(treated.sum()),
        "n_control": int(control.sum()),
        "status": "ok",
    }

def result_row(name: str, df: pd.DataFrame) -> dict[str, float | int | str]:
    row = {"sample": name, **regression_adjustment_att(df)}
    row["treated_share"] = row["n_treated"] / row["n"]
    return row

Code

feature_columns = [column for column in LDW_COLUMNS if column != TREATMENT]
feature_binary = [column for column in BINARY_COLUMNS if column in feature_columns]
feature_nonnegative = [column for column in NONNEGATIVE_COLUMNS if column in feature_columns]

features = lalonde[feature_columns].copy()
context = lalonde[[TREATMENT]].to_numpy(dtype=np.float32)

lalonde_transformer = TabularTransformer(
    column_names=feature_columns,
    binary_columns=feature_binary,
    nonnegative_columns=feature_nonnegative,
).fit(features)
train_z = lalonde_transformer.transform(features)
lower, upper = lalonde_transformer.transformed_bounds()

def rebuild_lalonde(generated_features: np.ndarray, generated_t: np.ndarray) -> pd.DataFrame:
    out = pd.DataFrame(generated_features, columns=feature_columns)
    out.insert(0, TREATMENT, generated_t.reshape(-1).astype(float))
    out = out[LDW_COLUMNS].astype(float)
    for col in [TREATMENT, *BINARY_COLUMNS]:
        out[col] = (np.clip(out[col], 0.0, 1.0) >= 0.5).astype(float)
    for col in NONNEGATIVE_COLUMNS:
        out[col] = np.maximum(out[col], 0.0)
    return out

wgan = TabularWGAN(
    hidden_dims=(64, 64),
    critic_hidden_dims=(64, 64),
    batch_size=512,
    max_steps=450,
    critic_steps=2,
    lr=1e-4,
    optimizer="optimistic_adam",
    gp_weight=5.0,
    binary_dims=[feature_columns.index(col) for col in feature_binary],
    lower_bounds=lower,
    upper_bounds=upper,
    seed=SEED,
    device=DEVICE,
)
ddpm = TabularDiffusion(
    hidden_dims=(64, 64),
    n_timesteps=80,
    batch_size=512,
    max_steps=650,
    lr=1e-3,
    seed=SEED,
    device=DEVICE,
)

start = time.perf_counter()
wgan.fit(train_z, context=context)
wgan_seconds = time.perf_counter() - start

start = time.perf_counter()
ddpm.fit(train_z, context=context)
ddpm_seconds = time.perf_counter() - start

wgan_fake = rebuild_lalonde(
    lalonde_transformer.inverse_transform(wgan.sample(len(lalonde), context=context).numpy()),
    context,
)
ddpm_fake = rebuild_lalonde(
    lalonde_transformer.inverse_transform(ddpm.sample(len(lalonde), context=context).numpy()),
    context,
)

fit_times = pd.DataFrame([
    {"model": "WGAN-GP", "seconds": wgan_seconds},
    {"model": "DDPM", "seconds": ddpm_seconds},
])

The timing comparison is included only to show the practical cost of the two fitting routines in this small benchmark. It is not the main estimand of the exercise.

Code

fig, ax = plt.subplots(figsize=(5, 2.4), constrained_layout=True)
ax.bar(fit_times["model"], fit_times["seconds"])
ax.set_ylabel("seconds")
ax.set_title("Generator fit time")
plt.show()

Code

att_table = pd.DataFrame([
    result_row("Observed CPS", lalonde),
    result_row("WGAN-GP synthetic", wgan_fake),
    result_row("DDPM synthetic", ddpm_fake),
])

metric_table = pd.concat([
    tidy_metrics(lalonde, wgan_fake, LDW_COLUMNS, "WGAN-GP synthetic"),
    tidy_metrics(lalonde, ddpm_fake, LDW_COLUMNS, "DDPM synthetic"),
], ignore_index=True)

att_table.to_csv(RESULTS_DIR / "lalonde_att_table.csv", index=False)
metric_table.to_csv(RESULTS_DIR / "lalonde_metric_table.csv", index=False)
att_table.round(4).style.hide(axis="index")

sample	att_thousand	att_dollars	se_thousand	se_dollars	n	n_treated	n_control	status	treated_share
Observed CPS	0.689900	689.858100	0.598500	598.457200	16177	185	15992	ok	0.011400
WGAN-GP synthetic	0.325000	324.978700	0.093900	93.919900	16177	185	15992	ok	0.011400
DDPM synthetic	1.887900	1887.869900	0.476900	476.898700	16177	185	15992	ok	0.011400

Code

metric_table[[
    "sample",
    "marginal_w1_mean",
    "marginal_ks_mean",
    "mean_l2",
    "cov_frobenius",
    "corr_frobenius",
    "sliced_wasserstein",
]].round(4).style.hide(axis="index")

sample	marginal_w1_mean	marginal_ks_mean	mean_l2	cov_frobenius	corr_frobenius	sliced_wasserstein
WGAN-GP synthetic	2.874700	0.292200	5.596100	247.600400	3.214200	3.773500
DDPM synthetic	0.890900	0.085000	1.320100	84.297200	0.667600	0.927900

The ATT and distance tables separate two questions. The ATT table asks whether the downstream regression-adjustment estimand is similar. The distance table asks whether the generated rows resemble the empirical CPS rows before any estimator is applied. In the rendered run, the DDPM is closer on every reported distributional distance: marginal Wasserstein, marginal KS, mean distance, covariance distance, correlation distance, and sliced Wasserstein. This is the main evidence that the DDPM is giving a more faithful simulation DGP than the WGAN in this application.

The sliced Wasserstein distance is especially useful here because it is multivariate. It projects the full row vector onto many random directions and averages the one-dimensional Wasserstein distances. A generator can match each column separately while getting the joint geometry wrong; sliced Wasserstein is meant to catch that failure without solving a full high-dimensional optimal-transport problem. The Frobenius distances for covariance and correlation matrices provide a more classical second-moment version of the same check.

Code

plot_items = [
    ("Observed CPS", lalonde),
    ("WGAN-GP synthetic", wgan_fake),
    ("DDPM synthetic", ddpm_fake),
]

fig, axes = plt.subplots(1, 3, figsize=(9, 3.4), sharex=True, sharey=True, constrained_layout=True)
for ax, (title, df) in zip(axes, plot_items):
    sample = df.sample(min(4000, len(df)), random_state=SEED)
    ax.scatter(sample["re75"], sample["re78"], s=5, alpha=0.16)
    ax.set_title(title)
    ax.set_xlabel("1975 earnings, thousands")
axes[0].set_ylabel("1978 earnings, thousands")
plt.show()

One synthetic dataset is only a first pass. A generator can match the ATT and still miss the row distribution. The distance table and scatter plot are therefore part of the simulation design, not decoration. The DDPM is attractive if it gives a better joint distribution while preserving estimator-relevant features such as treatment imbalance, heavy earnings tails, and the dependence between pre-treatment and post-treatment earnings.

Repeated Simulation Results

The single synthetic draw above answers whether one fitted generator can produce a plausible CPS-like table. A simulation study also needs repeated draws from the fitted generator. The estimand and sample geometry are held fixed: every synthetic dataset has the observed sample size and the observed treatment vector. This isolates the fitted row distribution from changes in treatment prevalence.

The repeated design is:

\[ \tilde X_b^{(m)} \sim \hat P_m(\cdot \mid D = d_i), \qquad i=1,\ldots,n,\quad b=1,\ldots,B, \]

where $m \in \{\text{WGAN}, \text{DDPM}\}$ indexes the generator and $d_i$ is the observed treatment assignment for row $i$. Each replicate is evaluated by the same regression-adjustment ATT and by the same distribution distances used above.

Code

def synthetic_from_wgan(context_array: np.ndarray) -> pd.DataFrame:
    generated = wgan.sample(len(context_array), context=context_array).numpy()
    return rebuild_lalonde(
        lalonde_transformer.inverse_transform(generated),
        context_array,
    )

def synthetic_from_ddpm(context_array: np.ndarray) -> pd.DataFrame:
    generated = ddpm.sample(len(context_array), context=context_array).numpy()
    return rebuild_lalonde(
        lalonde_transformer.inverse_transform(generated),
        context_array,
    )

def simulation_row(model_name: str, replicate: int, df: pd.DataFrame) -> dict[str, float | int | str]:
    att = result_row(model_name, df)
    metrics = distribution_metrics(
        lalonde[LDW_COLUMNS].to_numpy(),
        df[LDW_COLUMNS].to_numpy(),
        n_projections=64,
        seed=SEED + replicate,
    )
    return {
        "model": model_name,
        "replicate": replicate,
        "att_dollars": att["att_dollars"],
        "se_dollars": att["se_dollars"],
        "att_error_dollars": att["att_dollars"] - regression_adjustment_att(lalonde)["att_dollars"],
        "sliced_wasserstein": metrics["sliced_wasserstein"],
        "marginal_ks_mean": metrics["marginal_ks_mean"],
        "corr_frobenius": metrics["corr_frobenius"],
        "cov_frobenius": metrics["cov_frobenius"],
    }

N_REPLICATES = 12
simulation_rows = []
for b in range(N_REPLICATES):
    simulation_rows.append(simulation_row("WGAN-GP", b, synthetic_from_wgan(context)))
    simulation_rows.append(simulation_row("DDPM", b, synthetic_from_ddpm(context)))

simulation_table = pd.DataFrame(simulation_rows)
simulation_summary = (
    simulation_table
    .groupby("model")
    .agg(
        att_mean_dollars=("att_dollars", "mean"),
        att_sd_dollars=("att_dollars", "std"),
        abs_att_error_mean=("att_error_dollars", lambda x: np.mean(np.abs(x))),
        sliced_wasserstein_mean=("sliced_wasserstein", "mean"),
        marginal_ks_mean=("marginal_ks_mean", "mean"),
        corr_frobenius_mean=("corr_frobenius", "mean"),
        cov_frobenius_mean=("cov_frobenius", "mean"),
    )
    .reset_index()
)
simulation_table.to_csv(RESULTS_DIR / "lalonde_repeated_draws.csv", index=False)
simulation_summary.to_csv(RESULTS_DIR / "lalonde_repeated_summary.csv", index=False)
simulation_summary.round(4).style.hide(axis="index")

model	att_mean_dollars	att_sd_dollars	abs_att_error_mean	sliced_wasserstein_mean	marginal_ks_mean	corr_frobenius_mean	cov_frobenius_mean
DDPM	1460.777500	646.709800	773.269300	0.881900	0.084600	0.627600	84.818000
WGAN-GP	333.496600	65.824700	356.361600	3.684200	0.292000	3.211000	247.855500

The repeated draws turn the generator comparison into a Monte Carlo object. The ATT columns ask whether the downstream estimator sees a similar treatment-effect problem. The distance columns ask whether the synthetic tables reproduce the empirical distribution that makes the treatment-effect problem hard. In this run, the WGAN has a lower average absolute ATT error, while the DDPM is much closer on the joint distribution metrics. That split is informative: an estimator-specific target can look acceptable even when the generated table is too far from the empirical row distribution. For simulation design, the DDPM is the more credible row-distribution baseline in this run because it preserves more of the distribution that future estimators will actually face. For estimator-specific calibration, the WGAN result is a warning: a generator can be less faithful to the table and still hit one downstream target more closely.

Code

fig, axes = plt.subplots(1, 2, figsize=(8, 3.6), constrained_layout=True)
for model, group in simulation_table.groupby("model"):
    axes[0].scatter(
        np.full(len(group), 0 if model == "WGAN-GP" else 1) + rng.normal(0, 0.035, len(group)),
        group["att_dollars"],
        alpha=0.75,
        s=26,
        label=model,
    )
    axes[1].scatter(
        np.full(len(group), 0 if model == "WGAN-GP" else 1) + rng.normal(0, 0.035, len(group)),
        group["sliced_wasserstein"],
        alpha=0.75,
        s=26,
        label=model,
    )
axes[0].axhline(regression_adjustment_att(lalonde)["att_dollars"], color="black", lw=1, ls="--")
axes[0].set_title("ATT across synthetic datasets")
axes[0].set_ylabel("ATT, dollars")
axes[1].set_title("Sliced Wasserstein")
axes[1].set_ylabel("distance")
for ax in axes:
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["WGAN", "DDPM"])
plt.show()

Treatment-Share Ablation

The CPS target has only 185 treated observations out of 16,177 rows. That imbalance is part of the empirical problem. A learned conditional generator lets us change this feature while keeping the fitted conditional row distributions fixed.

The ablation below draws a balanced sample with 500 treated and 500 controls from each fitted generator. This does not estimate a new population parameter. It asks how the regression-adjustment estimand behaves when the rare-treatment geometry is relaxed while the conditional outcome and covariate distributions are generated by the same fitted model.

Code

balanced_context = np.r_[np.ones((500, 1)), np.zeros((500, 1))].astype(np.float32)

wgan_balanced = synthetic_from_wgan(balanced_context)
ddpm_balanced = synthetic_from_ddpm(balanced_context)

ablation_table = pd.DataFrame([
    result_row("WGAN-GP observed share", wgan_fake),
    result_row("DDPM observed share", ddpm_fake),
    result_row("WGAN-GP balanced 500/500", wgan_balanced),
    result_row("DDPM balanced 500/500", ddpm_balanced),
])
ablation_table.to_csv(RESULTS_DIR / "lalonde_treatment_share_ablation.csv", index=False)
ablation_table.round(4).style.hide(axis="index")

sample	att_thousand	att_dollars	se_thousand	se_dollars	n	n_treated	n_control	status	treated_share
WGAN-GP observed share	0.325000	324.978700	0.093900	93.919900	16177	185	15992	ok	0.011400
DDPM observed share	1.887900	1887.869900	0.476900	476.898700	16177	185	15992	ok	0.011400
WGAN-GP balanced 500/500	0.426500	426.503000	0.050900	50.851200	1000	500	500	ok	0.500000
DDPM balanced 500/500	0.196400	196.430500	0.284500	284.450600	1000	500	500	ok	0.500000

Code

ablation_plot = ablation_table.query("sample != 'Observed CPS'").copy()
fig, ax = plt.subplots(figsize=(8, 3.6), constrained_layout=True)
ax.barh(ablation_plot["sample"], ablation_plot["att_dollars"])
ax.axvline(regression_adjustment_att(lalonde)["att_dollars"], color="black", lw=1, ls="--")
ax.set_xlabel("regression-adjustment ATT, dollars")
ax.set_title("ATT under treatment-share ablation")
plt.show()

This is the kind of ablation that learned simulation designs should make routine. The empirical target is used to fit the conditional row distribution; the analyst then changes one design feature and watches the downstream estimator respond. More substantive ablations would alter overlap, tail thickness, panel persistence, market concentration, or the outcome equation. The same implementation pattern applies: hold the fitted conditional generator fixed, change the design object deliberately, and report both distributional diagnostics and estimator consequences.

The balanced-sample exercise is a controlled perturbation of the simulation environment rather than an estimate of the original CPS estimand. The observed-share synthetic rows keep the empirical 1.14 percent treated share, while the balanced ablation has a treated share of 50 percent. The negative DDPM value in the ablation table is an estimated ATT in dollars, not a treatment share. It flags an important sensitivity: the conditional treated and control distributions learned from a rare-treatment sample should not be extrapolated casually to a balanced population without additional diagnostics.

Limitations

Diffusion models do not remove judgment from simulation design. They move the judgment to preprocessing, conditioning, diagnostics, and ablation design. That is still progress, because those choices are visible and auditable.

There are also limits:

a DDPM trained on one empirical dataset cannot invent support that is absent from that dataset;
synthetic draws inherit sampling noise and measurement artifacts from the source data;
postprocessing can hide model failures if it is too aggressive;
small treatment cells remain hard because the conditional distribution is weakly learned;
privacy is not automatic and should not be claimed without a separate privacy analysis.

Finally, DDPMs are not always better than WGANs. If the table is small, smooth, and continuous, a WGAN or normalizing flow may be enough. If fast generation is essential, WGANs have a real advantage. If exact likelihood is required, flows are more natural. The case for DDPMs is strongest when the simulation target has complicated support, mixed variable types, and estimator-relevant dependence patterns that adversarial training struggles to preserve.

Conclusion

The AIMM paper made an important methodological point: Monte Carlo studies can be made more credible by learning simulation designs from real applications. The next step is to broaden both the search over simulation objects and the search over baselines. Sometimes a resampling design will be enough. Sometimes a copula or flow will be the cleanest fit. Sometimes a WGAN will match a target estimand well even if it misses the broader row geometry. And sometimes a DDPM will be preferable because it preserves more of the dependence structure needed for later ablations.

That is why the right evaluation target is not simply whether one model wins a horse race on one downstream estimand. The central question is whether the learned design preserves the empirical geometry that makes the estimator succeed or fail, and whether it supports controlled perturbations of that geometry. On that criterion, diffusion is promising, but the more durable contribution is the design framework rather than the claim that one generator uniformly dominates all others.

References

Arjovsky, Martin, Soumith Chintala, and Leon Bottou. 2017. “Wasserstein Generative Adversarial Networks.” In Proceedings of the 34th International Conference on Machine Learning, 214–23.

Athey, Susan, Guido W. Imbens, Jonas Metzger, and Evan Munro. 2024. “Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations.” Journal of Econometrics.

Austin, Jacob, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. 2021. “Structured Denoising Diffusion Models in Discrete State-Spaces.” In Advances in Neural Information Processing Systems.

Gulrajani, Ishaan, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. 2017. “Improved Training of Wasserstein GANs.” In Advances in Neural Information Processing Systems.

Ho, Jonathan, Ajay Jain, and Pieter Abbeel. 2020. “Denoising Diffusion Probabilistic Models.” In Advances in Neural Information Processing Systems.

Hoogeboom, Emiel, Didrik Nielsen, Priyank Jaini, Patrick Forré, and Max Welling. 2021. “Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions.” arXiv Preprint arXiv:2102.05379.

Kotelnikov, Akim, Dmitry Baranchuk, Ivan Rubachev, and Artem Babenko. 2023. “TabDDPM: Modelling Tabular Data with Diffusion Models.” In Proceedings of the 40th International Conference on Machine Learning, 17564–79.

Nichol, Alexander Quinn, and Prafulla Dhariwal. 2021. “Improved Denoising Diffusion Probabilistic Models.” In Proceedings of the 38th International Conference on Machine Learning, 8162–71.

Papamakarios, George, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. 2021. “Normalizing Flows for Probabilistic Modeling and Inference.” Journal of Machine Learning Research 22 (57): 1–64.

Song, Jiaming, Chenlin Meng, and Stefano Ermon. 2021. “Denoising Diffusion Implicit Models.” In International Conference on Learning Representations.

Song, Yang, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. “Score-Based Generative Modeling Through Stochastic Differential Equations.” In International Conference on Learning Representations.