Panel Estimator Bakeoff

Formal estimators, synthetic checks, and the TROP semi-synthetic benchmark

Executive Read

This page is the clean memo version of the current bakeoff. It has three jobs:

Define the estimators being compared.
Report the synthetic panel results where the untreated path is known.
Report the semi-synthetic Table 1-style benchmark from the TROP replication archive.

The headline after the latest rerun is:

The original global MTGPAX_MAP is now numerically finite on the full TROP semi-synthetic study, but its estimates are far too variable: mean variance 0.2311, against 0.0141 for TROP.
The useful MTGP variant is a scalar-effect objective with unpenalized additive unit and time fixed effects:

\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}. \]

On the 21-design, 100-replication TROP semi-synthetic benchmark, TROP still has the best mean RMSE (0.1029). The scalar-effect fixed-effect MTGP is second, tied with matrix completion (MC) on mean RMSE to four decimals (0.1192 each), with lower mean absolute bias (0.0237 vs 0.0529) but higher mean variance (0.0279 vs 0.0162).
All methods in the final full rerun are finite in all 2,100 design-replication cells.

Panel Setup

Let \(Y_{it}\) denote the observed outcome for unit \(i=0,\ldots,N-1\) and period \(t=0,\ldots,T-1\). Let \(W_{it}=1\) mark treated post-period cells. In the simple single-adoption case used for exposition, unit \(i=0\) is treated from period \(T_0\) onward, so

\[ W_{it} = \mathbf{1}\{i=0,\ t\ge T_0\}. \]

The target path is the missing untreated potential outcome \(Y_{0t}(0)\) for \(t\ge T_0\), and the average treatment effect on the treated is

\[ \tau_{\mathrm{ATT}} = \frac{1}{T-T_0}\sum_{t=T_0}^{T-1} \{Y_{0t} - Y_{0t}(0)\}. \]

Most estimators below either impute the missing path \(\widehat Y_{0t}(0)\) and average residuals, or directly estimate a scalar \(\widehat\tau\) inside an objective.
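
As a minimal sketch of the imputation route, the snippet below averages post-period gaps between the treated unit's observed path and an imputed untreated path. The function name and the toy numbers are illustrative assumptions; periods are indexed so that the post-period runs from \(T_0\) through \(T-1\), matching the ATT formula above.

```python
import numpy as np

def att_from_imputed_path(y_treated, y0_hat, t0):
    """Average post-period gap between observed and imputed untreated outcomes."""
    y_treated = np.asarray(y_treated, dtype=float)
    y0_hat = np.asarray(y0_hat, dtype=float)
    gaps = y_treated[t0:] - y0_hat[t0:]      # one gap per post-treatment period
    return float(gaps.mean())

# Toy example (hypothetical numbers): T = 6 periods, treatment starts at T0 = 4.
y_obs = np.array([1.0, 1.2, 1.1, 1.3, 0.6, 0.5])
y0_hat = np.array([1.0, 1.2, 1.1, 1.3, 1.4, 1.5])
att_hat = att_from_imputed_path(y_obs, y0_hat, t0=4)
print(att_hat)
```

Estimators that put a scalar \(\tau\) inside the objective skip this averaging step and report the fitted coefficient directly.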

Estimators

MTGPAX MAP

The original MTGPAX estimator is a low-rank multi-task GP outcome model for the untreated surface. In its current MAP implementation,

\[ m_{it} = \nu_i + \sum_{r=1}^R \beta_{ir} u_{rt}, \qquad u_r \sim \mathcal{GP}(0, K_t). \]

It fits only observed untreated cells

\[ \mathcal{O}= \{(i,t): W_{it}=0\}, \]

then estimates effects by post-treatment residuals:

\[ \widehat\tau_{\mathrm{MTGPAX}} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} \{Y_{it}-\widehat m_{it}\}, \]

where \(\mathcal{M}=\{(i,t): W_{it}=1\}\). The model now uses bounded scale transforms, gradient clipping, parameter clipping, and nonfinite-gradient sanitization. Those changes fixed the nonfinite optimizer issue in the final full rerun, but the estimator remains high-variance on the TROP benchmark.
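
To make the surface and the residual estimator concrete, here is a rough sketch that draws \(m_{it}\) from a low-rank GP prior and averages residuals over treated cells. The memo does not state the form of \(K_t\); the RBF kernel, the lengthscale, and all function names here are assumptions for illustration, and this is not the MTGPAX fitting code.

```python
import numpy as np

def rbf_kernel(times, lengthscale=3.0, jitter=1e-6):
    """Assumed RBF time kernel; jitter keeps the Cholesky factorization stable."""
    d = times[:, None] - times[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2) + jitter * np.eye(len(times))

def draw_low_rank_surface(n_units, n_periods, rank, rng):
    """Draw m_it = nu_i + sum_r beta_ir * u_rt with u_r ~ GP(0, K_t)."""
    times = np.arange(n_periods, dtype=float)
    chol = np.linalg.cholesky(rbf_kernel(times))
    u = (chol @ rng.standard_normal((n_periods, rank))).T    # (R, T) factor paths
    beta = rng.standard_normal((n_units, rank))              # (N, R) loadings
    nu = rng.standard_normal(n_units)                        # (N,) unit intercepts
    return nu[:, None] + beta @ u                            # (N, T) surface

def residual_att(Y, m_hat, W):
    """Average Y - m_hat over treated cells (W == 1)."""
    mask = np.asarray(W).astype(bool)
    return float((Y[mask] - m_hat[mask]).mean())

rng = np.random.default_rng(0)
m = draw_low_rank_surface(n_units=5, n_periods=20, rank=2, rng=rng)
W = np.zeros((5, 20))
W[0, 15:] = 1.0
Y = m + 0.5 * W                       # noiseless panel with a constant effect of 0.5
tau_hat = residual_att(Y, m, W)       # recovers 0.5 when m_hat equals the true surface
```

In practice \(\widehat m\) is fitted only on untreated cells, so the residual average inherits all extrapolation error in the post-period surface.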

Scalar-Effect Fixed-Effect MTGP

The successful MTGP variant estimates the treatment effect jointly instead of fitting a full post-period residual path. Its likelihood includes the treated post-period cells and forces the treatment signal through one scalar:

\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}, \]

with

\[ m_{it} = \sum_{r=1}^R \beta_{ir}u_{rt}, \qquad u_r\sim\mathcal{GP}(0,K_t). \]

The implemented MAP objective is

\[ \min_{a,b,\beta,u,\tau} \sum_{it} q_{it} \{Y_{it}-a_i-b_t-m_{it}-\tau W_{it}\}^2 + \lambda_m \mathcal{P}_{\mathrm{MTGP}}(m), \]

where \(a_i\) and \(b_t\) are centered for identification and left unpenalized. The scalar \(\tau\) is also unpenalized by default. Only the MTGP surface terms are regularized. This matters: penalizing \(\tau\) or omitting the additive fixed effects materially worsened results in the six-design crude ablation.

The reported row SCALAR_EFFECT_FE_MTGP uses \(q_{it}=1\). The row TROP_WEIGHTED_SCALAR_FE_MTGP uses TROP-style product weights in the same objective:

\[ q_{it} = \omega_i \theta_t. \]
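
The objective can be sketched as a plain function of its parameters. The memo does not spell out \(\mathcal{P}_{\mathrm{MTGP}}\), so the penalty below (a GP quadratic form on each factor path plus a ridge on the loadings) is an assumed stand-in, and the function and argument names are hypothetical.

```python
import numpy as np

def scalar_fe_objective(Y, W, a, b, beta, u, tau, q=None, lam_m=1.0, K_inv=None):
    """Weighted squared loss plus an ASSUMED quadratic stand-in for P_MTGP."""
    n_units, n_periods = Y.shape
    q = np.ones((n_units, n_periods)) if q is None else np.asarray(q)
    K_inv = np.eye(n_periods) if K_inv is None else K_inv
    m = beta @ u                                           # (N, T) low-rank surface
    resid = Y - a[:, None] - b[None, :] - m - tau * W
    loss = np.sum(q * resid ** 2)
    # Assumption: sum_r u_r' K^{-1} u_r plus a ridge on loadings; a, b, tau unpenalized.
    penalty = np.einsum("rt,ts,rs->", u, K_inv, u) + np.sum(beta ** 2)
    return float(loss + lam_m * penalty)

rng = np.random.default_rng(0)
N, T, R = 4, 6, 2
a = rng.normal(size=N); a -= a.mean()                      # centered unit effects
b = rng.normal(size=T); b -= b.mean()                      # centered time effects
beta, u = rng.normal(size=(N, R)), rng.normal(size=(R, T))
W = np.zeros((N, T)); W[0, 4:] = 1.0                       # two treated cells
tau = -0.5
Y = a[:, None] + b[None, :] + beta @ u + tau * W           # noiseless panel

at_truth = scalar_fe_objective(Y, W, a, b, beta, u, tau, lam_m=0.0)
shifted = scalar_fe_objective(Y, W, a, b, beta, u, tau + 0.1, lam_m=0.0)
q_prod = np.outer([2.0, 1.0, 1.0, 1.0], np.ones(T))        # q_it = omega_i * theta_t
shifted_weighted = scalar_fe_objective(Y, W, a, b, beta, u, tau + 0.1,
                                       q=q_prod, lam_m=0.0)
```

With the penalty switched off, moving \(\tau\) by 0.1 costs 0.01 per treated cell under \(q_{it}=1\), and the product weights rescale that cost cell by cell.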

TROP, SDID, SC, DID, MC, and DIFP

The semi-synthetic section calls the six estimators in the ZBW TROP replication archive directly. The main TROP row solves a weighted two-way fixed-effect plus nuclear-norm problem with treatment included as a scalar regressor:

\[ (\widehat\mu,\widehat\alpha,\widehat\gamma,\widehat\tau,\widehat L) = \arg\min_{\mu,\alpha,\gamma,\tau,L} \sum_{i,t} \left[ \delta_{it} \{Y_{it}-\mu-\alpha_i-\gamma_t-\tau W_{it}-L_{it}\} \right]^2 + \lambda_{\mathrm{nn}}\|L\|_*. \]

Here \(\delta_{it}\) is the product of exponential unit and time localization weights. The archive multiplies residuals by \(\delta_{it}\) before squaring, so the effective squared-loss weight is \(\delta_{it}^2\). Tuning follows the archive’s placebo cross-validation routine over unit localization, time localization, and nuclear-norm penalty.
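
The point about \(\delta_{it}\) versus \(\delta_{it}^2\) is easy to check numerically. In the sketch below the exponential-decay form, the distance inputs, and the rate parameters `lam_unit` and `lam_time` are assumptions for illustration; the archive's actual distance definitions are not reproduced here.

```python
import numpy as np

def localization_weights(unit_dist, time_dist, lam_unit, lam_time):
    """Product of exponential unit and time weights: delta_it = omega_i * theta_t."""
    omega = np.exp(-lam_unit * np.asarray(unit_dist, dtype=float))   # (N,)
    theta = np.exp(-lam_time * np.asarray(time_dist, dtype=float))   # (T,)
    return np.outer(omega, theta)

def trop_style_loss(resid, delta):
    """Multiply residuals by delta first, then square, as described above."""
    return float(np.sum((delta * resid) ** 2))

rng = np.random.default_rng(0)
resid = rng.normal(size=(3, 5))
time_dist = np.arange(5)[::-1]            # hypothetical distance to the last period
delta = localization_weights([0.0, 0.5, 2.0], time_dist, lam_unit=1.0, lam_time=0.3)

loss_archive_style = trop_style_loss(resid, delta)
loss_weighted = float(np.sum(delta ** 2 * resid ** 2))     # same number, weight delta^2
```

In this sketch, setting both rates to zero gives \(\delta_{it}=1\) everywhere, which removes the localization entirely.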

The related archive rows are:

SDID: synthetic difference-in-differences style two-way balancing.
SC: synthetic control.
DID: two-way difference-in-differences.
MC: the same weighted TWFE/nuclear-norm solver with localization switched off.
DIFP: the archive’s factor/imputation comparator.
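
For readers unfamiliar with nuclear-norm penalties: the standard building block is singular-value soft-thresholding, which solves \(\min_X \tfrac12\|X-A\|_F^2 + \kappa\|X\|_*\) in closed form. The sketch below shows that generic operator only; whether the archive's TROP/MC solver uses exactly this update is an assumption, and the function names are illustrative.

```python
import numpy as np

def svt(A, threshold):
    """Singular-value soft-thresholding: shrink each singular value toward zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_shrunk = np.maximum(s - threshold, 0.0)
    return (U * s_shrunk) @ Vt

def nuclear_norm(A):
    """Sum of singular values."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

A = np.diag([3.0, 1.0])
L_hat = svt(A, threshold=1.5)     # singular values 3 and 1 become 1.5 and 0
```

Larger thresholds zero out more singular values, so the penalty level controls the effective rank of \(\widehat L\).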

The main conceptual distinction is that TROP is target-local and diagonal in its weighting, while MTGPAX is a coherent global outcome model. The scalar-effect fixed-effect MTGP is an attempt to keep the MTGP surface while adopting the low-dimensional treatment-effect channel and unpenalized additive structure that help TROP and SDID.

Other Synthetic-Only Comparators

The earlier synthetic path bakeoff also includes Graph-Laplacian MTGP, Spectral Laplacian MTGP, Dynamax factor SSM, SYNTH-lite, SDID-lite, and TROP-lite. Those are diagnostic comparators, not the final TROP archive run. They are useful for understanding whether graph locality, forecasting, or TROP-style residual balancing helps on a known synthetic factor DGP.

Synthetic Data Bakeoff

The synthetic panel DGP is a two-factor untreated surface with nonlinear time curvature. The treated unit has high factor loadings, so simple donor averaging and flat extrapolation are deliberately stressed. Here the untreated post-treatment path is known, so we can score both ATT error and pointwise counterfactual path error.
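
A DGP of this general shape can be sketched as follows. Every specific here (the two factor shapes, loading scales, panel size, noise level, and the constant effect of -0.75) is a hypothetical placeholder rather than the configuration behind the tables below.

```python
import numpy as np

def simulate_factor_panel(n_units=30, n_periods=40, t0=30, att=-0.75,
                          noise_sd=0.1, seed=0):
    """Two-factor untreated surface with curved time factors; unit 0 is treated."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n_periods)
    factors = np.vstack([t ** 2, np.sin(2.0 * np.pi * t)])     # (2, T), nonlinear in t
    loadings = rng.normal(0.0, 0.5, size=(n_units, 2))
    loadings[0] = [2.0, 1.5]                                   # high treated loadings
    y0 = loadings @ factors + rng.normal(0.0, noise_sd, size=(n_units, n_periods))
    W = np.zeros((n_units, n_periods))
    W[0, t0:] = 1.0
    Y = y0 + att * W                                           # observed outcomes
    return Y, y0, W

Y, y0, W = simulate_factor_panel()
true_att = float((Y - y0)[W == 1].mean())
```

Because `y0` is returned alongside `Y`, both the ATT error and the pointwise counterfactual error can be scored against ground truth.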

For the displayed seed:

quantity	value
true ATT	-0.7525
MTGPAX ATT	-0.5066
Graph-Laplacian MTGP ATT	-0.4766
Spectral Laplacian MTGP ATT	-0.1018
Dynamax factor SSM ATT	-0.6924
TROP nuclear ATT	0.3903
MTGPAX CF RMSE	0.2614
Graph-Laplacian MTGP CF RMSE	0.3053
Spectral Laplacian MTGP CF RMSE	0.7113
Dynamax factor SSM CF RMSE	0.1196
TROP nuclear CF RMSE	1.1585

Across 20 seeds:

method	ATT RMSE	mean abs ATT error	mean CF RMSE
MTGPAX	0.1200	0.0926	0.1279
Dynamax factor SSM	0.1431	0.1109	0.1613
Graph-Laplacian MTGP	0.3042	0.2589	0.2947
TROP-lite	0.4049	0.3719	0.3966
Spectral Laplacian MTGP	0.6235	0.5309	0.5780
TROP nuclear	0.9234	0.8602	0.8846
SDID-lite	1.0026	0.8910	0.9343
SYNTH-lite	1.3280	1.1615	1.1999
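
One plausible reading of the three column headings is sketched below: ATT RMSE and mean absolute ATT error are taken across seeds, and CF RMSE is computed per seed over post-period cells and then averaged. These definitions and the function name are assumptions, not a transcript of the scoring code.

```python
import numpy as np

def bakeoff_metrics(att_hat, att_true, cf_hat, cf_true):
    """Summaries across seeds; cf arrays have shape (seeds, post_periods)."""
    att_err = np.asarray(att_hat, dtype=float) - np.asarray(att_true, dtype=float)
    cf_err = np.asarray(cf_hat, dtype=float) - np.asarray(cf_true, dtype=float)
    cf_rmse_by_seed = np.sqrt(np.mean(cf_err ** 2, axis=1))
    return {
        "att_rmse": float(np.sqrt(np.mean(att_err ** 2))),
        "mean_abs_att_error": float(np.mean(np.abs(att_err))),
        "mean_cf_rmse": float(np.mean(cf_rmse_by_seed)),
    }

# Toy check with two seeds and two post periods (hypothetical numbers).
metrics = bakeoff_metrics(
    att_hat=[0.1, -0.3], att_true=[0.0, 0.0],
    cf_hat=[[0.1, 0.1], [-0.3, -0.3]], cf_true=np.zeros((2, 2)),
)
```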

On this specific factor DGP, the original MTGPAX outcome model is the best of the earlier synthetic-path estimators. That result should not be overread: this DGP is friendly to a global low-rank GP surface. The TROP semi-synthetic benchmark below is a different test, built from empirical panels and target-local assignment mechanisms.

TROP Semi-Synthetic Setup

The TROP-paper-style benchmark follows the ZBW replication archive. There are 21 designs: CPS, PWT, Germany, Basque, Smoking, and Boatlift panels crossed with policy, random, simulated, or original-treated-unit assignment regimes. Each design uses 100 Monte Carlo replications, for 2,100 simulated panels.

For each source panel, the observed outcome matrix is normalized and decomposed with a rank-4 SVD. The normalized panel is then split into three components:

\(F\): additive unit and time structure from row and column means;
\(M\): the remaining rank-4 interactive component after removing \(F\);
\(E\): residuals from the rank-4 reconstruction.

The residuals are used to fit an AR(2) time-series covariance. A simulated panel is drawn as

\[ Y = F + M + \varepsilon, \]

where \(\varepsilon\) is drawn independently by unit from the fitted AR(2) covariance. No treatment effect is added, so the true effect is zero in every replication. The estimator’s scalar effect estimate is therefore its estimation error.
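
The simulation recipe above can be sketched in a few functions. This is an approximation under stated assumptions: the input is taken to be already normalized, the AR(2) coefficients are fitted by pooled least squares, and noise is drawn by recursion with a burn-in rather than from an explicit covariance matrix. All function names are hypothetical and the archive code may differ in each of these choices.

```python
import numpy as np

def decompose_panel(Y, rank=4):
    """Split Y into additive part F, interactive part M, and residual E."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]            # rank-4 reconstruction
    row = L.mean(axis=1, keepdims=True)
    col = L.mean(axis=0, keepdims=True)
    F = row + col - L.mean()                            # additive unit + time structure
    return F, L - F, Y - L

def fit_ar2(E):
    """Pooled least-squares AR(2) fit across units (an assumed fitting method)."""
    y = E[:, 2:].ravel()
    X = np.column_stack([E[:, 1:-1].ravel(), E[:, :-2].ravel()])   # lags 1 and 2
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi, float(np.std(y - X @ phi))

def draw_ar2_noise(n_units, n_periods, phi, sigma, rng, burn=50):
    """Independent AR(2) paths by unit, discarding a burn-in."""
    total = n_periods + burn
    shocks = rng.normal(0.0, sigma, size=(n_units, total))
    eps = np.zeros((n_units, total))
    for t in range(2, total):
        eps[:, t] = phi[0] * eps[:, t - 1] + phi[1] * eps[:, t - 2] + shocks[:, t]
    return eps[:, burn:]

def simulate_panel(Y, rng, rank=4):
    """Draw Y_sim = F + M + eps with no treatment effect added."""
    F, M, E = decompose_panel(Y, rank)
    phi, sigma = fit_ar2(E)
    return F + M + draw_ar2_noise(*Y.shape, phi, sigma, rng)

rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 30))          # stand-in for a normalized source panel
F, M, E = decompose_panel(Y)
Y_sim = simulate_panel(Y, rng)
```

By construction \(F + M + E\) reproduces the input exactly, and \(M\) has zero row and column means because the additive part has been removed.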

The assignment mechanisms are:

policy: estimate treatment propensities from original assignment on rank-4 unit factors, then sample treated units, capped at 10;
random: sample 10 treated units uniformly at random;
simulated: sample from propensities based on the archive’s synthetic-control weights for the originally treated unit;
treated_unit: keep the original treated unit fixed and redraw AR(2) noise.
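
The two simplest regimes, random and treated_unit, reduce to building a block treatment matrix over the final periods. The sketch below covers only those two; the policy and simulated regimes depend on archive-specific propensity fits that are not reproduced here. Function names are illustrative assumptions.

```python
import numpy as np

def block_treatment(n_units, n_periods, treated_units, n_post=10):
    """W_it = 1 for the chosen units in the final n_post periods."""
    W = np.zeros((n_units, n_periods), dtype=int)
    W[np.asarray(treated_units), n_periods - n_post:] = 1
    return W

def random_assignment(n_units, n_periods, rng, n_treated=10, n_post=10):
    """Random regime: sample treated units uniformly without replacement."""
    treated = rng.choice(n_units, size=n_treated, replace=False)
    return block_treatment(n_units, n_periods, treated, n_post)

rng = np.random.default_rng(0)
W_random = random_assignment(n_units=50, n_periods=40, rng=rng)
W_fixed = block_treatment(n_units=50, n_periods=40, treated_units=[3])  # treated_unit regime
```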

All designs use the final 10 periods as treated periods. The archive estimators return scalar treatment-effect estimates. MTGPAX_MAP is the old residual-path MTGP averaged over treated post cells. The two scalar-effect MTGP variants return the jointly estimated \(\widehat\tau\).

Semi-Synthetic Results

The final full rerun has no nonfinite estimates: every estimator below has 2,100/2,100 finite values.

method	designs	finite_reps	mean_RMSE	median_RMSE	mean_abs_bias	mean_variance
TROP	21	2100	0.1029	0.0500	0.0384	0.0141
SCALAR_EFFECT_FE_MTGP	21	2100	0.1192	0.0498	0.0237	0.0279
MC	21	2100	0.1192	0.0546	0.0529	0.0162
TROP_WEIGHTED_SCALAR_FE_MTGP	21	2100	0.1229	0.0632	0.0358	0.0261
SDID	21	2100	0.1343	0.0724	0.0606	0.0228
DIFP	21	2100	0.1415	0.0736	0.0589	0.0257
SC	21	2100	0.1433	0.0812	0.0726	0.0185
DID	21	2100	0.2130	0.1687	0.1220	0.0343
MTGPAX_MAP	21	2100	0.3182	0.1280	0.0465	0.2311

The design-level winners are:

key	dataset	assignment	method	rmse	bias	variance
basque_random	Basque	random	SCALAR_EFFECT_FE_MTGP	0.0297	0.0026	0.0009
basque_simulated	Basque	simulated	TROP	0.0510	-0.0038	0.0026
basque_treated_unit	Basque	treated_unit	SC	0.0344	0.0015	0.0012
boatlift_random	Boatlift	random	TROP	0.1220	0.0255	0.0144
boatlift_simulated	Boatlift	simulated	TROP	0.2963	0.0446	0.0867
boatlift_treated_unit	Boatlift	treated_unit	SC	0.1453	0.0217	0.0208
cps_hours_minwage	CPS	policy	MC	0.1668	0.0560	0.0249
cps_logwage_abortion	CPS	policy	TROP	0.0238	0.0037	0.0006
cps_logwage_gunlaw	CPS	policy	TROP	0.0281	0.0036	0.0008
cps_logwage_minwage	CPS	policy	TROP	0.0272	0.0083	0.0007
cps_logwage_random	CPS	random	TROP	0.0248	-0.0036	0.0006
cps_urate_minwage	CPS	policy	SCALAR_EFFECT_FE_MTGP	0.1589	0.0368	0.0241
germany_random	Germany	random	TROP_WEIGHTED_SCALAR_FE_MTGP	0.0250	0.0049	0.0006
germany_simulated	Germany	simulated	TROP_WEIGHTED_SCALAR_FE_MTGP	0.0366	-0.0027	0.0013
germany_treated_unit	Germany	treated_unit	TROP_WEIGHTED_SCALAR_FE_MTGP	0.0386	-0.0153	0.0013
pwt_loggdp_democracy	PWT	policy	TROP	0.0245	0.0107	0.0005
pwt_loggdp_education	PWT	policy	TROP	0.0261	0.0086	0.0006
pwt_loggdp_random	PWT	random	TROP	0.0239	-0.0040	0.0006
smoking_random	Smoking	random	TROP	0.0898	0.0037	0.0081
smoking_simulated	Smoking	simulated	SCALAR_EFFECT_FE_MTGP	0.2226	0.0098	0.0500
smoking_treated_unit	Smoking	treated_unit	TROP_WEIGHTED_SCALAR_FE_MTGP	0.2274	-0.0500	0.0497
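
Because the true effect is zero, each design's rmse, bias, and variance come straight from the replicated estimates, and they are linked by \(\mathrm{rmse}^2 = \mathrm{bias}^2 + \tfrac{n-1}{n}\,\mathrm{variance}\) when the variance uses the \(n-1\) denominator. The rounded table values appear consistent with that convention, but this is an inference from the numbers, not something the archive documentation is quoted as stating. The function name is hypothetical.

```python
import numpy as np

def design_summary(tau_hat):
    """rmse, bias, and sample variance of estimates when the true effect is zero."""
    tau_hat = np.asarray(tau_hat, dtype=float)
    rmse = float(np.sqrt(np.mean(tau_hat ** 2)))
    bias = float(tau_hat.mean())
    variance = float(tau_hat.var(ddof=1))        # assumed n-1 denominator
    return rmse, bias, variance

rng = np.random.default_rng(0)
tau_hat = rng.normal(0.02, 0.1, size=100)        # stand-in for 100 replications
rmse, bias, variance = design_summary(tau_hat)
n = tau_hat.size
identity_gap = rmse ** 2 - (bias ** 2 + (n - 1) / n * variance)   # ~0 up to rounding

# Check against the smoking_treated_unit row: bias -0.0500, variance 0.0497.
implied_rmse = float(np.sqrt((-0.0500) ** 2 + 0.99 * 0.0497))     # close to 0.2274
```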

The compact design-level comparison below keeps the main rows in view:

key	dataset	assignment	TROP	Scalar FE MTGP	TROP-weighted scalar FE MTGP	MC	old MTGPAX MAP
cps_logwage_minwage	CPS	policy	0.0272	0.0498	0.0632	0.0371	0.0720
cps_urate_minwage	CPS	policy	0.1846	0.1589	0.1819	0.2280	0.2023
cps_hours_minwage	CPS	policy	0.1709	0.2548	0.2523	0.1668	0.3193
cps_logwage_gunlaw	CPS	policy	0.0281	0.0388	0.0431	0.0355	0.1240
cps_logwage_abortion	CPS	policy	0.0238	0.0358	0.0419	0.0303	0.0749
cps_logwage_random	CPS	random	0.0248	0.0340	0.0447	0.0303	0.0938
pwt_loggdp_democracy	PWT	policy	0.0245	0.0333	0.0324	0.0435	0.0683
pwt_loggdp_education	PWT	policy	0.0261	0.0391	0.0308	0.0426	0.0835
pwt_loggdp_random	PWT	random	0.0239	0.0278	0.0349	0.0331	0.0491
germany_random	Germany	random	0.0258	0.0272	0.0250	0.0331	0.0688
germany_simulated	Germany	simulated	0.0500	0.0388	0.0366	0.0670	0.0964
germany_treated_unit	Germany	treated_unit	0.0464	0.0439	0.0386	0.0523	0.1217
basque_random	Basque	random	0.0463	0.0297	0.0405	0.0546	0.1280
basque_simulated	Basque	simulated	0.0510	0.0609	0.0633	0.0732	0.1574
basque_treated_unit	Basque	treated_unit	0.0621	0.0802	0.1009	0.0382	0.1830
smoking_random	Smoking	random	0.0898	0.0917	0.0911	0.1028	0.3237
smoking_simulated	Smoking	simulated	0.2702	0.2226	0.2562	0.3171	0.7739
smoking_treated_unit	Smoking	treated_unit	0.4089	0.2549	0.2274	0.5007	1.0099
boatlift_random	Boatlift	random	0.1220	0.1403	0.1412	0.1265	0.4168
boatlift_simulated	Boatlift	simulated	0.2963	0.4453	0.4478	0.3300	1.1575
boatlift_treated_unit	Boatlift	treated_unit	0.1590	0.3950	0.3872	0.1603	1.1584
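
As a consistency check, the summary-table entries for TROP and the scalar FE MTGP can be reproduced from the design-level RMSEs in the compact table. The arrays below are copied from its TROP and Scalar FE MTGP columns, in row order.

```python
import numpy as np

trop = np.array([
    0.0272, 0.1846, 0.1709, 0.0281, 0.0238, 0.0248, 0.0245,
    0.0261, 0.0239, 0.0258, 0.0500, 0.0464, 0.0463, 0.0510,
    0.0621, 0.0898, 0.2702, 0.4089, 0.1220, 0.2963, 0.1590,
])
scalar_fe = np.array([
    0.0498, 0.1589, 0.2548, 0.0388, 0.0358, 0.0340, 0.0333,
    0.0391, 0.0278, 0.0272, 0.0388, 0.0439, 0.0297, 0.0609,
    0.0802, 0.0917, 0.2226, 0.2549, 0.1403, 0.4453, 0.3950,
])

summary = {
    name: (round(float(v.mean()), 4), round(float(np.median(v)), 4))
    for name, v in {"TROP": trop, "SCALAR_EFFECT_FE_MTGP": scalar_fe}.items()
}
# TROP: mean 0.1029, median 0.0500; SCALAR_EFFECT_FE_MTGP: mean 0.1192, median 0.0498
scalar_fe_wins = int(np.sum(scalar_fe < trop))   # designs where scalar FE beats TROP
```

The mean is pulled up by a handful of high-RMSE designs (Smoking and Boatlift), which is why the medians sit far below the means for both methods.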

MTGP Variant Ablation

Before running the full 21-design study, a crude six-design ablation tested the relevant MTGP variants. The important result is that the improvement comes from jointly estimating a scalar effect and adding unpenalized additive fixed effects. Neither naive TROP-weighting of the old MTGP nor after-the-fact residual correction was enough. Note that TROP_WEIGHTED_MTGP_MAP returned finite estimates in only 61 of its 72 design-replication cells.

method	designs	finite_reps	mean_RMSE	median_RMSE	mean_abs_bias	mean_variance
SCALAR_EFFECT_FE_MTGP	6	72	0.0784	0.0736	0.0195	0.0084
TROP	6	72	0.0807	0.0588	0.0332	0.0090
TROP_WEIGHTED_SCALAR_FE_MTGP	6	72	0.0809	0.0704	0.0230	0.0088
MC	6	72	0.0875	0.0810	0.0415	0.0070
SDID	6	72	0.0878	0.0826	0.0375	0.0085
DIFP	6	72	0.1101	0.0897	0.0465	0.0138
SC	6	72	0.1216	0.1466	0.0524	0.0147
SCALAR_EFFECT_MTGP	6	72	0.1250	0.0774	0.0675	0.0236
DID	6	72	0.1569	0.1372	0.0899	0.0146
SPECTRAL_LAPLACIAN_MTGP	6	72	0.1887	0.0873	0.0551	0.0662
MTGPAX_MAP	6	72	0.1966	0.1890	0.0634	0.0535
TROP_WEIGHTED_SCALAR_MTGP	6	72	0.1982	0.0869	0.1609	0.0099
MTGP_TROP_RESIDUALIZED	6	72	0.2031	0.1878	0.0606	0.0575
GRAPH_LAPLACIAN_MTGP	6	72	0.2219	0.1209	0.0823	0.0781
SVD_RANK4_DYNAMIC	6	72	0.2254	0.1867	0.1578	0.0291
TROP_WEIGHTED_MTGP_MAP	6	61	0.2458	0.2414	0.1552	0.0424

Interpretation

The clean read is:

SCALAR_EFFECT_FE_MTGP is the right MTGP direction for this benchmark. It reduces mean RMSE from 0.3182 for old MTGPAX_MAP to 0.1192, fixes the earlier convergence problem, and has the lowest mean absolute bias of any method (0.0237).
TROP remains the best overall estimator on its native semi-synthetic benchmark, with mean RMSE 0.1029 and roughly half the mean variance of scalar-FE MTGP (0.0141 vs 0.0279).
The scalar-FE MTGP variants win seven of the 21 designs: CPS unemployment, all three Germany designs, Basque random, and Smoking simulated and treated-unit. They lose all three Boatlift designs, by wide margins in the harder simulated and treated-unit settings.
The remaining MTGP gap is not primarily bias. It is variance control in small or high-noise panels.

That makes the next technical target fairly specific: keep the scalar treatment-effect channel and the unpenalized additive fixed effects, but add stronger variance control for the MTGP surface. TROP’s target-local weighting is not a drop-in fix by itself; it helps in the Germany designs and in Smoking treated-unit, but overall the weighted variant has a higher mean RMSE than the unweighted scalar-FE objective (0.1229 vs 0.1192).