Panel Estimator Bakeoff

Formal estimators, synthetic checks, and the TROP semi-synthetic benchmark

Executive Read

This page is the clean memo version of the current bakeoff. It has three jobs:

  1. Define the estimators being compared.
  2. Report the synthetic panel results where the untreated path is known.
  3. Report the semi-synthetic Table 1-style benchmark from the TROP replication archive.

The headline after the latest rerun is:

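  • The original global MTGPAX_MAP is now numerically finite on the full TROP semi-synthetic study, but it is too variable: its mean variance (0.2311) is several times larger than that of any other method, and its mean RMSE is 0.3182.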
  • The useful MTGP variant is a scalar-effect objective with unpenalized additive unit and time fixed effects:

\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}. \]

  • On the 21-design, 100-replication TROP semi-synthetic benchmark, TROP still has the best mean RMSE (0.1029). The scalar-effect fixed-effect MTGP is second, essentially tied with matrix completion on mean RMSE (0.1192), with lower absolute bias but higher variance.
  • All methods in the final full rerun are finite in all 2,100 design-replication cells.

Panel Setup

Let \(Y_{it}\) denote the observed outcome for unit \(i=0,\ldots,N-1\) and period \(t=0,\ldots,T-1\); indices are zero-based throughout. Let \(W_{it}=1\) mark treated post-period cells. In the simple single-adoption case used for exposition, unit \(i=0\) is treated from period \(T_0\) onward, so

\[ W_{it} = \mathbf{1}\{i=0,\ t\ge T_0\}. \]

The target path is the missing untreated potential outcome \(Y_{0t}(0)\) for \(t\ge T_0\), and the average treatment effect on the treated is

\[ \tau_{\mathrm{ATT}} = \frac{1}{T-T_0}\sum_{t=T_0}^{T-1} \{Y_{0t} - Y_{0t}(0)\}. \]

Most estimators below either impute the missing path \(\widehat Y_{0t}(0)\) and average residuals, or directly estimate a scalar \(\widehat\tau\) inside an objective.
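
The imputation route can be sketched in a few lines. This is a minimal illustration, assuming zero-based period indices as in the ATT formula; the helper name `att_from_imputed_path` and the toy numbers are hypothetical and do not come from the bakeoff code.

```python
import numpy as np

def att_from_imputed_path(y_treated, y0_hat, t0):
    """Average post-period gap between observed and imputed untreated outcomes."""
    y_treated = np.asarray(y_treated, dtype=float)
    y0_hat = np.asarray(y0_hat, dtype=float)
    # Periods t0, ..., T-1 are the treated post-periods.
    return float(np.mean(y_treated[t0:] - y0_hat[t0:]))

# Toy treated-unit series: T = 6 periods, treatment starts at T0 = 4.
y_obs = np.array([1.0, 1.2, 1.1, 1.3, 0.6, 0.4])
y0_hat = np.array([1.0, 1.2, 1.1, 1.3, 1.4, 1.2])  # imputed untreated path
att_hat = att_from_imputed_path(y_obs, y0_hat, t0=4)
```

Only the post-period entries of the imputed path enter the estimate, so pre-period fit matters only through how it shapes the imputation.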

Estimators

MTGPAX MAP

The original MTGPAX estimator is a low-rank multi-task GP outcome model for the untreated surface. In its current MAP implementation,

\[ m_{it} = \nu_i + \sum_{r=1}^R \beta_{ir} u_{rt}, \qquad u_r \sim \mathcal{GP}(0, K_t). \]

It is fit only on the observed untreated cells

\[ \mathcal{O}= \{(i,t): W_{it}=0\}, \]

then estimates effects by post-treatment residuals:

\[ \widehat\tau_{\mathrm{MTGPAX}} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} \{Y_{it}-\widehat m_{it}\}, \]

where \(\mathcal{M}=\{(i,t): W_{it}=1\}\). The model now uses bounded scale transforms, gradient clipping, parameter clipping, and nonfinite-gradient sanitization. Those changes fixed the nonfinite optimizer issue in the final full rerun, but the estimator remains high-variance on the TROP benchmark.
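
To make the model structure concrete, the sketch below draws one surface \(m_{it}=\nu_i+\sum_r\beta_{ir}u_{rt}\) from the prior and applies the residual-averaging step. It is not the MAP fit: the squared-exponential kernel for \(K_t\), the lengthscale, and all sizes are assumptions for illustration, and the true surface stands in for the fitted \(\widehat m\).

```python
import numpy as np

def rbf_kernel(times, lengthscale=5.0, jitter=1e-6):
    """Squared-exponential kernel over time (an assumed choice for K_t)."""
    d = times[:, None] - times[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2) + jitter * np.eye(len(times))

def residual_att(Y, m_hat, W):
    """Average of Y - m_hat over the treated cells W."""
    return float(np.mean(Y[W] - m_hat[W]))

rng = np.random.default_rng(0)
N, T, R, T0 = 8, 30, 2, 24
K_t = rbf_kernel(np.arange(T, dtype=float))

# Draw R latent GP time paths u_r ~ GP(0, K_t), stored as an (R, T) array.
u = (np.linalg.cholesky(K_t) @ rng.standard_normal((T, R))).T
nu = rng.standard_normal(N)           # unit intercepts
beta = rng.standard_normal((N, R))    # unit loadings
m = nu[:, None] + beta @ u            # untreated surface, shape (N, T)

# Unit 0 is treated from T0 onward with a constant effect.
W = np.zeros((N, T), dtype=bool)
W[0, T0:] = True
tau_true = -0.75
Y = m + 0.1 * rng.standard_normal((N, T)) + tau_true * W

# Oracle check: with the true surface, the residual average is tau plus noise.
tau_oracle = residual_att(Y, m, W)
```

With a fitted surface in place of `m`, any extrapolation error in the post-period cells passes straight into the estimate, which is one way a residual-path estimator can become high-variance.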

Scalar-Effect Fixed-Effect MTGP

The successful MTGP variant estimates the treatment effect jointly instead of fitting a full post-period residual path. Its likelihood includes the treated post-period cells and forces the treatment signal through one scalar:

\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}, \]

with

\[ m_{it} = \sum_{r=1}^R \beta_{ir}u_{rt}, \qquad u_r\sim\mathcal{GP}(0,K_t). \]

The implemented MAP objective is

\[ \min_{a,b,\beta,u,\tau} \sum_{it} q_{it} \{Y_{it}-a_i-b_t-m_{it}-\tau W_{it}\}^2 + \lambda_m \mathcal{P}_{\mathrm{MTGP}}(m), \]

where \(a_i\) and \(b_t\) are centered for identification and left unpenalized. The scalar \(\tau\) is also unpenalized by default. Only the MTGP surface terms are regularized. This matters: in the crude ablation, penalizing \(\tau\) or omitting the additive fixed effects materially worsened the results.

The reported row SCALAR_EFFECT_FE_MTGP uses \(q_{it}=1\). The row TROP_WEIGHTED_SCALAR_FE_MTGP uses TROP-style product weights in the same objective:

\[ q_{it} = \omega_i \theta_t. \]
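
The role of the unpenalized terms can be seen by holding the surface \(m\) fixed: the objective then reduces to a weighted least-squares problem in \((a, b, \tau)\). The sketch below solves only that inner step, not the full MAP fit. The function name `fit_tau_given_surface` is hypothetical, and it uses a minimum-norm least-squares solve instead of explicit centering; \(\tau\) is unaffected by that choice because the unidentified direction only shifts a constant between \(a\) and \(b\).

```python
import numpy as np

def fit_tau_given_surface(Y, W, m=None, q=None):
    """Weighted least squares for (a_i, b_t, tau) with the surface m held fixed."""
    N, T = Y.shape
    m = np.zeros((N, T)) if m is None else m
    q = np.ones((N, T)) if q is None else q
    unit_idx, time_idx = np.meshgrid(np.arange(N), np.arange(T), indexing="ij")
    rows = np.arange(N * T)
    X = np.zeros((N * T, N + T + 1))
    X[rows, unit_idx.ravel()] = 1.0        # unit dummies (a_i)
    X[rows, N + time_idx.ravel()] = 1.0    # time dummies (b_t)
    X[:, -1] = W.ravel()                   # treatment indicator (tau)
    sw = np.sqrt(q.ravel())                # sqrt weights so squared loss is q-weighted
    coef, *_ = np.linalg.lstsq(X * sw[:, None], (Y - m).ravel() * sw, rcond=None)
    return float(coef[-1])

rng = np.random.default_rng(1)
N, T, T0 = 6, 12, 9
a, b = rng.standard_normal(N), rng.standard_normal(T)
W = np.zeros((N, T))
W[0, T0:] = 1.0
Y = a[:, None] + b[None, :] + 0.5 * W      # noiseless panel with tau = 0.5, m = 0

tau_unweighted = fit_tau_given_surface(Y, W)                 # q_it = 1
omega, theta = rng.uniform(0.5, 1.5, N), rng.uniform(0.5, 1.5, T)
tau_weighted = fit_tau_given_surface(Y, W, q=np.outer(omega, theta))
```

On noiseless additive data both weightings recover the same \(\tau\); the weights change the answer only when the additive-plus-surface model is misspecified or noisy.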

TROP, SDID, SC, DID, MC, and DIFP

The semi-synthetic section calls the six estimators in the ZBW TROP replication archive directly. The main TROP row solves a weighted two-way fixed-effect plus nuclear-norm problem with treatment included as a scalar regressor:

\[ (\widehat\mu,\widehat\alpha,\widehat\gamma,\widehat\tau,\widehat L) = \arg\min_{\mu,\alpha,\gamma,\tau,L} \sum_{i,t} \left[ \delta_{it} \{Y_{it}-\mu-\alpha_i-\gamma_t-\tau W_{it}-L_{it}\} \right]^2 + \lambda_{\mathrm{nn}}\|L\|_*. \]

Here \(\delta_{it}\) is the product of exponential unit and time localization weights. The archive multiplies residuals by \(\delta_{it}\) before squaring, so the effective squared-loss weight is \(\delta_{it}^2\). Tuning follows the archive’s placebo cross-validation routine over unit localization, time localization, and nuclear-norm penalty.
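
The weighting convention can be checked numerically. The sketch below evaluates a TROP-style objective and confirms that multiplying residuals by \(\delta_{it}\) before squaring is the same as weighting squared residuals by \(\delta_{it}^2\). The exponential-in-distance form and the names `unit_dist`, `time_dist`, `lam_unit`, and `lam_time` are assumptions for illustration; the archive's actual distance definitions are not reproduced here.

```python
import numpy as np

def localization_weights(unit_dist, time_dist, lam_unit, lam_time):
    """delta_it as a product of exponential unit and time weights (assumed form)."""
    omega = np.exp(-lam_unit * np.asarray(unit_dist, dtype=float))
    theta = np.exp(-lam_time * np.asarray(time_dist, dtype=float))
    return np.outer(omega, theta)

def trop_style_objective(Y, W, mu, alpha, gamma, tau, L, delta, lam_nn):
    """Sum of (delta * residual)^2 plus a nuclear-norm penalty on L."""
    resid = Y - mu - alpha[:, None] - gamma[None, :] - tau * W - L
    nuclear_norm = np.linalg.svd(L, compute_uv=False).sum()
    return float(np.sum((delta * resid) ** 2) + lam_nn * nuclear_norm)

rng = np.random.default_rng(2)
N, T = 5, 8
Y = rng.standard_normal((N, T))
W = np.zeros((N, T))
W[0, 6:] = 1.0

# Hypothetical distances: unit 0 is the target, the last period is the target period.
delta = localization_weights(np.arange(N), np.arange(T)[::-1], lam_unit=0.3, lam_time=0.1)

loss = trop_style_objective(Y, W, 0.0, np.zeros(N), np.zeros(T), 0.0,
                            np.zeros((N, T)), delta, lam_nn=1.0)
effective = float(np.sum(delta ** 2 * Y ** 2))   # same loss written with delta^2 weights

# Setting both decay rates to zero gives uniform weights.
uniform = localization_weights(np.arange(N), np.arange(T), 0.0, 0.0)
```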

The related archive rows are:

  • SDID: synthetic difference-in-differences style two-way balancing.
  • SC: synthetic control.
  • DID: two-way difference-in-differences.
  • MC: the same weighted TWFE/nuclear-norm solver with localization switched off.
  • DIFP: the archive’s factor/imputation comparator.
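
For orientation, the simplest row in this list can be written out directly. The sketch below is the textbook double difference of means for a block design, where all treated units adopt at the same period; it is not the archive's DID code, and the function name `did_block` is hypothetical.

```python
import numpy as np

def did_block(Y, treated_units, t0):
    """(treated - control) post-period gap minus the same gap in the pre-period."""
    Y = np.asarray(Y, dtype=float)
    is_treated = np.zeros(Y.shape[0], dtype=bool)
    is_treated[treated_units] = True
    post_gap = Y[is_treated, t0:].mean() - Y[~is_treated, t0:].mean()
    pre_gap = Y[is_treated, :t0].mean() - Y[~is_treated, :t0].mean()
    return float(post_gap - pre_gap)

rng = np.random.default_rng(4)
N, T, T0 = 10, 15, 11
a, b = rng.standard_normal(N), rng.standard_normal(T)
W = np.zeros((N, T))
W[:2, T0:] = 1.0                       # units 0 and 1 treated from T0 onward
Y = a[:, None] + b[None, :] + 0.3 * W  # noiseless additive panel, tau = 0.3
tau_did = did_block(Y, treated_units=[0, 1], t0=T0)
```

On a purely additive panel the unit and time effects difference out exactly, which is why DID does well only when interactive structure is weak.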

The main conceptual distinction is that TROP is target-local and diagonal in its weighting, while MTGPAX is a coherent global outcome model. The scalar-effect fixed-effect MTGP is an attempt to keep the MTGP surface while adopting the low-dimensional treatment-effect channel and unpenalized additive structure that help TROP and SDID.

Other Synthetic-Only Comparators

The earlier synthetic path bakeoff also includes Graph-Laplacian MTGP, Spectral Laplacian MTGP, Dynamax factor SSM, SYNTH-lite, SDID-lite, and TROP-lite. Those are diagnostic comparators and are not part of the final TROP archive run. They are useful for understanding whether graph locality, forecasting, or TROP-style residual balancing helps on a known synthetic factor DGP.

Synthetic Data Bakeoff

The synthetic panel DGP is a two-factor untreated surface with nonlinear time curvature. The treated unit has high factor loadings, so simple donor averaging and flat extrapolation are deliberately stressed. Here the untreated post-treatment path is known, so we can score both ATT error and pointwise counterfactual path error.
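
The two scores measure different things, which the following sketch illustrates. The function name `score_counterfactual` and the toy series are hypothetical. In the example the imputed path is wrong at every post-period point, yet its errors cancel in the average, so the ATT error is zero while the path RMSE is not.

```python
import numpy as np

def score_counterfactual(y_obs, y0_true, y0_hat, t0):
    """ATT error and pointwise counterfactual RMSE over the post-period."""
    y_obs, y0_true, y0_hat = (
        np.asarray(v, dtype=float)[t0:] for v in (y_obs, y0_true, y0_hat)
    )
    att_true = np.mean(y_obs - y0_true)
    att_hat = np.mean(y_obs - y0_hat)
    cf_rmse = np.sqrt(np.mean((y0_hat - y0_true) ** 2))
    return {"att_error": float(att_hat - att_true), "cf_rmse": float(cf_rmse)}

y0_true = np.array([0.0, 1.0, 2.0, 3.0])   # known untreated path
y_obs = np.array([0.0, 1.0, 1.0, 2.0])     # effect of -1 in the two post-periods
y0_hat = np.array([0.0, 1.0, 2.5, 2.5])    # imputed path: +0.5 then -0.5 off
scores = score_counterfactual(y_obs, y0_true, y0_hat, t0=2)
```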

For the displayed seed:

| quantity | value |
|---|---|
| true ATT | -0.7525 |
| MTGPAX ATT | -0.5066 |
| Graph-Laplacian MTGP ATT | -0.4766 |
| Spectral Laplacian MTGP ATT | -0.1018 |
| Dynamax factor SSM ATT | -0.6924 |
| TROP nuclear ATT | 0.3903 |
| MTGPAX CF RMSE | 0.2614 |
| Graph-Laplacian MTGP CF RMSE | 0.3053 |
| Spectral Laplacian MTGP CF RMSE | 0.7113 |
| Dynamax factor SSM CF RMSE | 0.1196 |
| TROP nuclear CF RMSE | 1.1585 |

Across 20 seeds:

| method | ATT RMSE | mean abs ATT error | mean CF RMSE |
|---|---|---|---|
| MTGPAX | 0.1200 | 0.0926 | 0.1279 |
| Dynamax factor SSM | 0.1431 | 0.1109 | 0.1613 |
| Graph-Laplacian MTGP | 0.3042 | 0.2589 | 0.2947 |
| TROP-lite | 0.4049 | 0.3719 | 0.3966 |
| Spectral Laplacian MTGP | 0.6235 | 0.5309 | 0.5780 |
| TROP nuclear | 0.9234 | 0.8602 | 0.8846 |
| SDID-lite | 1.0026 | 0.8910 | 0.9343 |
| SYNTH-lite | 1.3280 | 1.1615 | 1.1999 |

On this specific factor DGP, the original MTGPAX outcome model is the best of the earlier synthetic-path estimators. That result should not be overread: this DGP is friendly to a global low-rank GP surface. The TROP semi-synthetic benchmark below is a different test, built from empirical panels and target-local assignment mechanisms.

TROP Semi-Synthetic Setup

The TROP-paper-style benchmark follows the ZBW replication archive. There are 21 designs: CPS, PWT, Germany, Basque, Smoking, and Boatlift panels crossed with policy, random, simulated, or original-treated-unit assignment regimes. Each design uses 100 Monte Carlo replications, for 2,100 simulated panels.

For each source panel, the observed outcome matrix is normalized and decomposed with a rank-4 SVD. The normalized matrix is then split into three pieces, two from the rank-4 reconstruction and one from what it leaves behind:

  • \(F\): additive unit and time structure from row and column means;
  • \(M\): the remaining rank-4 interactive component after removing \(F\);
  • \(E\): residuals from the rank-4 reconstruction.

The residuals \(E\) are used to fit an AR(2) time-series covariance. A simulated panel is drawn as

\[ Y = F + M + \varepsilon, \]

where \(\varepsilon\) is drawn independently by unit from the fitted AR(2) covariance. No treatment effect is added, so the true effect is zero in every replication. The estimator’s scalar effect estimate is therefore its estimation error.
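
The construction can be sketched end to end. This is a simplified illustration under stated assumptions, not the archive's code: the input is taken as already normalized, \(F\) is formed as row mean plus column mean minus grand mean of the rank-4 reconstruction, the AR(2) coefficients come from a pooled least-squares regression, and noise is drawn by running the AR(2) recursion with a burn-in instead of sampling from a fitted covariance matrix. All function names are hypothetical.

```python
import numpy as np

def decompose_panel(Y, rank=4):
    """Split Y into additive part F, interactive part M, and residual E."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]          # rank-4 reconstruction
    F = L.mean(axis=1, keepdims=True) + L.mean(axis=0, keepdims=True) - L.mean()
    return F, L - F, Y - L

def fit_ar2(E):
    """Pooled least-squares AR(2) fit across units; returns (phi, innovation sd)."""
    y = E[:, 2:].ravel()
    X = np.column_stack([E[:, 1:-1].ravel(), E[:, :-2].ravel()])  # lags 1 and 2
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    return phi, float(np.std(y - X @ phi))

def simulate_panel(F, M, phi, sigma, rng, burn=50):
    """Draw Y = F + M + eps with AR(2) noise that is independent across units."""
    N, T = F.shape
    shocks = sigma * rng.standard_normal((N, T + burn))
    e = np.zeros((N, T + burn))
    for t in range(2, T + burn):
        e[:, t] = phi[0] * e[:, t - 1] + phi[1] * e[:, t - 2] + shocks[:, t]
    return F + M + e[:, burn:]                         # discard the burn-in periods

rng = np.random.default_rng(3)
N, T = 30, 40
# Stand-in "source panel": rank-3 signal plus white noise.
Y = rng.standard_normal((N, 3)) @ rng.standard_normal((3, T)) + 0.2 * rng.standard_normal((N, T))
F, M, E = decompose_panel(Y)
phi, sigma = fit_ar2(E)
Y_sim = simulate_panel(F, M, phi, sigma, rng)          # no treatment effect added
```

Because no effect is added to `Y_sim`, any estimator's \(\widehat\tau\) on it can be scored directly as an error.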

The assignment mechanisms are:

  • policy: estimate treatment propensities from original assignment on rank-4 unit factors, then sample treated units, capped at 10;
  • random: sample 10 treated units uniformly at random;
  • simulated: sample from propensities based on the archive’s synthetic-control weights for the originally treated unit;
  • treated_unit: keep the original treated unit fixed and redraw AR(2) noise.

All designs use the final 10 periods as treated periods. The archive estimators return scalar treatment-effect estimates. MTGPAX_MAP is the old residual-path MTGP, with residuals averaged over the treated post-period cells. The two scalar-effect MTGP variants return the jointly estimated \(\widehat\tau\).
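
As an illustration of the simplest regime, the random mechanism combined with the final-10-period window produces a block treatment matrix like the one built below. The function name `random_assignment` and the panel size are hypothetical; the propensity-based regimes would replace the uniform draw with a weighted one.

```python
import numpy as np

def random_assignment(n_units, n_periods, rng, n_treated=10, n_post=10):
    """W_it = 1 for uniformly sampled treated units in the final n_post periods."""
    treated = rng.choice(n_units, size=n_treated, replace=False)
    W = np.zeros((n_units, n_periods), dtype=int)
    W[treated, n_periods - n_post:] = 1
    return W

rng = np.random.default_rng(5)
W = random_assignment(n_units=50, n_periods=40, rng=rng)
```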

Semi-Synthetic Results

The final full rerun has no nonfinite estimates: every estimator below has 2,100/2,100 finite values.

| method | designs | finite_reps | mean_RMSE | median_RMSE | mean_abs_bias | mean_variance |
|---|---|---|---|---|---|---|
| TROP | 21 | 2100 | 0.1029 | 0.0500 | 0.0384 | 0.0141 |
| SCALAR_EFFECT_FE_MTGP | 21 | 2100 | 0.1192 | 0.0498 | 0.0237 | 0.0279 |
| MC | 21 | 2100 | 0.1192 | 0.0546 | 0.0529 | 0.0162 |
| TROP_WEIGHTED_SCALAR_FE_MTGP | 21 | 2100 | 0.1229 | 0.0632 | 0.0358 | 0.0261 |
| SDID | 21 | 2100 | 0.1343 | 0.0724 | 0.0606 | 0.0228 |
| DIFP | 21 | 2100 | 0.1415 | 0.0736 | 0.0589 | 0.0257 |
| SC | 21 | 2100 | 0.1433 | 0.0812 | 0.0726 | 0.0185 |
| DID | 21 | 2100 | 0.2130 | 0.1687 | 0.1220 | 0.0343 |
| MTGPAX_MAP | 21 | 2100 | 0.3182 | 0.1280 | 0.0465 | 0.2311 |
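
Because the true effect is zero, per-design metrics follow directly from the replication estimates. The sketch below shows one way to compute them. The function name `design_metrics` is hypothetical, and the sample variance with an \(n-1\) denominator is an assumption, since this memo does not state the convention.

```python
import numpy as np

def design_metrics(tau_hats):
    """RMSE, bias, and variance of estimates when the true effect is zero."""
    est = np.asarray(tau_hats, dtype=float)
    est = est[np.isfinite(est)]               # count only finite replications
    return {
        "finite_reps": int(est.size),
        "rmse": float(np.sqrt(np.mean(est ** 2))),
        "bias": float(est.mean()),
        "variance": float(est.var(ddof=1)),   # assumed n-1 convention
    }

rng = np.random.default_rng(6)
stats = design_metrics(rng.normal(loc=0.02, scale=0.1, size=100))

# Identity linking the three columns: rmse^2 = bias^2 + (n-1)/n * variance.
n = stats["finite_reps"]
rmse_sq_from_parts = stats["bias"] ** 2 + (n - 1) / n * stats["variance"]
```

As a rough consistency check, the Smoking treated_unit winner row reports bias -0.0500 and variance 0.0497, and \(\sqrt{0.0500^2 + 0.99\times 0.0497}\approx 0.2274\) matches its reported RMSE. That agrees with the \(n-1\) convention at 100 replications but does not confirm it.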

The design-level winners are:

| key | dataset | assignment | method | rmse | bias | variance |
|---|---|---|---|---|---|---|
| basque_random | Basque | random | SCALAR_EFFECT_FE_MTGP | 0.0297 | 0.0026 | 0.0009 |
| basque_simulated | Basque | simulated | TROP | 0.0510 | -0.0038 | 0.0026 |
| basque_treated_unit | Basque | treated_unit | SC | 0.0344 | 0.0015 | 0.0012 |
| boatlift_random | Boatlift | random | TROP | 0.1220 | 0.0255 | 0.0144 |
| boatlift_simulated | Boatlift | simulated | TROP | 0.2963 | 0.0446 | 0.0867 |
| boatlift_treated_unit | Boatlift | treated_unit | SC | 0.1453 | 0.0217 | 0.0208 |
| cps_hours_minwage | CPS | policy | MC | 0.1668 | 0.0560 | 0.0249 |
| cps_logwage_abortion | CPS | policy | TROP | 0.0238 | 0.0037 | 0.0006 |
| cps_logwage_gunlaw | CPS | policy | TROP | 0.0281 | 0.0036 | 0.0008 |
| cps_logwage_minwage | CPS | policy | TROP | 0.0272 | 0.0083 | 0.0007 |
| cps_logwage_random | CPS | random | TROP | 0.0248 | -0.0036 | 0.0006 |
| cps_urate_minwage | CPS | policy | SCALAR_EFFECT_FE_MTGP | 0.1589 | 0.0368 | 0.0241 |
| germany_random | Germany | random | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0250 | 0.0049 | 0.0006 |
| germany_simulated | Germany | simulated | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0366 | -0.0027 | 0.0013 |
| germany_treated_unit | Germany | treated_unit | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0386 | -0.0153 | 0.0013 |
| pwt_loggdp_democracy | PWT | policy | TROP | 0.0245 | 0.0107 | 0.0005 |
| pwt_loggdp_education | PWT | policy | TROP | 0.0261 | 0.0086 | 0.0006 |
| pwt_loggdp_random | PWT | random | TROP | 0.0239 | -0.0040 | 0.0006 |
| smoking_random | Smoking | random | TROP | 0.0898 | 0.0037 | 0.0081 |
| smoking_simulated | Smoking | simulated | SCALAR_EFFECT_FE_MTGP | 0.2226 | 0.0098 | 0.0500 |
| smoking_treated_unit | Smoking | treated_unit | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.2274 | -0.0500 | 0.0497 |

The compact design-level comparison below reports RMSE by design for the main methods:

| key | dataset | assignment | TROP | Scalar FE MTGP | TROP-weighted scalar FE MTGP | MC | old MTGPAX MAP |
|---|---|---|---|---|---|---|---|
| cps_logwage_minwage | CPS | policy | 0.0272 | 0.0498 | 0.0632 | 0.0371 | 0.0720 |
| cps_urate_minwage | CPS | policy | 0.1846 | 0.1589 | 0.1819 | 0.2280 | 0.2023 |
| cps_hours_minwage | CPS | policy | 0.1709 | 0.2548 | 0.2523 | 0.1668 | 0.3193 |
| cps_logwage_gunlaw | CPS | policy | 0.0281 | 0.0388 | 0.0431 | 0.0355 | 0.1240 |
| cps_logwage_abortion | CPS | policy | 0.0238 | 0.0358 | 0.0419 | 0.0303 | 0.0749 |
| cps_logwage_random | CPS | random | 0.0248 | 0.0340 | 0.0447 | 0.0303 | 0.0938 |
| pwt_loggdp_democracy | PWT | policy | 0.0245 | 0.0333 | 0.0324 | 0.0435 | 0.0683 |
| pwt_loggdp_education | PWT | policy | 0.0261 | 0.0391 | 0.0308 | 0.0426 | 0.0835 |
| pwt_loggdp_random | PWT | random | 0.0239 | 0.0278 | 0.0349 | 0.0331 | 0.0491 |
| germany_random | Germany | random | 0.0258 | 0.0272 | 0.0250 | 0.0331 | 0.0688 |
| germany_simulated | Germany | simulated | 0.0500 | 0.0388 | 0.0366 | 0.0670 | 0.0964 |
| germany_treated_unit | Germany | treated_unit | 0.0464 | 0.0439 | 0.0386 | 0.0523 | 0.1217 |
| basque_random | Basque | random | 0.0463 | 0.0297 | 0.0405 | 0.0546 | 0.1280 |
| basque_simulated | Basque | simulated | 0.0510 | 0.0609 | 0.0633 | 0.0732 | 0.1574 |
| basque_treated_unit | Basque | treated_unit | 0.0621 | 0.0802 | 0.1009 | 0.0382 | 0.1830 |
| smoking_random | Smoking | random | 0.0898 | 0.0917 | 0.0911 | 0.1028 | 0.3237 |
| smoking_simulated | Smoking | simulated | 0.2702 | 0.2226 | 0.2562 | 0.3171 | 0.7739 |
| smoking_treated_unit | Smoking | treated_unit | 0.4089 | 0.2549 | 0.2274 | 0.5007 | 1.0099 |
| boatlift_random | Boatlift | random | 0.1220 | 0.1403 | 0.1412 | 0.1265 | 0.4168 |
| boatlift_simulated | Boatlift | simulated | 0.2963 | 0.4453 | 0.4478 | 0.3300 | 1.1575 |
| boatlift_treated_unit | Boatlift | treated_unit | 0.1590 | 0.3950 | 0.3872 | 0.1603 | 1.1584 |

MTGP Variant Ablation

Before running the full 21-design study, a six-design crude ablation tested the relevant MTGP variants. The important result is that the improvement comes from jointly estimating a scalar effect and adding unpenalized additive fixed effects. Naive TROP-weighting of the old MTGP, or residual correction after the fact, was not enough. TROP_WEIGHTED_MTGP_MAP is also the only row with fewer finite estimates than the rest (61, against 72 for every other method).

| method | designs | finite_reps | mean_RMSE | median_RMSE | mean_abs_bias | mean_variance | total_errors |
|---|---|---|---|---|---|---|---|
| SCALAR_EFFECT_FE_MTGP | 6 | 72 | 0.0784 | 0.0736 | 0.0195 | 0.0084 | 0 |
| TROP | 6 | 72 | 0.0807 | 0.0588 | 0.0332 | 0.0090 | 0 |
| TROP_WEIGHTED_SCALAR_FE_MTGP | 6 | 72 | 0.0809 | 0.0704 | 0.0230 | 0.0088 | 0 |
| MC | 6 | 72 | 0.0875 | 0.0810 | 0.0415 | 0.0070 | 0 |
| SDID | 6 | 72 | 0.0878 | 0.0826 | 0.0375 | 0.0085 | 0 |
| DIFP | 6 | 72 | 0.1101 | 0.0897 | 0.0465 | 0.0138 | 0 |
| SC | 6 | 72 | 0.1216 | 0.1466 | 0.0524 | 0.0147 | 0 |
| SCALAR_EFFECT_MTGP | 6 | 72 | 0.1250 | 0.0774 | 0.0675 | 0.0236 | 0 |
| DID | 6 | 72 | 0.1569 | 0.1372 | 0.0899 | 0.0146 | 0 |
| SPECTRAL_LAPLACIAN_MTGP | 6 | 72 | 0.1887 | 0.0873 | 0.0551 | 0.0662 | 0 |
| MTGPAX_MAP | 6 | 72 | 0.1966 | 0.1890 | 0.0634 | 0.0535 | 0 |
| TROP_WEIGHTED_SCALAR_MTGP | 6 | 72 | 0.1982 | 0.0869 | 0.1609 | 0.0099 | 0 |
| MTGP_TROP_RESIDUALIZED | 6 | 72 | 0.2031 | 0.1878 | 0.0606 | 0.0575 | 0 |
| GRAPH_LAPLACIAN_MTGP | 6 | 72 | 0.2219 | 0.1209 | 0.0823 | 0.0781 | 0 |
| SVD_RANK4_DYNAMIC | 6 | 72 | 0.2254 | 0.1867 | 0.1578 | 0.0291 | 0 |
| TROP_WEIGHTED_MTGP_MAP | 6 | 61 | 0.2458 | 0.2414 | 0.1552 | 0.0424 | 0 |

Interpretation

The clean read is:

  • SCALAR_EFFECT_FE_MTGP is the right MTGP direction for this benchmark. It cuts mean RMSE from 0.3182 for the old MTGPAX_MAP to 0.1192, does not show the earlier convergence problem (all 2,100 cells are finite), and has the lowest mean absolute bias of any method (0.0237).
  • TROP remains the best overall estimator on its native semi-synthetic benchmark, with mean RMSE 0.1029 and roughly half the mean variance of scalar-FE MTGP (0.0141 versus 0.0279).
  • The scalar-FE MTGP variants win seven of the 21 designs: CPS unemployment, all three Germany designs, Basque random, and Smoking simulated and treated-unit. They lose clearly on the harder Boatlift simulated and treated-unit settings, where their RMSE is roughly 1.5 to 2.5 times TROP's.
  • The remaining MTGP gap is not primarily bias. It is variance control in small or high-noise panels.

That makes the next technical target fairly specific: keep the scalar treatment-effect channel and the unpenalized additive fixed effects, but add stronger variance control for the MTGP surface. TROP’s target-local weighting is not a drop-in fix by itself; it helps in some Germany/Smoking settings but does not dominate the unweighted scalar-FE objective.