Panel Estimator Bakeoff
Formal estimators, synthetic checks, and the TROP semi-synthetic benchmark
Executive Read
This page is the clean memo version of the current bakeoff. It has three jobs:
- Define the estimators being compared.
- Report the synthetic panel results where the untreated path is known.
- Report the semi-synthetic Table 1-style benchmark from the TROP replication archive.
The headline after the latest rerun is:
- The original global MTGPAX_MAP is now numerically finite on the full TROP semi-synthetic study, but it is too variable.
- The useful MTGP variant is a scalar-effect objective with unpenalized additive unit and time fixed effects:
\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}. \]
- On the 21-design, 100-replication TROP semi-synthetic benchmark, TROP still has the best mean RMSE (0.1029). The scalar-effect fixed-effect MTGP is second, essentially tied with matrix completion on mean RMSE (0.1192), with lower absolute bias but higher variance.
- All methods in the final full rerun are finite in all 2,100 design-replication cells.
Panel Setup
Let \(Y_{it}\) denote the observed outcome for unit \(i=0,\ldots,N-1\) and period \(t=0,\ldots,T-1\). Let \(W_{it}=1\) mark treated post-period cells. In the simple single-adoption case used for exposition, unit \(i=0\) is treated from period \(T_0\) onward, so
\[ W_{it} = \mathbf{1}\{i=0,\ t\ge T_0\}. \]
The target path is the missing untreated potential outcome \(Y_{0t}(0)\) for \(t\ge T_0\), and the average treatment effect on the treated is
\[ \tau_{\mathrm{ATT}} = \frac{1}{T-T_0}\sum_{t=T_0}^{T-1} \{Y_{0t} - Y_{0t}(0)\}. \]
Most estimators below either impute the missing path \(\widehat Y_{0t}(0)\) and average residuals, or directly estimate a scalar \(\widehat\tau\) inside an objective.
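As a short worked identity for the imputation route, write \(\widehat\tau\) for the average of post-period residuals against an imputed path (notation introduced here for illustration). Subtracting the definition of \(\tau_{\mathrm{ATT}}\) term by term, the observed outcomes cancel and the ATT error is the mean imputation error over the post-period:
\[ \widehat\tau - \tau_{\mathrm{ATT}} = \frac{1}{T-T_0}\sum_{t=T_0}^{T-1} \{Y_{0t}(0) - \widehat Y_{0t}(0)\}. \]
Pointwise errors of opposite sign cancel in this average, so the absolute ATT error cannot exceed the root-mean-square path error over the same periods, and it can be much smaller.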
Estimators
MTGPAX MAP
The original MTGPAX estimator is a low-rank multi-task GP outcome model for the untreated surface. In its current MAP implementation,
\[ m_{it} = \nu_i + \sum_{r=1}^R \beta_{ir} u_{rt}, \qquad u_r \sim \mathcal{GP}(0, K_t). \]
It fits only observed untreated cells
\[ \mathcal{O}= \{(i,t): W_{it}=0\}, \]
then estimates effects by post-treatment residuals:
\[ \widehat\tau_{\mathrm{MTGPAX}} = \frac{1}{|\mathcal{M}|}\sum_{(i,t)\in\mathcal{M}} \{Y_{it}-\widehat m_{it}\}, \]
where \(\mathcal{M}=\{(i,t): W_{it}=1\}\). The model now uses bounded scale transforms, gradient clipping, parameter clipping, and nonfinite-gradient sanitization. Those changes fixed the nonfinite optimizer issue in the final full rerun, but the estimator remains high-variance on the TROP benchmark.
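The stabilization steps named above can be sketched generically. The following is a minimal NumPy illustration, not the MTGPAX code: the function names, the clipping threshold, and the scale bounds are all hypothetical, and the actual implementation may order or parameterize these steps differently.

```python
import numpy as np

def sanitize_and_clip(grad, max_norm=1.0):
    """Zero out nonfinite gradient entries, then rescale so the
    Euclidean norm is at most max_norm (hypothetical threshold)."""
    g = np.nan_to_num(np.asarray(grad, dtype=float),
                      nan=0.0, posinf=0.0, neginf=0.0)
    norm = np.linalg.norm(g)
    if norm > max_norm:
        g = g * (max_norm / norm)
    return g

def bounded_scale(raw, low=1e-3, high=10.0):
    """Map an unconstrained parameter into [low, high] with a sigmoid.
    tanh form of the sigmoid avoids overflow for large |raw|."""
    sig = 0.5 * (1.0 + np.tanh(0.5 * np.asarray(raw, dtype=float)))
    return low + (high - low) * sig

# A gradient with NaN and inf entries becomes finite with norm <= 1.
g = sanitize_and_clip(np.array([3.0, np.nan, 4.0, np.inf]))
```

Zeroing nonfinite entries keeps the optimizer moving on the remaining coordinates, at the cost of silently ignoring whatever produced the NaN; it addresses finiteness, not estimator variance.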
Scalar-Effect Fixed-Effect MTGP
The successful MTGP variant estimates the treatment effect jointly instead of fitting a full post-period residual path. Its likelihood includes the treated post-period cells and forces the treatment signal through one scalar:
\[ Y_{it} = a_i + b_t + m_{it} + \tau W_{it} + \varepsilon_{it}, \]
with
\[ m_{it} = \sum_{r=1}^R \beta_{ir}u_{rt}, \qquad u_r\sim\mathcal{GP}(0,K_t). \]
The implemented MAP objective is
\[ \min_{a,b,\beta,u,\tau} \sum_{it} q_{it} \{Y_{it}-a_i-b_t-m_{it}-\tau W_{it}\}^2 + \lambda_m \mathcal{P}_{\mathrm{MTGP}}(m), \]
where \(a_i\) and \(b_t\) are centered for identification and left unpenalized. The scalar \(\tau\) is also unpenalized by default. Only the MTGP surface terms are regularized. This matters: penalizing \(\tau\) or omitting the additive fixed effects materially worsened results in the crude ablation.
The reported row SCALAR_EFFECT_FE_MTGP uses \(q_{it}=1\). The row TROP_WEIGHTED_SCALAR_FE_MTGP uses TROP-style product weights in the same objective:
\[ q_{it} = \omega_i \theta_t. \]
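To make the scalar-effect channel concrete, here is a minimal sketch of the same weighted objective with the MTGP surface \(m_{it}\) dropped, which reduces it to weighted two-way fixed-effect least squares. The omission of \(m_{it}\) is a simplifying assumption for illustration, and the function name scalar_effect_twfe is hypothetical. Instead of centering \(a_i\) and \(b_t\), the sketch relies on the minimum-norm least-squares solution; \(\tau\) is unaffected by that choice as long as \(W\) is not itself additive in unit and time.

```python
import numpy as np

def scalar_effect_twfe(Y, W, q=None):
    """Weighted least squares for Y_it = a_i + b_t + tau * W_it + eps_it.
    Assumption: the MTGP surface m_it is omitted in this sketch."""
    N, T = Y.shape
    q = np.ones((N, T)) if q is None else np.asarray(q, dtype=float)
    rows = np.repeat(np.arange(N), T)   # unit index of each flattened cell
    cols = np.tile(np.arange(T), N)     # period index of each flattened cell
    cell = np.arange(N * T)
    X = np.zeros((N * T, N + T + 1))
    X[cell, rows] = 1.0                 # unit fixed effects a_i
    X[cell, N + cols] = 1.0             # time fixed effects b_t
    X[:, -1] = np.asarray(W, dtype=float).ravel()  # scalar treatment channel
    sw = np.sqrt(q.ravel())             # WLS via row scaling by sqrt(q_it)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], Y.ravel() * sw, rcond=None)
    return coef[-1]

# Noise-free toy panel with a known effect of 0.7 on unit 0.
rng = np.random.default_rng(0)
N, T, T0 = 8, 12, 9
W = np.zeros((N, T))
W[0, T0:] = 1.0
Y = rng.normal(size=(N, 1)) + rng.normal(size=(1, T)) + 0.7 * W
tau_unweighted = scalar_effect_twfe(Y, W)
tau_weighted = scalar_effect_twfe(Y, W, q=rng.uniform(0.5, 1.5, size=(N, T)))
```

On this noise-free panel both calls recover the planted effect; the weights only matter once noise or an unmodeled interactive component is present.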
TROP, SDID, SC, DID, MC, and DIFP
The semi-synthetic section calls the six estimators in the ZBW TROP replication archive directly. The main TROP row solves a weighted two-way fixed-effect plus nuclear-norm problem with treatment included as a scalar regressor:
\[ (\widehat\mu,\widehat\alpha,\widehat\gamma,\widehat\tau,\widehat L) = \arg\min_{\mu,\alpha,\gamma,\tau,L} \sum_{i,t} \left[ \delta_{it} \{Y_{it}-\mu-\alpha_i-\gamma_t-\tau W_{it}-L_{it}\} \right]^2 + \lambda_{\mathrm{nn}}\|L\|_*. \]
Here \(\delta_{it}\) is the product of exponential unit and time localization weights. The archive multiplies residuals by \(\delta_{it}\) before squaring, so the effective squared-loss weight is \(\delta_{it}^2\). Tuning follows the archive’s placebo cross-validation routine over unit localization, time localization, and nuclear-norm penalty.
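The objective above can be evaluated directly, which makes the squared-weight point easy to check. This is a sketch under stated assumptions, not the archive's solver: the function names are hypothetical, and the exponential-in-distance form of the localization weights (with decay rates lam_unit and lam_time and precomputed distances) is an assumed parameterization.

```python
import numpy as np

def localization_weights(unit_dist, time_dist, lam_unit, lam_time):
    """Product weights delta_it = omega_i * theta_t.
    Assumed form: exponential decay in a precomputed distance."""
    omega = np.exp(-lam_unit * np.asarray(unit_dist, dtype=float))  # (N,)
    theta = np.exp(-lam_time * np.asarray(time_dist, dtype=float))  # (T,)
    return np.outer(omega, theta)

def trop_objective(Y, W, mu, alpha, gamma, tau, L, delta, lam_nn):
    """Weighted TWFE loss plus nuclear-norm penalty on L."""
    resid = Y - mu - alpha[:, None] - gamma[None, :] - tau * W - L
    loss = np.sum((delta * resid) ** 2)   # residuals weighted before squaring
    return loss + lam_nn * np.linalg.norm(L, ord="nuc")

rng = np.random.default_rng(0)
N, T = 5, 6
Y = rng.normal(size=(N, T))
W = np.zeros((N, T))
W[0, 4:] = 1.0
L = rng.normal(size=(N, T))
alpha, gamma = rng.normal(size=N), rng.normal(size=T)
delta = localization_weights(rng.uniform(size=N), np.arange(T)[::-1], 0.5, 0.1)
value = trop_objective(Y, W, 0.2, alpha, gamma, 0.3, L, delta, 0.7)
```

Setting both decay rates to zero makes every weight equal to one, which corresponds to switching localization off.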
The related archive rows are:
- SDID: synthetic difference-in-differences style two-way balancing.
- SC: synthetic control.
- DID: two-way difference-in-differences.
- MC: the same weighted TWFE/nuclear-norm solver with localization switched off.
- DIFP: the archive’s factor/imputation comparator.
The main conceptual distinction is that TROP is target-local and diagonal in its weighting, while MTGPAX is a coherent global outcome model. The scalar-effect fixed-effect MTGP is an attempt to keep the MTGP surface while adopting the low-dimensional treatment-effect channel and unpenalized additive structure that help TROP and SDID.
Other Synthetic-Only Comparators
The earlier synthetic path bakeoff also includes Graph-Laplacian MTGP, Spectral Laplacian MTGP, Dynamax factor SSM, SYNTH-lite, SDID-lite, and TROP-lite. Those are diagnostic comparators, not the final TROP archive run. They are useful for understanding whether graph locality, forecasting, or TROP-style residual balancing helps on a known synthetic factor DGP.
Synthetic Data Bakeoff
The synthetic panel DGP is a two-factor untreated surface with nonlinear time curvature. The treated unit has high factor loadings, so simple donor averaging and flat extrapolation are deliberately stressed. Here the untreated post-treatment path is known, so we can score both ATT error and pointwise counterfactual path error.
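Because the untreated path is known here, both scores reduce to a few lines. The sketch below shows one plausible way to compute them for a single treated unit; the function name score_path and the toy numbers are hypothetical and are not drawn from the bakeoff.

```python
import numpy as np

def score_path(y_obs_post, y0_true_post, y0_hat_post):
    """Score an imputed untreated path against the known truth."""
    y_obs_post = np.asarray(y_obs_post, dtype=float)
    y0_true_post = np.asarray(y0_true_post, dtype=float)
    y0_hat_post = np.asarray(y0_hat_post, dtype=float)
    att_true = np.mean(y_obs_post - y0_true_post)
    att_hat = np.mean(y_obs_post - y0_hat_post)
    cf_rmse = np.sqrt(np.mean((y0_hat_post - y0_true_post) ** 2))
    return {"att_true": att_true, "att_hat": att_hat,
            "att_error": att_hat - att_true, "cf_rmse": cf_rmse}

# Toy post-period with a constant effect of -0.5 (made-up numbers).
y0_true = np.array([1.0, 2.0, 3.0])
y_obs = y0_true - 0.5
y0_hat = np.array([1.1, 1.9, 3.3])
scores = score_path(y_obs, y0_true, y0_hat)
```

The two scores answer different questions: ATT error rewards getting the post-period average right, while counterfactual RMSE penalizes every pointwise miss.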
For the displayed seed:
| quantity | value |
|---|---|
| true ATT | -0.7525 |
| MTGPAX ATT | -0.5066 |
| Graph-Laplacian MTGP ATT | -0.4766 |
| Spectral Laplacian MTGP ATT | -0.1018 |
| Dynamax factor SSM ATT | -0.6924 |
| TROP nuclear ATT | 0.3903 |
| MTGPAX CF RMSE | 0.2614 |
| Graph-Laplacian MTGP CF RMSE | 0.3053 |
| Spectral Laplacian MTGP CF RMSE | 0.7113 |
| Dynamax factor SSM CF RMSE | 0.1196 |
| TROP nuclear CF RMSE | 1.1585 |
Across 20 seeds:
| method | ATT RMSE | mean abs ATT error | mean CF RMSE |
|---|---|---|---|
| MTGPAX | 0.1200 | 0.0926 | 0.1279 |
| Dynamax factor SSM | 0.1431 | 0.1109 | 0.1613 |
| Graph-Laplacian MTGP | 0.3042 | 0.2589 | 0.2947 |
| TROP-lite | 0.4049 | 0.3719 | 0.3966 |
| Spectral Laplacian MTGP | 0.6235 | 0.5309 | 0.5780 |
| TROP nuclear | 0.9234 | 0.8602 | 0.8846 |
| SDID-lite | 1.0026 | 0.8910 | 0.9343 |
| SYNTH-lite | 1.3280 | 1.1615 | 1.1999 |
On this specific factor DGP, the original MTGPAX outcome model is the best of the earlier synthetic-path estimators. That result should not be overread: this DGP is friendly to a global low-rank GP surface. The TROP semi-synthetic benchmark below is a different test, built from empirical panels and target-local assignment mechanisms.
TROP Semi-Synthetic Setup
The TROP-paper-style benchmark follows the ZBW replication archive. There are 21 designs: CPS, PWT, Germany, Basque, Smoking, and Boatlift panels crossed with policy, random, simulated, or original-treated-unit assignment regimes. Each design uses 100 Monte Carlo replications, for 2,100 simulated panels.
For each source panel, the observed outcome matrix is normalized and decomposed with a rank-4 SVD. The low-rank reconstruction is split into:
- \(F\): additive unit and time structure from row and column means;
- \(M\): the remaining rank-4 interactive component after removing \(F\);
- \(E\): residuals from the rank-4 reconstruction.
The residuals are used to fit an AR(2) time-series covariance. A simulated panel is drawn as
\[ Y = F + M + \varepsilon, \]
where \(\varepsilon\) is drawn independently by unit from the fitted AR(2) covariance. No treatment effect is added, so the true effect is zero in every replication. The estimator’s scalar effect estimate is therefore its estimation error.
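The decomposition and redraw can be sketched in NumPy as follows. This is an illustration under several assumptions rather than the archive's code: the input is taken as already normalized, \(F\) is built from row and column means of the rank-4 reconstruction, the AR(2) coefficients are fit by pooled least squares, and noise is simulated by recursion with a burn-in instead of drawing from an explicit covariance matrix. All function names are hypothetical.

```python
import numpy as np

def decompose_panel(Y, rank=4):
    """Split a normalized panel into additive F, interactive M, residual E."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]      # rank-`rank` reconstruction
    F = L.mean(axis=1, keepdims=True) + L.mean(axis=0, keepdims=True) - L.mean()
    M = L - F                                     # interactive remainder
    E = Y - L                                     # reconstruction residuals
    return F, M, E

def fit_ar2(E):
    """Pooled least squares for e_t = phi1 * e_{t-1} + phi2 * e_{t-2} + innov."""
    y = E[:, 2:].ravel()
    X = np.column_stack([E[:, 1:-1].ravel(), E[:, :-2].ravel()])
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma = np.std(y - X @ phi)                   # innovation scale
    return phi, sigma

def simulate_panel(F, M, phi, sigma, rng, burn=50):
    """Draw Y = F + M + eps with AR(2) noise, independent across units.
    Assumes the fitted AR(2) is stationary; no treatment effect is added."""
    N, T = F.shape
    innov = rng.normal(scale=sigma, size=(N, T + burn))
    e = np.zeros((N, T + burn))
    for t in range(2, T + burn):
        e[:, t] = phi[0] * e[:, t - 1] + phi[1] * e[:, t - 2] + innov[:, t]
    return F + M + e[:, burn:]

# Toy source panel: low-rank signal plus noise (made-up data).
rng = np.random.default_rng(0)
N, T = 30, 40
Y_src = rng.normal(size=(N, 3)) @ rng.normal(size=(3, T)) + 0.1 * rng.normal(size=(N, T))
F, M, E = decompose_panel(Y_src)
phi, sigma = fit_ar2(E)
Y_sim = simulate_panel(F, M, phi, sigma, rng)
```

Because every simulated panel carries a true effect of zero, any estimator run on Y_sim with a placebo assignment returns its own error directly.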
The assignment mechanisms are:
- policy: estimate treatment propensities from original assignment on rank-4 unit factors, then sample treated units, capped at 10;
- random: sample 10 treated units uniformly at random;
- simulated: sample from propensities based on the archive’s synthetic-control weights for the originally treated unit;
- treated_unit: keep the original treated unit fixed and redraw AR(2) noise.
All designs use the final 10 periods as treated periods. The archive estimators return scalar treatment-effect estimates. MTGPAX_MAP is the old residual-path MTGP averaged over treated post cells. The two scalar-effect MTGP variants return the jointly estimated \(\widehat\tau\).
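As an illustration of the simplest mechanism, the sketch below builds the treatment matrix for the random regime: 10 units drawn uniformly without replacement, treated in the final 10 periods. The function name random_assignment and the toy panel size are hypothetical; the propensity-based regimes would replace only the unit-sampling step.

```python
import numpy as np

def random_assignment(N, T, rng, n_treated=10, n_post=10):
    """Treatment indicator W for the random regime.
    Treated units are drawn uniformly without replacement and are
    treated in the final n_post periods only."""
    W = np.zeros((N, T), dtype=int)
    treated = rng.choice(N, size=n_treated, replace=False)
    W[treated, T - n_post:] = 1
    return W

rng = np.random.default_rng(0)
W = random_assignment(N=50, T=40, rng=rng)
treated_cells = int(W.sum())   # 10 units x 10 periods
```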
Semi-Synthetic Results
The final full rerun has no nonfinite estimates: every estimator below has 2,100/2,100 finite values.
| method | designs | finite_reps | mean_RMSE | median_RMSE | mean_abs_bias | mean_variance |
|---|---|---|---|---|---|---|
| TROP | 21 | 2100 | 0.1029 | 0.0500 | 0.0384 | 0.0141 |
| SCALAR_EFFECT_FE_MTGP | 21 | 2100 | 0.1192 | 0.0498 | 0.0237 | 0.0279 |
| MC | 21 | 2100 | 0.1192 | 0.0546 | 0.0529 | 0.0162 |
| TROP_WEIGHTED_SCALAR_FE_MTGP | 21 | 2100 | 0.1229 | 0.0632 | 0.0358 | 0.0261 |
| SDID | 21 | 2100 | 0.1343 | 0.0724 | 0.0606 | 0.0228 |
| DIFP | 21 | 2100 | 0.1415 | 0.0736 | 0.0589 | 0.0257 |
| SC | 21 | 2100 | 0.1433 | 0.0812 | 0.0726 | 0.0185 |
| DID | 21 | 2100 | 0.2130 | 0.1687 | 0.1220 | 0.0343 |
| MTGPAX_MAP | 21 | 2100 | 0.3182 | 0.1280 | 0.0465 | 0.2311 |
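For reference, design-level metrics of this kind can be computed from the replication estimates as sketched below. The function name design_metrics is hypothetical, and the use of the population variance (ddof=0) is an assumption; under that convention the squared RMSE equals squared bias plus variance exactly within a design. The table's mean columns would then be averages of these design-level quantities across the 21 designs, if the column names are read literally.

```python
import numpy as np

def design_metrics(tau_hat, tau_true=0.0):
    """RMSE, bias, and variance of scalar estimates within one design.
    Nonfinite estimates are dropped and counted separately."""
    err = np.asarray(tau_hat, dtype=float) - tau_true
    finite = np.isfinite(err)
    err = err[finite]
    return {
        "finite_reps": int(finite.sum()),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "bias": float(err.mean()),
        "variance": float(err.var()),   # ddof=0 (assumed convention)
    }

# Made-up estimates for one design; the true effect is zero.
m = design_metrics([0.1, -0.1, 0.3, np.nan])
```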
The design-level winners are:
| key | dataset | assignment | method | rmse | bias | variance |
|---|---|---|---|---|---|---|
| basque_random | Basque | random | SCALAR_EFFECT_FE_MTGP | 0.0297 | 0.0026 | 0.0009 |
| basque_simulated | Basque | simulated | TROP | 0.0510 | -0.0038 | 0.0026 |
| basque_treated_unit | Basque | treated_unit | SC | 0.0344 | 0.0015 | 0.0012 |
| boatlift_random | Boatlift | random | TROP | 0.1220 | 0.0255 | 0.0144 |
| boatlift_simulated | Boatlift | simulated | TROP | 0.2963 | 0.0446 | 0.0867 |
| boatlift_treated_unit | Boatlift | treated_unit | SC | 0.1453 | 0.0217 | 0.0208 |
| cps_hours_minwage | CPS | policy | MC | 0.1668 | 0.0560 | 0.0249 |
| cps_logwage_abortion | CPS | policy | TROP | 0.0238 | 0.0037 | 0.0006 |
| cps_logwage_gunlaw | CPS | policy | TROP | 0.0281 | 0.0036 | 0.0008 |
| cps_logwage_minwage | CPS | policy | TROP | 0.0272 | 0.0083 | 0.0007 |
| cps_logwage_random | CPS | random | TROP | 0.0248 | -0.0036 | 0.0006 |
| cps_urate_minwage | CPS | policy | SCALAR_EFFECT_FE_MTGP | 0.1589 | 0.0368 | 0.0241 |
| germany_random | Germany | random | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0250 | 0.0049 | 0.0006 |
| germany_simulated | Germany | simulated | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0366 | -0.0027 | 0.0013 |
| germany_treated_unit | Germany | treated_unit | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.0386 | -0.0153 | 0.0013 |
| pwt_loggdp_democracy | PWT | policy | TROP | 0.0245 | 0.0107 | 0.0005 |
| pwt_loggdp_education | PWT | policy | TROP | 0.0261 | 0.0086 | 0.0006 |
| pwt_loggdp_random | PWT | random | TROP | 0.0239 | -0.0040 | 0.0006 |
| smoking_random | Smoking | random | TROP | 0.0898 | 0.0037 | 0.0081 |
| smoking_simulated | Smoking | simulated | SCALAR_EFFECT_FE_MTGP | 0.2226 | 0.0098 | 0.0500 |
| smoking_treated_unit | Smoking | treated_unit | TROP_WEIGHTED_SCALAR_FE_MTGP | 0.2274 | -0.0500 | 0.0497 |
The compact design-level comparison below keeps the main rows in view:
| key | dataset | assignment | TROP | Scalar FE MTGP | TROP-weighted scalar FE MTGP | MC | old MTGPAX MAP |
|---|---|---|---|---|---|---|---|
| cps_logwage_minwage | CPS | policy | 0.0272 | 0.0498 | 0.0632 | 0.0371 | 0.0720 |
| cps_urate_minwage | CPS | policy | 0.1846 | 0.1589 | 0.1819 | 0.2280 | 0.2023 |
| cps_hours_minwage | CPS | policy | 0.1709 | 0.2548 | 0.2523 | 0.1668 | 0.3193 |
| cps_logwage_gunlaw | CPS | policy | 0.0281 | 0.0388 | 0.0431 | 0.0355 | 0.1240 |
| cps_logwage_abortion | CPS | policy | 0.0238 | 0.0358 | 0.0419 | 0.0303 | 0.0749 |
| cps_logwage_random | CPS | random | 0.0248 | 0.0340 | 0.0447 | 0.0303 | 0.0938 |
| pwt_loggdp_democracy | PWT | policy | 0.0245 | 0.0333 | 0.0324 | 0.0435 | 0.0683 |
| pwt_loggdp_education | PWT | policy | 0.0261 | 0.0391 | 0.0308 | 0.0426 | 0.0835 |
| pwt_loggdp_random | PWT | random | 0.0239 | 0.0278 | 0.0349 | 0.0331 | 0.0491 |
| germany_random | Germany | random | 0.0258 | 0.0272 | 0.0250 | 0.0331 | 0.0688 |
| germany_simulated | Germany | simulated | 0.0500 | 0.0388 | 0.0366 | 0.0670 | 0.0964 |
| germany_treated_unit | Germany | treated_unit | 0.0464 | 0.0439 | 0.0386 | 0.0523 | 0.1217 |
| basque_random | Basque | random | 0.0463 | 0.0297 | 0.0405 | 0.0546 | 0.1280 |
| basque_simulated | Basque | simulated | 0.0510 | 0.0609 | 0.0633 | 0.0732 | 0.1574 |
| basque_treated_unit | Basque | treated_unit | 0.0621 | 0.0802 | 0.1009 | 0.0382 | 0.1830 |
| smoking_random | Smoking | random | 0.0898 | 0.0917 | 0.0911 | 0.1028 | 0.3237 |
| smoking_simulated | Smoking | simulated | 0.2702 | 0.2226 | 0.2562 | 0.3171 | 0.7739 |
| smoking_treated_unit | Smoking | treated_unit | 0.4089 | 0.2549 | 0.2274 | 0.5007 | 1.0099 |
| boatlift_random | Boatlift | random | 0.1220 | 0.1403 | 0.1412 | 0.1265 | 0.4168 |
| boatlift_simulated | Boatlift | simulated | 0.2963 | 0.4453 | 0.4478 | 0.3300 | 1.1575 |
| boatlift_treated_unit | Boatlift | treated_unit | 0.1590 | 0.3950 | 0.3872 | 0.1603 | 1.1584 |
MTGP Variant Ablation
Before running the full 21-design study, the six-design crude ablation tested the relevant MTGP variants. The important result is that the improvement comes from jointly estimating a scalar effect and adding unpenalized additive fixed effects. Neither naive TROP-weighting of the old MTGP nor after-the-fact residual correction was enough.
| method | designs | finite_reps | mean RMSE | median RMSE | mean abs bias | mean variance | total_errors |
|---|---|---|---|---|---|---|---|
| SCALAR_EFFECT_FE_MTGP | 6 | 72 | 0.0784 | 0.0736 | 0.0195 | 0.0084 | 0 |
| TROP | 6 | 72 | 0.0807 | 0.0588 | 0.0332 | 0.0090 | 0 |
| TROP_WEIGHTED_SCALAR_FE_MTGP | 6 | 72 | 0.0809 | 0.0704 | 0.0230 | 0.0088 | 0 |
| MC | 6 | 72 | 0.0875 | 0.0810 | 0.0415 | 0.0070 | 0 |
| SDID | 6 | 72 | 0.0878 | 0.0826 | 0.0375 | 0.0085 | 0 |
| DIFP | 6 | 72 | 0.1101 | 0.0897 | 0.0465 | 0.0138 | 0 |
| SC | 6 | 72 | 0.1216 | 0.1466 | 0.0524 | 0.0147 | 0 |
| SCALAR_EFFECT_MTGP | 6 | 72 | 0.1250 | 0.0774 | 0.0675 | 0.0236 | 0 |
| DID | 6 | 72 | 0.1569 | 0.1372 | 0.0899 | 0.0146 | 0 |
| SPECTRAL_LAPLACIAN_MTGP | 6 | 72 | 0.1887 | 0.0873 | 0.0551 | 0.0662 | 0 |
| MTGPAX_MAP | 6 | 72 | 0.1966 | 0.1890 | 0.0634 | 0.0535 | 0 |
| TROP_WEIGHTED_SCALAR_MTGP | 6 | 72 | 0.1982 | 0.0869 | 0.1609 | 0.0099 | 0 |
| MTGP_TROP_RESIDUALIZED | 6 | 72 | 0.2031 | 0.1878 | 0.0606 | 0.0575 | 0 |
| GRAPH_LAPLACIAN_MTGP | 6 | 72 | 0.2219 | 0.1209 | 0.0823 | 0.0781 | 0 |
| SVD_RANK4_DYNAMIC | 6 | 72 | 0.2254 | 0.1867 | 0.1578 | 0.0291 | 0 |
| TROP_WEIGHTED_MTGP_MAP | 6 | 61 | 0.2458 | 0.2414 | 0.1552 | 0.0424 | 0 |
Interpretation
The clean read is:
- SCALAR_EFFECT_FE_MTGP is the right MTGP direction for this benchmark. It reduces mean RMSE from 0.3182 for old MTGPAX_MAP to 0.1192, fixes the earlier convergence problem, and has the lowest mean absolute bias.
- TROP remains the best overall estimator on its native semi-synthetic benchmark, with mean RMSE 0.1029 and roughly half the mean variance of scalar-FE MTGP (0.0141 versus 0.0279).
- The scalar-FE MTGP variants win several designs, especially CPS unemployment, Germany, Basque random, and Smoking simulated and treated-unit, but they lose on the harder Boatlift simulated and treated-unit settings.
- The remaining MTGP gap is not primarily bias. It is variance control in small or high-noise panels.
That makes the next technical target fairly specific: keep the scalar treatment-effect channel and the unpenalized additive fixed effects, but add stronger variance control for the MTGP surface. TROP’s target-local weighting is not a drop-in fix by itself; it helps in some Germany/Smoking settings but does not dominate the unweighted scalar-FE objective.