Realized Lepskii selections by DGP in the one-axis prototype:

| DGP | Name | Latent heterogeneity | Selected cohort bins | Candidate cohort bins |
|---|---|---|---|---|
| 0 | Common path | one latent response type | 1 | 1, 2, 3, 5, 9 |
| 1 | Early vs late | two adjacent adoption regimes | 5 | 1, 2, 3, 5, 9 |
| 2 | Local window shock | localized adoption-window shock | 1 | 1, 2, 3, 5, 9 |
| 3 | Log vs shark vs sin | three F-test-style shape families | 2 | 1, 2, 3, 5, 9 |
| 4 | No pooling: signed paths | every cohort has a distinct sign and path shape | 5 | 1, 2, 3, 5, 9 |
| 5 | No pooling: peak timing | every cohort has a distinct peak time and width | 1 | 1, 2, 3, 5, 9 |
| 6 | No pooling: signatures | every cohort has a distinct response signature | 5 | 1, 2, 3, 5, 9 |
| 7 | Response timing | same long-run level, different response timing | 1 | 1, 2, 3, 5, 9 |
| 8 | Smooth dose gradient | continuous timing-gradient heterogeneity | 2 | 1, 2, 3, 5, 9 |
| 9 | Three epochs | three adjacent adoption regimes | 3 | 1, 2, 3, 5, 9 |
Adaptive Partial Pooling for Staggered Adoption Event Studies
Two-Way Cohort and Event-Time Pooling by Lepskii Selection
Staggered-adoption event studies face a bias-variance tradeoff. A pooled event-study regression is precise but can be biased when dynamic treatment effects differ across adoption cohorts or when the dynamic response has structure that is poorly represented by a single common curve. A saturated Sun-Abraham style regression is robust to arbitrary cohort-by-event-time heterogeneity, but can be noisy when many cells are weakly supported. This memo develops an adaptive partial-pooling estimator over two interpretable dimensions: pooling adjacent adoption cohorts and pooling adjacent event periods. The coarsest model pools cohorts and event periods heavily; the finest model is the saturated cohort-by-event-time event study. A Lepskii rule selects the coarsest model whose aggregate event-time ATT curve remains statistically indistinguishable from all finer refinements. The goal is to retain the bias protection of heterogeneity-robust event studies while recovering precision whenever treatment-effect heterogeneity is structured.
Executive Summary
The original joint-test idea asks a binary diagnostic question: can we reject the restriction that dynamic effects are common across cohorts? That is useful, but it is not the estimation problem researchers actually face. If cohort dynamics are not exactly homogeneous, we still need to decide how much heterogeneity to model. The fully saturated model protects against bias, but it may pay a large variance cost. The pooled model is precise, but fragile. The useful middle ground is a sequence of partially pooled event-study models.
This memo formalizes that sequence along two axes.
- Cohort pooling: adjacent adoption cohorts are grouped into bins. One bin gives a pooled event study; one bin per cohort gives the saturated cohort-specific event study.
- Event-time pooling: adjacent event times are grouped into bins or, more generally, approximated by a nested event-time basis. One or a few bins impose a smooth or low-dimensional dynamic response; one bin per event time gives the saturated event-time dummy basis.
Every candidate model is mapped to the same event-time ATT target. A Lepskii selector starts from simple models and accepts the first model whose target curve is statistically close to every finer model. The selected estimator is therefore not a pre-test between pooled and saturated. It is a data-dependent choice of the coarsest adequate representation of the cohort-by-event-time treatment-effect surface.
The core formal object is a two-dimensional grid:
\[ \mathcal{M} = \{M_{a,b}: a = 0,\ldots,A,\ b = 0,\ldots,B\}, \]
where \(a\) indexes cohort-partition refinement and \(b\) indexes event-time partition or basis refinement. The partial order is
\[ (a,b) \preceq (a',b') \quad \Longleftrightarrow \quad \mathcal{P}_{a'} \text{ refines } \mathcal{P}_{a} \text{ and } \mathcal{Q}_{b'} \text{ refines } \mathcal{Q}_{b}. \]
The selected model is the simplest \((a,b)\) such that, for all finer \((a',b') \succeq (a,b)\),
\[ \hat\tau_{a,b} - \hat\tau_{a',b'} \]
is small relative to its estimated sampling variation. The main implementation issue is estimating the joint covariance of the entire model path. The right production route is a cluster multiplier bootstrap or influence-function stack for all target estimates.
1. Motivation
Staggered-adoption event studies have two useful but imperfect endpoints. The pooled event-study regression estimates one dynamic treatment-effect curve for all treated cohorts. It is simple and often precise, but it can be biased for interpretable dynamic average treatment effects when treatment effects vary over adoption cohorts (Goodman-Bacon 2021; de Chaisemartin and D’Haultfœuille 2020; Goldsmith-Pinkham, Hull, and Kolesár 2024). The fully saturated event-study regression estimates a separate dynamic curve for every adoption cohort and then aggregates those cohort-specific effects (Sun and Abraham 2021; Wooldridge 2021). Under parallel trends and no anticipation, this saturated approach removes cohort-heterogeneity bias, but it can pay a large variance cost.
The key observation is that most empirical treatment-effect heterogeneity is not arbitrary white noise over cohort and event time. It may be structured:
- early adopters differ from late adopters;
- adjacent adoption cohorts share similar exposure environments;
- effects rise smoothly after adoption and then plateau;
- short-run and long-run effects differ, but individual event-time coefficients are noisy;
- only a subset of cohorts exhibits a distinct path.
A binary F-test discards this structure. It asks whether the pooled model is exactly true. The adaptive partial-pooling view asks instead: what is the coarsest cohort-time representation that captures all statistically detectable features of the dynamic ATT curve?
2. Relation to the Original Joint-Test Idea
The original joint-test paper proposes simple linear restrictions in a saturated event-study regression. In its cleanest form, estimate
\[ Y_{it} = \alpha_i + \lambda_t + \sum_{s \in \mathcal{S}} \gamma_s 1\{t-G_i=s\} + \sum_{g \ne g_0}\sum_{s \in \mathcal{S}} \delta_{g,s}1\{G_i=g\}1\{t-G_i=s\} + u_{it}, \]
and test
\[ H_0: \delta_{g,s}=0 \quad\text{for all tested }(g,s). \]
That test is a useful diagnostic for whether a pooled dynamic curve is visibly misspecified. The Lepskii estimator uses the same regression algebra, but it replaces the one-shot null with a nested family of restrictions. Instead of asking whether all deviations are zero, it asks whether the remaining differences between a candidate model and all finer refinements are small enough to be treated as sampling noise.
A useful slogan:
\[ \text{joint F-test} = \text{detect heterogeneity}, \qquad \text{Lepskii} = \text{choose the amount of heterogeneity to model}. \]
3. Setup
There are units \(i=1,\ldots,n\) and periods \(t=1,\ldots,T\). Let \(D_{it}\) denote treatment status. Adoption is absorbing, so
\[ D_{it}=1\{t \ge G_i\}, \]
where \(G_i \in \mathcal{G}\cup\{\infty\}\) is the first treatment period and \(G_i=\infty\) denotes never treated. Let event time be
\[ s_{it}=t-G_i. \]
For treated adoption cohorts, define potential outcomes \(Y_{it}(g)\) as the outcome at time \(t\) if unit \(i\) first receives treatment in period \(g\), and \(Y_{it}(\infty)\) as the untreated path.
3.1 Assumptions
The memo targets the usual event-study estimands under standard identifying conditions.
Assumption 1: No anticipation. For all \(t<g\),
\[ Y_{it}(g)=Y_{it}(\infty). \]
Assumption 2: Parallel trends for untreated potential outcomes. After conditioning on unit and time effects, the untreated potential outcome path for cohort \(g\) evolves like the comparison cohorts used for that event-time cell. The exact comparison group can be never-treated units, not-yet-treated units, or the comparison set implied by a saturated interaction regression.
Assumption 3: Overlap. Every cohort-event-time cell that contributes to the target has support and a valid comparison set.
Assumption 4: Cluster asymptotics. Units are independent across clusters, and the number of clusters grows. The time dimension may be fixed or moderate. Cluster-robust or multiplier-bootstrap covariance estimators are used for inference.
These assumptions are intentionally orthogonal to the pooling problem. Pooling is not an identifying assumption by itself; it is a regularization choice for estimating the identified cohort-by-event-time surface.
3.2 Cohort-Time Effects and Target
Let the unique treated adoption dates be
\[ \mathcal{G} = \{g_1 < g_2 < \cdots < g_J\}, \]
and let \(\mathcal{S}\) denote the event-time support, excluding the reference period \(s=-1\). Define the cohort-specific dynamic effect
\[ \theta_{j,s} = E[Y_{it}(g_j)-Y_{it}(\infty) \mid G_i=g_j,\ t-g_j=s]. \]
Stack the full surface as
\[ \theta = (\theta_{1,s_1},\ldots,\theta_{J,s_{|\mathcal{S}|}})'. \]
The event-time ATT target is a weighted aggregate,
\[ \tau_s = \sum_{j=1}^J w_{j,s}\theta_{j,s}, \]
where \(w_{j,s}\) is usually the share of treated observations from cohort \(j\) that contribute to event time \(s\). In matrix form,
\[ \tau = L\theta. \]
The matrix \(L\) is known once the target population and event-time support are chosen.
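As a concrete sketch of this mapping, the following builds \(L\) from cohort-by-event-time cell counts. The `build_target_map` helper and its input layout are illustrative assumptions, not part of the current codebase:

```python
import numpy as np

def build_target_map(counts):
    """Build L mapping the stacked surface theta (J*S,) to tau (S,).

    counts[j, s] = number of treated observations from cohort j observed
    at event time s; weights w_{j,s} are within-event-time cohort shares.
    """
    J, S = counts.shape
    shares = counts / counts.sum(axis=0, keepdims=True)  # w_{j,s}
    L = np.zeros((S, J * S))
    for j in range(J):
        for s in range(S):
            L[s, j * S + s] = shares[j, s]
    return L

# Example: 3 cohorts, 4 event times, balanced support.
counts = np.ones((3, 4))
L = build_target_map(counts)
theta = np.arange(12, dtype=float)  # stacked row-major over (cohort j, event time s)
tau = L @ theta                     # equal-weight average over cohorts at each s
```

Each row of \(L\) places the within-event-time cohort shares on the corresponding \(\theta_{j,s}\) entries, so every row sums to one.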
4. Two-Way Pooling Geometry
The main estimator is built from two nested restrictions on \(\theta\): one over cohorts and one over event time.
4.1 Cohort Partitions
Let \(\mathcal{P}_a\) be a partition of treated cohorts into adjacent adoption blocks:
\[ \mathcal{P}_a = \{B_{a,1},\ldots,B_{a,K_a}\}, \]
where each \(B_{a,k}\) is a consecutive set of adoption dates. The path is nested:
\[ \mathcal{P}_{a+1}\text{ refines }\mathcal{P}_{a}. \]
The coarsest partition has \(K_0=1\) and pools all treated cohorts. The finest partition has \(K_A=J\) and separates every adoption cohort.
Concrete example with nine cohorts:
\[ K_a \in \{1,2,3,5,9\}. \]
The adjacent-block restriction is substantively interpretable: nearby adopters may share policy environments, exposure conditions, or implementation regimes.
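A nested adjacent-block path can be generated from nested sets of interior cut points. The cut-point choices below are one illustrative option for \(J=9\), not a prescribed default:

```python
def partition_from_cuts(J, cuts):
    """Adjacent blocks of cohort indices 0..J-1 from sorted interior cut points."""
    bounds = [0] + sorted(cuts) + [J]
    return [list(range(a, b)) for a, b in zip(bounds[:-1], bounds[1:])]

def nested_cohort_path(J, cut_path):
    """cut_path: increasing list of cut sets, each containing the previous,
    so each partition refines the one before it; verified by assertion."""
    parts = [partition_from_cuts(J, c) for c in cut_path]
    for prev, nxt in zip(parts[:-1], parts[1:]):
        coarse_blocks = [set(b) for b in prev]
        assert all(any(set(nb) <= cb for cb in coarse_blocks) for nb in nxt), "not nested"
    return parts

# K_a in {1, 2, 3, 5, 9} for J = 9 cohorts.
path = nested_cohort_path(9, [[], [4], [2, 4], [2, 4, 6, 8], list(range(1, 9))])
```

Because each cut set contains the previous one, each partition refines its predecessor by construction.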
4.2 Event-Time Partitions
The new piece is to give event time the same treatment. Let \(\mathcal{Q}_b\) be a partition of event times into adjacent event-time blocks:
\[ \mathcal{Q}_b = \{C_{b,1},\ldots,C_{b,R_b}\}, \]
with
\[ \mathcal{Q}_{b+1}\text{ refines }\mathcal{Q}_{b}. \]
The coarsest event-time partition might be something like
\[ \{-5,-4,-3,-2\},\quad \{0,1\},\quad \{2,3,4\},\quad \{5,6,\ldots\}, \]
while the finest partition separates every event time. Event-time pooling is useful when the dynamic response is smooth, monotone, plateauing, or naturally summarized by short-run versus medium-run versus long-run effects.
Event-time partitions are the simplest implementation because they preserve the linear-regression dummy structure. More generally, \(\mathcal{Q}_b\) can be replaced by a nested event-time basis
\[ \Phi_b(s)=\left(\phi_{b,1}(s),\ldots,\phi_{b,R_b}(s)\right)', \]
with
\[ \operatorname{span}(\Phi_b) \subseteq \operatorname{span}(\Phi_{b+1}). \]
Piecewise constants, splines, polynomials, or saturated event-time dummies are all admissible. The partition case is recovered by setting basis functions equal to indicators for event-time blocks.
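A minimal numerical check of the nesting condition, using indicator bases for two event-time partitions (the bin choices are illustrative):

```python
import numpy as np

def indicator_basis(event_times, bins):
    """Phi with Phi[i, r] = 1 if event_times[i] lies in bin r."""
    Phi = np.zeros((len(event_times), len(bins)))
    for r, b in enumerate(bins):
        for i, s in enumerate(event_times):
            if s in b:
                Phi[i, r] = 1.0
    return Phi

def spans_nested(Phi_coarse, Phi_fine, tol=1e-8):
    """Check span(Phi_coarse) is contained in span(Phi_fine) via projection."""
    proj, *_ = np.linalg.lstsq(Phi_fine, Phi_coarse, rcond=None)
    return bool(np.allclose(Phi_fine @ proj, Phi_coarse, atol=tol))

times = list(range(-5, 6))
coarse = indicator_basis(times, [{-5, -4, -3, -2, -1}, {0, 1}, {2, 3, 4, 5}])
fine = indicator_basis(times, [{-5, -4, -3}, {-2, -1}, {0}, {1}, {2, 3}, {4, 5}])
```

Coarse indicator columns are sums of the fine columns they contain, so the least-squares projection reproduces them exactly; the reverse direction fails.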
4.3 Tensor Product Model Class
For a cohort partition \(\mathcal{P}_a\) and event-time partition \(\mathcal{Q}_b\), model \(M_{a,b}\) imposes that effects are constant within each rectangle
\[ B_{a,k} \times C_{b,r}. \]
Thus,
\[ \theta_{j,s}^{a,b} = \eta_{k,r} \quad\text{if}\quad g_j \in B_{a,k},\ s\in C_{b,r}. \]
Equivalently, define a restriction matrix \(H_{a,b}\) such that
\[ \theta = H_{a,b}\eta_{a,b} + \rho_{a,b}, \]
where \(\eta_{a,b}\) is the lower-dimensional parameter vector and \(\rho_{a,b}\) is approximation error. If \(M_{a,b}\) is correctly specified, \(\rho_{a,b}=0\).
For the nested-basis version, the tensor product representation is
\[ \theta_{j,s}^{a,b} = \sum_{k=1}^{K_a}\sum_{r=1}^{R_b} \eta_{k,r} 1\{g_j\in B_{a,k}\}\phi_{b,r}(s). \]
The dimension is roughly
\[ \dim(M_{a,b}) = K_a R_b, \]
up to normalizations and unsupported cells.
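With row-major stacking of \(\theta\) over (cohort, event time), \(H_{a,b}\) is a Kronecker product of cohort-block and event-bin membership matrices. A small sketch with hand-picked block indices:

```python
import numpy as np

def membership(n_items, blocks):
    """n_items x n_blocks 0/1 matrix: entry (i, k) = 1 if item i is in block k."""
    M = np.zeros((n_items, len(blocks)))
    for k, block in enumerate(blocks):
        for i in block:
            M[i, k] = 1.0
    return M

def restriction_matrix(cohort_blocks, event_blocks, J, S):
    """H_{a,b}: theta = H eta for the tensor-rectangle model,
    with theta stacked row-major over (cohort j, event time s)."""
    Pc = membership(J, cohort_blocks)  # J x K_a
    Pe = membership(S, event_blocks)   # S x R_b
    return np.kron(Pc, Pe)             # (J*S) x (K_a*R_b)

# Cohorts {0,1} vs {2}; event time {0} vs {1,2}.
H = restriction_matrix([[0, 1], [2]], [[0], [1, 2]], J=3, S=3)
eta = np.array([1.0, 2.0, 3.0, 4.0])   # eta_{k,r}, row-major over (k, r)
theta = H @ eta                        # surface constant on each rectangle
```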
4.4 Regression Form
The corresponding regression is
\[ Y_{it} = \alpha_i + \lambda_t + \sum_{k=1}^{K_a}\sum_{r=1}^{R_b} \eta_{k,r} 1\{G_i\in B_{a,k}\}\psi_{b,r}(s_{it}) +u_{it}, \]
where \(\psi_{b,r}\) is either an event-time-bin indicator or a basis function. The reference period \(s=-1\) is omitted. Never-treated and not-yet-treated observations serve as comparisons according to the chosen event-study design.
This regression nests the standard cases:
- Pooled event study: \(K_a=1\), saturated event-time dummies.
- Static pooled DiD: \(K_a=1\), one post-treatment event-time bin.
- Sun-Abraham saturated event study: \(K_a=J\), saturated event-time dummies.
- Cohort-binned event study: \(1<K_a<J\), saturated event-time dummies.
- Two-way pooled event study: \(1<K_a<J\) and \(1<R_b<|\mathcal{S}|\).
For every model, map fitted coefficients into a full cohort-by-event-time surface and then into the common target:
\[ \hat\theta_{a,b}=H_{a,b}\hat\eta_{a,b}, \qquad \hat\tau_{a,b}=L\hat\theta_{a,b}. \]
5. Oracle Benchmark
Let \(V_{a,b}=\operatorname{Var}(\hat\tau_{a,b})\) and define the approximation bias
\[ B_{a,b}=E[\hat\tau_{a,b}]-\tau. \]
For a positive semidefinite weighting matrix \(W\), the oracle model solves
\[ (a^*,b^*)= \operatorname*{arg\,min}_{(a,b)\in\mathcal{M}} \left\{ B_{a,b}'WB_{a,b} + \operatorname{tr}(WV_{a,b}) \right\}. \]
The bias term falls as the model is refined. The variance term rises as the model is refined. The oracle is the model at which the marginal bias reduction no longer justifies the variance cost.
Since \(B_{a,b}\) is unknown, the oracle cannot be implemented directly. Lepskii’s method estimates its logic indirectly: if a simple model differs from finer models by more than sampling noise, it is biased; if not, the additional flexibility is not empirically warranted.
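In simulations, where \(B_{a,b}\) and \(V_{a,b}\) are computable, the oracle objective can be evaluated directly. A toy sketch with made-up bias and variance paths:

```python
import numpy as np

def oracle_objective(B, V, W=None):
    """B'WB + tr(WV) for one candidate model; W defaults to the identity."""
    W = np.eye(len(B)) if W is None else W
    return float(B @ W @ B + np.trace(W @ V))

# Refining the model shrinks bias but inflates variance.
biases = [np.array([0.5, 0.5]), np.array([0.1, 0.1]), np.zeros(2)]
variances = [0.01 * np.eye(2), 0.05 * np.eye(2), 0.60 * np.eye(2)]
scores = [oracle_objective(B, V) for B, V in zip(biases, variances)]
oracle = int(np.argmin(scores))  # the intermediate model wins in this toy path
```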
6. Lepskii Selector on a Two-Dimensional Partial Order
6.1 Pairwise Comparisons
For two comparable models \((a,b) \preceq (a',b')\), define
\[ \delta_{(a,b),(a',b')} = \hat\tau_{a,b} - \hat\tau_{a',b'}. \]
Its covariance is
\[ S_{(a,b),(a',b')} = V_{a,b}+V_{a',b'} -C_{(a,b),(a',b')}-C_{(a,b),(a',b')}', \]
where
\[ C_{(a,b),(a',b')}= \operatorname{Cov}(\hat\tau_{a,b},\hat\tau_{a',b'}). \]
The cross-covariance matters because all candidate models are fitted on the same data. Ignoring it is not a principled final implementation, though it can be a useful first diagnostic.
6.2 Acceptance Set
Candidate \((a,b)\) is accepted if it is close to all finer models:
\[ \mathcal{A}_{a,b}=1 \left\{ T_{(a,b),(a',b')} \le c_{(a,b),(a',b')} \text{ for all }(a',b')\succeq(a,b) \right\}. \]
A Wald version uses
\[ T_{(a,b),(a',b')} = \delta_{(a,b),(a',b')}' S_{(a,b),(a',b')}^{-1} \delta_{(a,b),(a',b')}. \]
A curve-wise sup-\(t\) version uses
\[ T_{(a,b),(a',b')} = \max_{s\in\mathcal{S}_0} \left| \frac{\hat\tau_{a,b,s}-\hat\tau_{a',b',s}} {\widehat{se}(\hat\tau_{a,b,s}-\hat\tau_{a',b',s})} \right|, \]
where \(\mathcal{S}_0\) is the event-time set used for selection. It may exclude lead periods, the reference period, or very weakly supported tail periods.
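The sup-\(t\) comparison reduces to a few lines once the two curves and the standard errors of their difference are in hand (`sup_t` is an illustrative helper):

```python
import numpy as np

def sup_t(tau_coarse, tau_fine, se_diff, select_idx):
    """Curve-wise sup-t comparison over the selection event-time set S_0."""
    z = np.abs(tau_coarse - tau_fine) / se_diff
    return float(np.max(z[select_idx]))

tau_a = np.array([0.0, 1.0, 1.0, 1.0])   # coarse-model target curve
tau_b = np.array([0.0, 0.8, 1.3, 1.0])   # finer-model target curve
se = np.array([0.1, 0.1, 0.1, 0.1])      # se of the pointwise difference
stat = sup_t(tau_a, tau_b, se, select_idx=[1, 2, 3])
```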
6.3 Selection Rule
The selected model is
\[ (\hat a,\hat b) = \operatorname*{arg\,min}_{(a,b)\in\mathcal{M}} \left\{ \operatorname{complexity}(a,b): \mathcal{A}_{a,b}=1 \right\}. \]
A natural complexity score is
\[ \operatorname{complexity}(a,b)=K_aR_b. \]
Tie-breakers matter because the grid is partially ordered, not totally ordered. A default deterministic rule is:
- minimize \(K_aR_b\);
- among ties, prefer the coarser cohort partition;
- among remaining ties, prefer the coarser event-time partition;
- if still tied, choose the model with smaller estimated variance \(\operatorname{tr}(\widehat V_{a,b})\).
The ordering can be changed if the application has a prior preference. For example, one might prefer event-time smoothness before cohort pooling if the substantive object is a smooth dynamic response.
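The deterministic tie-break rule can be encoded as lexicographic tuple comparison (all names below are hypothetical; `trV` holds \(\operatorname{tr}(\widehat V_{a,b})\)):

```python
def select_model(accepted, K, R, trV):
    """Pick the accepted (a, b) minimizing K_a * R_b, breaking ties by
    coarser cohort partition, then coarser event-time partition, then
    smaller estimated variance trace."""
    candidates = [(K[a] * R[b], K[a], R[b], trV[(a, b)], a, b)
                  for (a, b) in accepted]
    _, _, _, _, a, b = min(candidates)   # lexicographic tuple order
    return a, b

K = {0: 1, 1: 2}                         # cohort-bin counts along the path
R = {0: 2, 1: 4}                         # event-bin counts along the path
accepted = [(0, 1), (1, 0), (1, 1)]      # suppose (0, 0) failed acceptance
trV = {(0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.9}
a, b = select_model(accepted, K, R, trV)
```

Here \((0,1)\) and \((1,0)\) tie at complexity \(K_aR_b=4\); the coarser cohort partition wins, so \((0,1)\) is selected.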
6.4 Why Compare to All Finer Models?
Comparing only to the saturated model is tempting, but too blunt. A saturated model can be noisy, so the simple-versus-saturated difference may look small even when an intermediate model detects structured bias. Lepskii’s logic compares a candidate to the entire finer path. This helps detect cases where the first relevant refinement is, say, early-versus-late cohorts or short-versus-long-run event-time bins, rather than the fully saturated surface.
7. Covariance and Calibration
The production estimator should estimate the covariance of the whole path. There are two natural routes.
7.1 Influence-Function Stack
If each target estimate admits an asymptotic linear representation,
\[ \sqrt n(\hat\tau_{a,b}-\tau_{a,b}) = \frac{1}{\sqrt n}\sum_{i=1}^n \varphi_{i,a,b}+o_p(1), \]
then stack all influence functions:
\[ \varphi_i = (\varphi_{i,0,0}',\varphi_{i,0,1}',\ldots,\varphi_{i,A,B}')'. \]
The covariance of any comparison is obtained by linear transformation of
\[ \widehat\Omega = \frac{1}{n}\sum_{i=1}^n \hat\varphi_i\hat\varphi_i'. \]
For fixed-effects least squares with cluster-robust inference, the influence functions can be assembled from residualized regressors and cluster scores.
7.2 Cluster Multiplier Bootstrap
The more implementation-friendly route is a cluster multiplier bootstrap. For bootstrap draw \(r\), draw cluster weights \(\xi_i^{(r)}\) with mean zero and unit variance. Recompute the model-path score perturbation and produce
\[ \hat\tau_{a,b}^{*(r)} \]
for every candidate \((a,b)\). Then compute the bootstrap analogue of the full selection statistic:
\[ Z^{*(r)} = \max_{(a,b)}\max_{(a',b')\succeq(a,b)}\max_{s\in\mathcal{S}_0} \left| \frac{ (\hat\tau_{a,b,s}^{*(r)}-\hat\tau_{a',b',s}^{*(r)}) -(\hat\tau_{a,b,s}-\hat\tau_{a',b',s}) }{ \widehat{se}(\hat\tau_{a,b,s}-\hat\tau_{a',b',s}) } \right|. \]
Set \(c\) to a high quantile of \(Z^{*(r)}\), such as the \(1-\alpha\) quantile. This calibrates simultaneously over models, refinements, and event times.
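Given bootstrap draws of every pairwise curve difference, the threshold calibration is a max-then-quantile operation. The sketch below checks it on pure noise with known scale; the array shapes and names are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate_threshold(boot_diffs, point_diffs, se_diffs, alpha=0.05):
    """boot_diffs: (n_boot, n_comparisons, n_times) bootstrap curve differences;
    point_diffs, se_diffs: (n_comparisons, n_times). Returns the 1-alpha
    quantile of the max-t statistic Z* taken over comparisons and times."""
    z = np.abs(boot_diffs - point_diffs[None]) / se_diffs[None]
    z_star = z.reshape(len(boot_diffs), -1).max(axis=1)
    return float(np.quantile(z_star, 1 - alpha))

# Toy calibration: centered noise with unit scale.
n_boot, n_cmp, n_t = 500, 3, 4
point = np.zeros((n_cmp, n_t))
se = np.ones((n_cmp, n_t))
boot = rng.normal(size=(n_boot, n_cmp, n_t))
c = calibrate_threshold(boot, point, se, alpha=0.05)
```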
7.3 Conditional Versus Post-Selection Inference
The selected curve
\[ \hat\tau_{\hat a,\hat b} \]
is a post-selection estimator. Model-conditional standard errors understate uncertainty if interpreted as unconditional confidence bands. There are three practical inference levels:
- Selection-only memo: report the selected model and model-conditional bands, clearly labeled as conditional.
- Sample splitting: use one part of the data to select \((\hat a,\hat b)\) and another part to estimate the selected curve. This is simple but inefficient.
- Bootstrap-after-selection: rerun the selection rule inside each bootstrap draw and use the distribution of selected estimates. This is more expensive but targets the actual adaptive procedure.
For the initial research memo, level 1 is enough. For a paper claim, level 3 is the right target.
8. Algorithm
Input: panel data, adoption dates \(G_i\), event-time support \(\mathcal{S}\), cohort partition path \(\{\mathcal{P}_a\}_{a=0}^A\), event-time partition or basis path \(\{\mathcal{Q}_b\}_{b=0}^B\), target weights \(L\), comparison set \(\mathcal{S}_0\), threshold rule.
Step 1: Fit candidate models. For every \((a,b)\), fit the fixed-effects regression implied by cohort blocks \(\mathcal{P}_a\) and event-time blocks or basis \(\mathcal{Q}_b\).
Step 2: Common target map. Convert each model estimate to the full cohort-by-event-time surface \(\hat\theta_{a,b}\) and target curve \(\hat\tau_{a,b}=L\hat\theta_{a,b}\).
Step 3: Joint covariance. Estimate covariance for all differences \(\hat\tau_{a,b}-\hat\tau_{a',b'}\) using either a path-level influence-function stack or cluster multiplier bootstrap.
Step 4: Lepskii acceptance. Mark \((a,b)\) acceptable if all comparisons to finer models are below the calibrated threshold.
Step 5: Select. Choose the acceptable model with minimal complexity and the pre-specified tie-breaker.
Step 6: Report. Report selected cohort bins, selected event-time bins, selected target curve, comparison diagnostics, and sensitivity to the threshold and candidate path.
9. Implementation Details for the Current Codebase
The current proof-of-concept already implements the cohort-pooling axis. The path is
\[ K_a\in\{1,2,3,5,9\}, \]
with a saturated event-time dummy basis. To complete the two-way version, add an event-time path such as
\[ R_b\in\{3,5,9,|\mathcal{S}|\}. \]
A concrete first event-time path for leads \(-5,\ldots,-2\) and lags \(0,\ldots,12\) could be:
- \(R=3\): leads, short-run lags \(0\)–\(2\), medium/long-run lags \(3+\);
- \(R=5\): leads, \(0\)–\(1\), \(2\)–\(4\), \(5\)–\(8\), \(9+\);
- \(R=9\): narrower adjacent lag bins;
- saturated: one dummy per event time.
For each \((K,R)\) pair, the design matrix uses indicators for cohort-bin by event-time-bin rectangles. The model should still produce a full event-time curve by expanding each event-time bin coefficient back onto its constituent event times before applying \(L\).
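The bin-to-grid expansion step can be sketched as follows (`expand_bins` is a hypothetical helper matching the description above, not a function in the current codebase):

```python
import numpy as np

def expand_bins(coefs, bins, event_times):
    """Expand one coefficient per event-time bin onto the full event-time grid,
    so every model reports a curve on the same support before applying L."""
    curve = np.full(len(event_times), np.nan)
    for coef, b in zip(coefs, bins):
        for i, s in enumerate(event_times):
            if s in b:
                curve[i] = coef
    return curve

times = [0, 1, 2, 3, 4]
curve = expand_bins([0.2, 0.5], [{0, 1}, {2, 3, 4}], times)
```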
A good software interface is:
```python
fit = LepskiiEventStudy(
    cohort_bins=[1, 2, 3, 5, 9],
    event_time_bins=[3, 5, 9, "saturated"],
    selection_event_times=range(0, 13),
    threshold="multiplier-bootstrap",
    cluster="unit_id",
)
fit.fit(df)
fit.selection_summary()
fit.target_curve()
```

The internal result table should contain one row per dgp × cohort_bin_count × event_bin_count × event_time, with columns:
- estimate;
- standard error;
- target weight denominator;
- active cohorts;
- selected cohort-bin count;
- selected event-bin count;
- comparison statistic against each finer model.
10. Numerical Experiment: Current One-Axis Prototype
The current numerical experiment uses nine treated adoption cohorts with uneven cohort sizes. The candidate path is
\[ 1,\ 2,\ 3,\ 5,\ 9 \]
adjacent adoption-cohort bins, where one bin is the naive pooled event study and nine bins is the saturated endpoint. In the current prototype, all specifications use the same saturated event-time basis and are mapped into the same aggregate event-study target. The next implementation step is to cross this cohort path with the event-time path described above.
The gray curves in the existing figure are the true cohort-specific treatment effects. The black curve aggregates those heterogeneous curves into the true event-study target using treated cohort sizes at each event time. The colored curves show the naive pooled estimate, the saturated aggregate estimate, and the Lepskii-selected estimate. The bands are model-conditional pointwise 95 percent intervals; they do not yet account for selection uncertainty.
Many-cohort DGPs with true aggregate event-study targets, naive pooled estimates, saturated aggregate estimates, and Lepskii-selected estimates. Current prototype varies only cohort pooling; the proposed extension also varies event-time pooling.
The realized selections in this one-axis draw are reported in the selections table at the top of this memo (selected cohort bins by DGP).
Because the true aggregate target is known in the simulation, we can compare realized mean squared error against the true event-study curve. The metric below averages squared error over post-treatment event times,
\[ \operatorname{MSE}_d(a) = \frac{1}{|\mathcal{S}_+|} \sum_{s\in\mathcal{S}_+} \left(\hat\tau_{d,a,s}-\tau_{d,s}\right)^2, \]
where \(d\) indexes the DGP and \(a\) is the estimator. This is not an oracle risk over repeated samples; it is an in-draw diagnostic.
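The in-draw diagnostic is straightforward to compute once estimated and true target curves share a grid (the values below are illustrative):

```python
import numpy as np

def post_mse(tau_hat, tau_true, post_idx):
    """Average squared error over post-treatment event times (in-draw diagnostic)."""
    d = tau_hat[post_idx] - tau_true[post_idx]
    return float(np.mean(d ** 2))

tau_true = np.array([0.0, 0.0, 1.0, 1.5, 2.0])   # two leads, three lags
tau_hat = np.array([0.1, -0.1, 1.1, 1.4, 2.3])
mse = post_mse(tau_hat, tau_true, post_idx=[2, 3, 4])
```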
Post-treatment MSE against the true aggregate event-study target in the current one-axis prototype.
| DGP | Name | Latent heterogeneity | Lepskii MSE | Naive MSE | Saturated MSE |
|---|---|---|---|---|---|
| 0 | Common path | one latent response type | 0.000322 | 0.000322 | 0.000203 |
| 1 | Early vs late | two adjacent adoption regimes | 0.000492 | 0.019401 | 0.000715 |
| 2 | Local window shock | localized adoption-window shock | 0.001970 | 0.001970 | 0.000524 |
| 3 | Log vs shark vs sin | three F-test-style shape families | 0.001830 | 0.007058 | 0.001073 |
| 4 | No pooling: peak timing | every cohort has a distinct peak time and width | 0.002344 | 0.002344 | 0.001247 |
| 5 | No pooling: signatures | every cohort has a distinct response signature | 0.001302 | 0.002777 | 0.000750 |
| 6 | No pooling: signed paths | every cohort has a distinct sign and path shape | 0.009632 | 0.004038 | 0.004059 |
| 7 | Response timing | same long-run level, different response timing | 0.000892 | 0.000892 | 0.004079 |
| 8 | Smooth dose gradient | continuous timing-gradient heterogeneity | 0.009507 | 0.015462 | 0.002215 |
| 9 | Three epochs | three adjacent adoption regimes | 0.000321 | 0.036294 | 0.000696 |
The current results show the mechanism works in simple cases but also reveal why the two-way version matters. If the true surface varies smoothly over event time, forcing a saturated event-time basis can make all cohort comparisons noisy. If the true surface varies over cohorts but is simple over event time, event-time pooling can recover power for detecting the relevant cohort structure. The reverse is also possible: if cohorts are similar but dynamics are nonlinear, cohort pooling should remain coarse while event-time pooling refines.
11. Simulation Designs Needed for the Two-Way Version
The next simulation suite should include DGPs that separately stress the two axes.
11.1 Cohort-Only Structure
Cohorts differ by adoption regime, but each regime has a smooth common dynamic shape. The oracle should select multiple cohort bins but few event-time bins.
11.2 Time-Only Structure
All cohorts share a dynamic path with short-run, medium-run, and long-run components. The oracle should select one cohort bin but several event-time bins.
11.3 Tensor Structure
Early and late cohorts differ only in long-run effects, not short-run effects. The oracle should select both cohort and event-time bins, but not the fully saturated surface.
11.4 Local Shock
A narrow adoption-window shock affects a subset of cohorts and a subset of event times. This tests whether rectangular bins are too crude and whether the grid needs localized refinements.
11.5 No Pooling
Every cohort has a distinct, non-smooth event-time path. The oracle should move toward the saturated endpoint on both axes.
12. What Needs to Be Proved
A formal paper needs at least four pieces.
12.1 Identification of the Saturated Target
Under no anticipation, parallel trends, and overlap, the saturated regression or chosen heterogeneity-robust estimator identifies \(\theta_{j,s}\) for all supported cells. Therefore \(L\theta\) is identified.
12.2 Projection Bias of Restricted Models
For each restricted model \(M_{a,b}\), characterize the estimand as a projection of the saturated surface:
\[ \theta_{a,b}^{proj} = H_{a,b}(H_{a,b}'\Omega H_{a,b})^{-1}H_{a,b}'\Omega\theta, \]
for an appropriate design-weight matrix \(\Omega\). The target bias is
\[ B_{a,b}=L(\theta_{a,b}^{proj}-\theta). \]
This makes clear that partially pooled models are biased only to the extent that the true surface is not well approximated by the selected rectangles or basis.
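A direct numerical check of this point: when the true surface lies in the model span, the \(\Omega\)-weighted projection reproduces it exactly, so the restricted model has zero target bias (a sketch with a hand-picked \(H\) and diagonal \(\Omega\)):

```python
import numpy as np

def projection_estimand(H, Omega, theta):
    """Omega-weighted projection of the saturated surface onto the model span:
    theta_proj = H (H' Omega H)^{-1} H' Omega theta."""
    A = H.T @ Omega @ H
    return H @ np.linalg.solve(A, H.T @ Omega @ theta)

# Two-block model: cells {0, 1} pooled, cell {2} separate.
H = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Omega = np.diag([1.0, 2.0, 1.0])
theta_in_span = np.array([0.5, 0.5, 2.0])
theta_proj = projection_estimand(H, Omega, theta_in_span)
```

For a surface outside the span, the pooled block coefficient becomes the \(\Omega\)-weighted within-block mean, and the gap to \(\theta\) is exactly the approximation error \(\rho_{a,b}\).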
12.3 Uniform Asymptotic Linearity Over the Path
Show that, uniformly over the finite candidate grid,
\[ \sqrt n(\hat\tau_{a,b}-\tau_{a,b}^{proj}) = \frac{1}{\sqrt n}\sum_i \varphi_{i,a,b}+o_p(1). \]
For a fixed finite grid, this should follow from standard cluster-robust least squares arguments plus uniform nonsingularity of the residualized design.
12.4 Lepskii Oracle Inequality
With calibrated threshold \(c_n\), prove a finite-grid oracle statement of the form
\[ \|\hat\tau_{\hat a,\hat b}-\tau\|_W^2 \le C\min_{a,b}\left\{ \|B_{a,b}\|_W^2 + \operatorname{pen}_{a,b} \right\} + r_n, \]
with high probability, where \(\operatorname{pen}_{a,b}\) is a variance or critical-value penalty and \(r_n\) is small. This is the formal version of the bias-variance story.
13. Practical Reporting Template
An empirical implementation should report:
- the selected number of cohort bins and event-time bins;
- the actual cohort and event-time partitions;
- the selected event-study curve;
- the pooled and saturated curves for reference;
- a heatmap of accepted/rejected candidate models over the \((K,R)\) grid;
- comparison statistics that drove selection;
- threshold sensitivity;
- whether bands are model-conditional or selection-adjusted.
This reporting is important because the selected model is itself substantive. If the procedure selects early-versus-late cohorts and short-run-versus-long-run event-time bins, that is a concise empirical summary of the heterogeneity.
14. Discussion
Adaptive partial pooling reframes event-study heterogeneity as an estimation problem rather than a binary diagnostic. The saturated estimator provides bias protection under arbitrary cohort-by-event-time heterogeneity, but it can be noisy. The pooled event study is precise, but fragile when cohort dynamics differ. A Lepskii selector over nested cohort and event-time restrictions gives a data-dependent middle ground: estimate the target curve with the coarsest model that remains statistically close to more flexible alternatives.
The two-axis version is important. Cohort pooling alone cannot exploit smooth or piecewise-constant event-time structure. Event-time pooling alone cannot handle adoption-regime heterogeneity. The tensor grid lets the data choose a simple surface when simple is adequate and a richer surface when the evidence demands it.
15. Next Implementation Steps
- Add event-time partition paths to `code/lepskii.py`.
- Refactor the candidate-model output so every model expands to the same full event-time grid before aggregation.
- Cross the existing cohort path `[1, 2, 3, 5, 9]` with event-time paths such as `[3, 5, 9, "saturated"]`.
- Replace the diagonal comparison variance with a cluster multiplier bootstrap over the whole path.
- Add grid heatmaps for selected/accepted models and comparison statistics.
- Run the two-axis simulation designs in Section 11.
The immediate research claim should be modest: the method is an adaptive, interpretable way to move between pooled and saturated event studies. The strong paper claim requires the bootstrap calibration and an oracle-style finite-grid risk result.