Policy-aligned CATE estimation

from pathlib import Path
import pandas as pd

# Load the per-method summary metrics aggregated across replications.
BASE = Path('/Users/alal/tmp/cate_policy')
ASSETS = BASE / 'report_assets'
summary = pd.read_csv(ASSETS / 'summary.csv')

# Round every numeric column (everything after `method`) to four decimals for display.
summary_rounded = summary.copy()
for col in summary_rounded.columns[1:]:
    summary_rounded[col] = summary_rounded[col].astype(float).round(4)

What this is

I pulled the TeX source for arXiv:2512.13400 and implemented the paper’s linear M-estimator in trex.policy_learning.policyCATElearner.

The comparison below uses the paper’s simple quadratic DGP, but with a deliberately misspecified linear CATE model. That is the case where the paper argues decision-focused fitting should help most.

  • treatment cost: c = 1
  • train size per replication: 4000
  • test size per replication: 30000
  • replications: 25
  • regret: oracle profit minus realized policy profit (see the sketch below)
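
For concreteness, here is a minimal sketch of how these test-set quantities can be computed, assuming the true CATE tau_true is available from the simulated DGP and profit is measured per test unit relative to treating no one (the exact accounting behind the runs above may differ):

import numpy as np

def policy_metrics(tau_hat, tau_true, c=1.0):
    # Plug-in policy: treat when the estimated CATE exceeds the treatment cost.
    treat = tau_hat > c
    profit = np.mean(np.where(treat, tau_true - c, 0.0))   # realized policy profit
    oracle = np.mean(np.clip(tau_true - c, 0.0, None))     # profit of the oracle assignment
    return {"profit": profit,
            "regret": oracle - profit,
            "mse": np.mean((tau_hat - tau_true) ** 2)}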

Methods compared

  • Oracle assignment
  • econml.dml.LinearDML (fit as sketched just after this list)
  • policyCATElearner with logistic surrogate at three σ values
  • policyCATElearner with the uniform surrogate, which collapses to transformed-outcome OLS
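
As one concrete point of reference, the LinearDML baseline can be fit roughly as follows. This is a sketch: the array names y_train, w_train, X_train, X_test and the estimator options are placeholders, not the exact configuration behind the runs above.

from econml.dml import LinearDML

est = LinearDML(discrete_treatment=True, random_state=0)
est.fit(Y=y_train, T=w_train, X=X_train)   # cross-fitted nuisances, then a linear CATE model
tau_hat = est.effect(X_test)               # estimated CATE on the held-out draw
policy = tau_hat > 1.0                     # treat when the estimated CATE exceeds the cost c = 1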

Summary table

summary_rounded
  method                                  profit_mean  profit_sd  regret_mean  regret_sd  mse_mean  mse_sd
0 Oracle                                       0.4451     0.0024       0.0000     0.0000    0.0000  0.0000
1 policyCATE logistic σ=0.25                   0.4432     0.0041       0.0018     0.0034    2.6670  2.2311
2 policyCATE logistic σ=0.75                   0.4126     0.0153       0.0325     0.0152    0.5249  0.0797
3 policyCATE logistic σ=2.0                    0.3833     0.0175       0.0617     0.0175    0.4639  0.0142
4 policyCATE uniform (OLS special case)        0.3771     0.0177       0.0680     0.0176    0.4615  0.0103
5 LinearDML                                    0.3763     0.0047       0.0688     0.0042    0.4505  0.0029

Profit-regret tradeoff

Fitted CATE curves on one draw

Paper distillation

Here is the implementation-level distillation of the paper.

Core idea

The paper starts from the firm’s targeting rule: treat when the true CATE exceeds the treatment cost \(c\).

Instead of optimizing a hard threshold at \(c\), it randomizes the threshold, drawing \(C\) from a distribution \(F_C\) centered at \(c\) with scale \(\sigma\), and maximizes the resulting smooth surrogate. At the population level, for a scalar candidate effect \(\tau\), the objective is

\[ q(\tau)=\int_{-\infty}^{\tau}(\tau_0-u)f_C(u)\,du, \]

which is Fisher consistent and uniquely maximized at the true effect \(\tau = \tau_0\): its derivative is \((\tau_0-\tau)\,f_C(\tau)\), which changes sign only at \(\tau_0\) when \(f_C\) is strictly positive.

An equivalent form is

\[ F_C(\tau)\{\tau_0-\kappa_C(\tau)\}, \qquad \kappa_C(\tau)=\mathbb{E}[C\mid C\le \tau]. \]
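
The equivalence is a one-line check using the definition of the truncated mean:

\[ F_C(\tau)\,\kappa_C(\tau)=\int_{-\infty}^{\tau}u\,f_C(u)\,du \quad\Longrightarrow\quad F_C(\tau)\{\tau_0-\kappa_C(\tau)\}=\int_{-\infty}^{\tau}(\tau_0-u)\,f_C(u)\,du=q(\tau). \]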

Sample estimator

Using the transformed outcome

\[ Y_i^*=Y_i\left(\frac{W_i}{e(X_i)}-\frac{1-W_i}{1-e(X_i)}\right), \]

the paper proposes the M-estimator

\[ \hat\theta= \arg\max_{\theta\in\Theta} \frac1n\sum_{i=1}^n F_C[\tau(X_i;\theta)] \Big\{Y_i^* - \kappa_C[\tau(X_i;\theta)]\Big\}. \]

So in code, each observation contributes a smooth objective \(q_i(\theta)\) rather than a plain squared-error term.
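
A minimal sketch of those two pieces for a linear CATE model, assuming known propensities e_hat and callables F_C and kappa_C for the chosen threshold distribution (all names here are mine, not the trex API):

import numpy as np

def transformed_outcome(y, w, e_hat):
    # Y* = Y * (W / e(X) - (1 - W) / (1 - e(X)))
    return y * (w / e_hat - (1.0 - w) / (1.0 - e_hat))

def sample_objective(theta, X, y_star, F_C, kappa_C):
    # (1/n) sum_i F_C(tau_i) * (Y*_i - kappa_C(tau_i)) with tau(x; theta) = x @ theta
    tau = X @ theta
    return np.mean(F_C(tau) * (y_star - kappa_C(tau)))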

Gradient structure and decision attention

The key gradient identity is

\[ \nabla_\theta q_i(\theta) = f_C(\tau_i)\,(Y_i^*-\tau_i)\,\nabla_\theta \tau_i, \qquad \tau_i=\tau(X_i;\theta). \]

This is basically weighted residual fitting, but the weights are endogenous and largest near the decision boundary. That is the paper’s “Decision Attention” mechanism.

  • large \(\sigma\): broad weighting, close to ordinary transformed-outcome MSE
  • small \(\sigma\): concentrated weighting near \(\tau \approx c\), closer to direct policy optimization
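
A minimal sketch of that gradient for a linear CATE model under the logistic threshold (the function name and defaults are illustrative):

import numpy as np
from scipy.special import expit

def decision_attention_gradient(theta, X, y_star, c=1.0, sigma=0.25):
    # f_C(tau_i) * (Y*_i - tau_i) * x_i, averaged over observations,
    # with tau(x; theta) = x @ theta and a logistic threshold at c with scale sigma.
    tau = X @ theta
    p = expit((tau - c) / sigma)
    weights = p * (1.0 - p) / sigma          # f_C(tau): largest where tau is near the cost c
    return X.T @ (weights * (y_star - tau)) / len(y_star)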

Closed-form choices

Logistic threshold

This is the paper’s default:

\[ q(\tau_i)=G(\bar\tau_i)(Y_i^*-c)+\sigma H[G(\bar\tau_i)], \qquad \bar\tau_i=(\tau_i-c)/\sigma, \]

where \(G\) is the logistic CDF and \(H(p)=-p\log p-(1-p)\log(1-p)\) is binary entropy.

This directly matches entropy-regularized policy learning.
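
A direct transcription of this closed form, for a vector of candidate effects tau and transformed outcomes y_star (the clipping guard and function name are mine):

import numpy as np
from scipy.special import expit

def logistic_surrogate(tau, y_star, c=1.0, sigma=0.25):
    # G(tau_bar) * (Y* - c) + sigma * H(G(tau_bar)), with tau_bar = (tau - c) / sigma.
    p = np.clip(expit((tau - c) / sigma), 1e-12, 1 - 1e-12)   # G(tau_bar), clipped for the logs
    entropy = -(p * np.log(p) + (1.0 - p) * np.log1p(-p))     # binary entropy H(p)
    return p * (y_star - c) + sigma * entropy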

Normal threshold

\[ q(\tau_i)=\Phi(\bar\tau_i)(Y_i^*-c)+\sigma\phi(\bar\tau_i). \]

Uniform threshold

\[ q(\tau_i)=-(Y_i^*-\tau_i)^2, \]

which recovers transformed-outcome OLS as a special case.

Why this is useful

The framework continuously interpolates between three familiar objects:

  • plug-in CATE estimation via squared error
  • direct policy optimization as \(\sigma \to 0\)
  • entropy-regularized sigmoid policy learning under the logistic choice

So it is not just a new loss. It is a clean bridge between prediction-focused CATE estimation and profit-focused policy learning.

Practical implementation takeaways

For implementation, the paper strongly suggests:

  • start with the logistic surrogate
  • use regularization, especially when \(\sigma\) is small
  • warm start along a continuation path from large \(\sigma\) to small \(\sigma\) (sketched below)
  • tune \(\sigma\) on held-out policy value if profit is the real target

That is exactly why the small-\(\sigma\) model in the comparison above can have much worse global CATE MSE but much lower regret.
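
A minimal sketch of that continuation recipe, again for a linear CATE model with the logistic surrogate; the ridge strength, sigma grid, and optimizer choice here are illustrative, not the settings behind the table above.

import numpy as np
from scipy.special import expit
from scipy.optimize import minimize

def fit_continuation(X, y_star, sigmas=(2.0, 0.75, 0.25), c=1.0, ridge=1e-3):
    # Maximize the ridge-penalized logistic surrogate, warm-starting each fit
    # from the solution at the previous (larger) sigma.
    def neg_objective(theta, sigma):
        p = np.clip(expit((X @ theta - c) / sigma), 1e-12, 1 - 1e-12)
        entropy = -(p * np.log(p) + (1.0 - p) * np.log1p(-p))
        q = p * (y_star - c) + sigma * entropy
        return -(q.mean() - ridge * np.sum(theta ** 2))
    theta = np.zeros(X.shape[1])
    for sigma in sigmas:                     # large sigma first, then tighten toward the decision boundary
        theta = minimize(neg_objective, theta, args=(sigma,), method="L-BFGS-B").x
    return theta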

Reading the results

A few things stand out.

  • Small \(\sigma\) sharply reduces regret, even though it makes global CATE MSE much worse.
  • The uniform special case behaves like plain transformed-outcome OLS.
  • LinearDML is competitive on MSE, but on this misspecified decision problem it still leaves materially more profit on the table than the tighter policy-focused surrogate.
  • This is exactly the paper’s point: if the business objective is treatment assignment, uniform accuracy is not the right target.