Counterfactual Prediction with Time-Varying Treatments + G-Computation

Here I document G-computation, an algorithm for counterfactual prediction in time-series settings.

Set-up

We assume that we have $n$ iid data strings $X_i \sim \mathbb{P}$ where:

$$ X = (\bar{A}_T, \bar{L}_T, Y) = (A_1, L_1, A_2, L_2, \dots, A_T, L_T, Y) $$

Given the iid assumption, we suppress dependence on the subject index $i$. We have a full time series, $t = 1, 2, \dots, T$, for both $A_t$ and $L_t$. Here we let $\bar{A}_t = (A_1, \dots, A_t)$ and $\bar{L}_t = (L_1, \dots, L_t)$ for $t \geq 1$.

Let $A_t$ be a treatment variable on date $t$: when $A_t = 1$ treatment is on and when $A_t = 0$ treatment is off. We let $L_t$ denote a set of confounders on date $t$. In almost all practical settings $L_t$ is actually a vector of variables, but for ease of exposition it suffices to treat it as a single variable; anytime we include $L_t$ in a regression model below, just keep in mind that there are in fact multiple associated coefficients rather than just one. Finally, $Y$ is an outcome of interest.
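To make the notation concrete, here is a tiny made-up example (a hypothetical pandas layout, purely for illustration) of what the data strings might look like in wide format for $T = 2$:

```python
import pandas as pd

# Hypothetical wide-format data for T = 2: one row per subject i, with
# columns ordered as in the data string (A1, L1, A2, L2, Y).
# A_t is a binary treatment indicator, L_t a (here scalar) confounder,
# and Y the final outcome.
df = pd.DataFrame({
    "A1": [1, 0, 1],
    "L1": [2.3, 1.1, 0.7],
    "A2": [1, 0, 0],
    "L2": [2.9, 0.8, 1.4],
    "Y":  [5.1, 2.0, 3.3],
})
print(df)
```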

Because we are interested in counterfactual prediction, we write $Y(\bar{a}_T)$ for the potential outcome when one sets $\bar{A}_T = \bar{a}_T$. If $A_t$ is binary, there are $2^T$ such quantities, one for each possible treatment regime $\bar{a}_T$. In general, these quantities are unobserved, while $Y$ is the observed quantity in our dataset.

A DAG which describes the set-up above is as follows:

For simplicity, we only consider the last two periods, $T - 1$ and $T$, of our data string. It is reasonable to assume that all past confounders and past treatments affect all future confounders, future treatments, and the final outcome. Moreover, we are dealing with observational data and have tried to measure all possible confounders. However, it is likely that there are some unobserved factors, $U$, which influence the confounders $L_t$ and the outcome $Y$. As a result, $U$ has arrows into $L_{T-1}$, $L_T$, and $Y$.

Why can’t we use standard regression?

An intuitive approach would be to assume a linear model as follows:

$$ Y = \alpha + \sum_{t = 1}^{T} \beta_t L_t + \sum_{t = 1}^{T} \omega_t A_t + \epsilon $$

We could fit this with OLS and set $\hat{\mathbb{E}}[Y(\bar{a}_T)] = \hat{\mathbb{E}}[\hat{\mathbb{E}}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T]]$. The inner conditional expectation is the OLS fitted value for a particular sample, evaluated at its observed confounder history with the treatments set to $\bar{a}_T$; the outer expectation is the average of these fitted values over all samples.
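As a concrete illustration of this (ultimately flawed) strategy, here is a minimal sketch on a simulated dataset with $T = 2$; the data-generating process, column names, and coefficient values are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 5_000

# Toy data with time-varying confounding: L2 is affected by A1 and in
# turn affects both A2 and Y (all coefficients are made up).
df = pd.DataFrame({"L1": rng.normal(size=n)})
df["A1"] = rng.binomial(1, 1 / (1 + np.exp(-df["L1"])))
df["L2"] = 0.5 * df["L1"] + 0.3 * df["A1"] + rng.normal(size=n)
df["A2"] = rng.binomial(1, 1 / (1 + np.exp(-df["L2"])))
df["Y"] = df["L1"] + df["L2"] + 0.8 * df["A1"] + 0.8 * df["A2"] + rng.normal(size=n)

# Naive plug-in: fit Y on (L1, A1, L2, A2) by OLS, set the treatment
# columns to the regime of interest, and average the fitted values
# over the observed confounder histories.
features = ["L1", "A1", "L2", "A2"]
ols = LinearRegression().fit(df[features], df["Y"])
naive_estimate = ols.predict(df[features].assign(A1=1, A2=1)).mean()
print(naive_estimate)
```

In this toy simulation, conditioning on $L_2$ blocks the part of $A_1$'s effect that operates through $L_2$, so the plug-in estimate misses part of the total effect, in line with the discussion below.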

The above strategy would be valid if $\bar{A}_T \perp Y(\bar{a}_T) \mid \bar{L}_T$. This would be similar to the usual “selection on observables” assumption made in causal inference. However, the time series dependence in our set-up makes such an assumption highly unlikely.

The core issue boils down to collider bias. From our DAG above, it is clear that we need to condition on $L_T$: it points to both $Y$ and $A_T$ and is thus a confounder. However, notice that both $A_{T-1}$ and $U$ point to $L_T$. Thus $L_T$ is a collider, and including it in our regression induces a correlation between $A_{T-1}$ and $U$. Since $U$ points to $Y$, the conditional independence statement above fails. By conditioning on $L_T$, our DAG now includes the path denoted by the red arrows:

In addition to collider bias, when we condition on $L_T$ we block a causal path of interest, specifically $A_{T-1} \rightarrow L_T \rightarrow Y$. Thus it seems like we need to condition on $L_T$ because it is a confounder, but doing so yields biased predictions of $Y(\bar{a}_T)$.

G-Computation

A solution to the above issue is G-computation. It was originally proposed in R1986 and additional details relevant to our set-up are included in RGH1999. The key is to make the following identification assumption:

Identification Assumption: For all $t = 1, \dots, T$:

$$ A_t \perp Y(\bar{a}_T) \mid (\bar{A}_{t-1}, \bar{L}_t) $$

That is, conditional on the past history of treatments and confounders, treatment at any given time is as good as randomly assigned.

Combined with the standard causal assumptions of positivity and consistency, this assumption makes it possible to estimate $\mathbb{E}[Y(\bar{a}_T)]$ without bias.

Theorem: For any $\bar{a}_T$:

$$ \mathbb{E}[Y(\bar{a}_T)] = \int_{L_1} \int_{L_2} \dots \int_{L_T} \mathbb{E}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T] \prod_{t = 1}^{T} f(L_t \mid \bar{L}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1}) \, dL_T \cdots dL_1 $$

Here the $f$ are the conditional densities of $L_t$ given the past (for $t = 1$, $f(L_1)$ is simply the marginal density of $L_1$).
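For example, with $T = 2$ the formula reads:

$$ \mathbb{E}[Y(a_1, a_2)] = \int_{L_1} \int_{L_2} \mathbb{E}[Y \mid L_1, L_2, A_1 = a_1, A_2 = a_2] \, f(L_2 \mid L_1, A_1 = a_1) \, f(L_1) \, dL_2 \, dL_1 $$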

Proof:

$$
\begin{aligned}
\mathbb{E}[Y(\bar{a}_T)] &= \mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid L_1]]\\
&= \mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid L_1, A_1 = a_1]]\\
&= \mathbb{E}[\mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid L_1, L_2, A_1 = a_1] \mid L_1, A_1 = a_1]]\\
&= \mathbb{E}[\mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid \bar{L}_2, \bar{A}_2 = \bar{a}_2] \mid L_1, A_1 = a_1]]\\
&= \mathbb{E}[\mathbb{E}[\mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid \bar{L}_3, \bar{A}_2 = \bar{a}_2] \mid \bar{L}_2, \bar{A}_2 = \bar{a}_2] \mid L_1, A_1 = a_1]]\\
&= \mathbb{E}[\mathbb{E}[\mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid \bar{L}_3, \bar{A}_3 = \bar{a}_3] \mid \bar{L}_2, \bar{A}_2 = \bar{a}_2] \mid L_1, A_1 = a_1]]\\
&= \dots\\
&= \mathbb{E}[\dots \mathbb{E}[\mathbb{E}[Y(\bar{a}_T) \mid \bar{L}_T, \bar{A}_T = \bar{a}_T] \mid \bar{L}_{T-1}, \bar{A}_{T-1} = \bar{a}_{T-1}] \dots \mid L_1, A_1 = a_1]\\
&= \mathbb{E}[\dots \mathbb{E}[\mathbb{E}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T] \mid \bar{L}_{T-1}, \bar{A}_{T-1} = \bar{a}_{T-1}] \dots \mid L_1, A_1 = a_1]\\
&= \int_{L_1} \int_{L_2} \dots \int_{L_T} \mathbb{E}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T] \prod_{t = 1}^{T} f(L_t \mid \bar{L}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1}) \, dL_T \cdots dL_1
\end{aligned}
$$

The steps alternate between the law of total expectation (which brings in the next $L_t$) and the key identification assumption above (which lets us set $A_t = a_t$). The second-to-last equality relies on consistency, which lets us replace $Y(\bar{a}_T)$ with the observed $Y$ once we have conditioned on $\bar{A}_T = \bar{a}_T$. The last equality relies on positivity, which ensures that the integrand can be evaluated for all possible values of $A_t$ and $L_t$.

A natural estimator $\hat{\mathbb{E}}[Y(\bar{a}_T)]$ simply plugs in a fitted outcome model $\hat{\mathbb{E}}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T]$ and conditional density estimators $\hat{f}(L_t \mid \bar{L}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1})$ learned from one's data. A standard parametric approach is to use GLMs and make a Markov assumption to limit the number of lags. Non-parametric methods, ranging from kernel smoothers to modern ML methods, can also be used here. Correctly specified parametric models will typically converge to the truth at a $\sqrt{n}$-rate with a normal limiting distribution. Such convergence properties are not as easily guaranteed for general non-parametric methods, but they do hold under specific smoothness assumptions. In all cases, standard errors can be obtained via the bootstrap.
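As a sketch of the parametric route with $T = 2$, reusing the toy dataframe `df` simulated in the earlier snippet, one could fit a linear outcome model and a linear Gaussian model for $\hat{f}(L_2 \mid L_1, A_1)$ (here $L_1$ will later be resampled from its empirical distribution, so no model is fit for it); the modelling choices are illustrative assumptions, not requirements:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Outcome model: E[Y | L1, A1, L2, A2], here a linear regression.
outcome_model = LinearRegression().fit(df[["L1", "A1", "L2", "A2"]], df["Y"])

# Confounder model for t = 2: a linear Gaussian model for L2 given
# (L1, A1), i.e. a first-order Markov assumption.  The estimated
# conditional density is N(mean = l2_model.predict(.), sd = l2_resid_sd).
l2_model = LinearRegression().fit(df[["L1", "A1"]], df["L2"])
l2_resid_sd = np.std(df["L2"] - l2_model.predict(df[["L1", "A1"]]), ddof=2)
```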

In general, when using non-linear models, the integral is intractable to compute analytically. Instead, one relies on Monte Carlo simulation. Specifically, for $i \in [N]$, where $N$ is the number of simulation samples, one does the following:

  1. For $t \in [T]$:

    a. Set $A_t = a_t$

    b. Draw $L_t \sim \hat{f}(L_t \mid \bar{L}_{t-1}, \bar{A}_{t-1} = \bar{a}_{t-1})$

  2. Set $\hat{Y}_i(\bar{a}_T) = \hat{\mathbb{E}}[Y \mid \bar{L}_T, \bar{A}_T = \bar{a}_T]$

Then, $\hat{\mathbb{E}}[Y(\bar{a}_T)] = \frac{1}{N} \sum_{i = 1}^{N} \hat{Y}_i(\bar{a}_T)$. To get standard errors, we generate $B$ bootstrap samples and re-run the whole algorithm on each of them, starting with fitting the outcome model and conditional density estimators. A minimal code sketch of this procedure follows.
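Here is that sketch for $T = 2$, reusing `df`, `outcome_model`, `l2_model`, and `l2_resid_sd` from the snippets above (the function name and modelling choices are illustrative assumptions):

```python
import numpy as np
import pandas as pd

def g_computation(df, outcome_model, l2_model, l2_resid_sd, a_bar, n_sims=10_000, seed=None):
    """Monte Carlo G-computation estimate of E[Y(a1, a2)] for T = 2."""
    rng = np.random.default_rng(seed)
    a1, a2 = a_bar

    # Step 1b, t = 1: draw L1 from its empirical distribution.
    l1 = rng.choice(df["L1"].to_numpy(), size=n_sims, replace=True)

    # Step 1b, t = 2: draw L2 from the fitted linear Gaussian model
    # for f(L2 | L1, A1 = a1).
    mean_l2 = l2_model.predict(pd.DataFrame({"L1": l1, "A1": a1}))
    l2 = mean_l2 + rng.normal(scale=l2_resid_sd, size=n_sims)

    # Step 2: evaluate the fitted outcome model with treatments fixed
    # at (a1, a2), then average over the simulated confounder draws.
    sims = pd.DataFrame({"L1": l1, "A1": a1, "L2": l2, "A2": a2})
    return outcome_model.predict(sims).mean()

print(g_computation(df, outcome_model, l2_model, l2_resid_sd, a_bar=(1, 1), seed=0))

# Standard errors: resample subjects with replacement, refit every model
# on the bootstrap sample, and re-run g_computation each time.
boot = []
for b in range(200):
    df_b = df.sample(frac=1.0, replace=True, random_state=b)
    # ...refit outcome_model, l2_model, l2_resid_sd on df_b, then:
    # boot.append(g_computation(df_b, ..., a_bar=(1, 1)))
```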

Marginal Structural Models

An alternative to G-computation is to use marginal structural models (MSMs), which posit a specific parametric form for $\mathbb{E}[Y(\bar{a}_T)] = g(\bar{a}_T, \beta)$ and use IPW estimators of $\beta$. Non-parametric G-computation always implies a particular form of an MSM. Hence, if the models are correctly specified, both techniques should provide the same answers, so as a robustness check one should always try to fit both and make sure the results are similar; if not, one of the models is incorrectly specified. In the parametric case, this equivalence does not hold. There is an issue with parametric G-computation known as the “null paradox”: there may be no setting of one's parametric models that accommodates the case where the outcome is conditionally dependent on past treatment yet the treatment has no causal effect. Hence, if you think you are in a regime where the treatment has no effect and non-parametric models seem infeasible (i.e. you have a small or very noisy sample), you should use MSMs. Finally, there are doubly-robust MSM approaches which may be useful if you are worried about model specification or think you can model the propensity scores well.
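For concreteness, here is a minimal sketch of the basic IPW fitting step for a simple MSM, say $\mathbb{E}[Y(a_1, a_2)] = \beta_0 + \beta_1 (a_1 + a_2)$, again on the toy `df` from above; the logistic propensity models and unstabilized weights are illustrative choices, not a full treatment of MSMs:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Propensity models for each treatment given the observed past.
p1 = LogisticRegression().fit(df[["L1"]], df["A1"])
p2 = LogisticRegression().fit(df[["L1", "A1", "L2"]], df["A2"])

# Unstabilized inverse probability weights:
#   w_i = 1 / prod_t P(A_t = observed a_t | past).
pr1 = p1.predict_proba(df[["L1"]])[:, 1]
pr2 = p2.predict_proba(df[["L1", "A1", "L2"]])[:, 1]
w = 1.0 / (np.where(df["A1"] == 1, pr1, 1 - pr1)
           * np.where(df["A2"] == 1, pr2, 1 - pr2))

# Weighted regression of Y on cumulative treatment fits the MSM
#   E[Y(a1, a2)] = beta_0 + beta_1 * (a1 + a2).
cum_a = (df["A1"] + df["A2"]).to_frame("cum_a")
msm = LinearRegression().fit(cum_a, df["Y"], sample_weight=w)
print(msm.intercept_, msm.coef_)
```

In practice one would typically use stabilized weights and inspect their distribution before trusting the fit.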