If you want to measure the causal effect of a treatment, what you need is a counterfactual: what would have happened to the treated units if they had not received the treatment? Unless your unit is Gwyneth Paltrow in Sliding Doors, you only observe one state of the world, so the key to causal inference is reconstructing the untreated state of the world. Athey et al. show in their paper how matrix completion can be used to estimate this unobserved counterfactual world: you treat the unobserved (untreated) states of the treated units as missing and use a penalized SVD to reconstruct them from the rest of the dataset. If you are familiar with the econometric literature on synthetic controls, fixed effects, or unconfoundedness, you should definitely read the paper; it shows these as special cases of matrix completion with missing data of a specific form. Actually, you should read the paper anyway. Most of it is quite approachable and it’s very insightful.
Also, check out this great twitter thread by Scott Cunningham and Cyrus Samii’s notes on it.
Data setup
Say you have some panel data with $N$ units and $T$ time periods. At some time period $t_{0,n}$ (which can be different for each unit), some of the units get the treatment, so from $t_{0,n}$ to $T$ you never see their untreated state of the world. It is “missing”. For the demo we’ll use the same dataset the authors use in the paper, the Abadie 2010 California smoking data:
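Something along these lines gets it into an $N \times T$ matrix of outcomes (the file name and column names below are placeholders for however you have the data stored):

```python
import numpy as np
import pandas as pd

# Long panel -> N x T outcome matrix. "california_smoking.csv" and the column
# names are placeholders for wherever/however you keep the smoking data.
df = pd.read_csv("california_smoking.csv")   # columns: state, year, cigsale
Y = df.pivot(index="state", columns="year", values="cigsale").values
N, T = Y.shape
```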
Let’s allow each unit to have a different $t_0$ with the minimum being 16 and pick 15 random units to be treated.
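One way to simulate that (the seed and the variable names are arbitrary):

```python
rng = np.random.default_rng(42)

# pick 15 random units to be treated
treated_units = rng.choice(N, size=15, replace=False)

# each treated unit gets its own treatment start t0, no earlier than period 16;
# never-treated units keep t0 = T, i.e. they are observed for the whole panel
t0 = np.full(N, T)
t0[treated_units] = rng.integers(16, T, size=15)

# boolean mask of observed (untreated) cells: True before each unit's t0
mask = np.arange(T)[None, :] < t0[:, None]
```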
What we will observe is `Y_obs`:
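One way to construct it is to blank out the treated cells (NaNs here just so they show up as the white bits when plotted):

```python
# keep the observed cells, blank out the "missing" (treated) ones
Y_obs = np.where(mask, Y, np.nan)
```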
and we’ll try and recreate `Y`. The figure below shows what these datasets look like. The white bits on the right are the “missing” entries in the matrix.
The algorithm
Our job is to reconstruct the matrix $\mathbf{L}^*$ such that:
\[\mathbf{Y} = \mathbf{L}^* + \epsilon\]

Before we get to the estimator for $\mathbf{L}^*$, let’s define a few support functions:
\[\text{shrink}_{\lambda}(\mathbf{A}) = \mathbf{S \tilde{\Sigma} R}^{\text{T}}\]

where $\mathbf{S}$, $\mathbf{\Sigma}$, and $\mathbf{R}$ come from the SVD $\mathbf{A} = \mathbf{S \Sigma R}^{\text{T}}$, and $\mathbf{\tilde{\Sigma}}$ is equal to $\mathbf{\Sigma}$ with each singular value $\sigma_i(\mathbf{A})$ replaced by $\text{max}(\sigma_i(\mathbf{A}) - \lambda, 0)$. So you are doing an SVD and shrinking the singular values towards zero. Here’s the python code for it:
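A minimal numpy version (the name `shrink_lambda` is just for illustration):

```python
import numpy as np

def shrink_lambda(A, threshold):
    """SVD A, soft-threshold the singular values at `threshold`, and rebuild the matrix."""
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    sigma_shrunk = np.maximum(sigma - threshold, 0.0)
    return (U * sigma_shrunk) @ Vt
```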
And then
\[\begin{aligned} \mathbf{P_{\mathscr{O}}(A)} = \begin{cases} A_{it}& \text{if } (i,t) \in \mathscr{O}\\ 0 & \text{if } (i,t) \notin \mathscr{O} \end{cases}, && \mathbf{P^{\perp}_{\mathscr{O}}(A)} = \begin{cases} 0 & \text{if } (i,t) \in \mathscr{O}\\ A_{it}& \text{if } (i,t) \notin \mathscr{O} \end{cases} \end{aligned}\]

In python:
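For example, with a boolean `mask` (True where the entry is observed) playing the role of $\mathscr{O}$:

```python
def P_O(A, mask):
    """Keep the entries in O (the observed, untreated cells), zero out the rest."""
    return np.where(mask, A, 0.0)

def P_O_perp(A, mask):
    """The complement: keep only the entries outside O."""
    return np.where(mask, 0.0, A)
```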
And now for the main algorithm. The paper shows the general form of the estimator, but here we will implement the iterative (probably slower) version. It’s quite beautiful in its simplicity. For $k = 1, 2, \ldots$ define:
\[\mathbf{L}_{k+1}(\lambda, \mathscr{O}) = \text{shrink}_{\frac{\lambda|\mathscr{O}|}{2}} \{ \mathbf{P_{\mathscr{O}}(Y)} + \mathbf{P^{\perp}_{\mathscr{O}}(L_{k}(\lambda, \mathscr{O}))} \}\]

and we initialize it as
\[\mathbf{L}_{1}(\lambda, \mathscr{O}) = \mathbf{P_{\mathscr{O}}(Y)}\]

Note that $\mathscr{O}$ is the set of coordinates of the matrix where the data is not missing, i.e. the cells where the units were not (yet) treated.
We run this until $\mathbf{L}$ converges. Here’s the python code:
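A sketch of the loop, reusing `shrink_lambda`, `P_O`, and `P_O_perp` from above (the tolerance and iteration cap are arbitrary choices):

```python
def matrix_complete(Y_obs, mask, lam, tol=1e-7, max_iter=10_000):
    """Iterate L <- shrink_{lam*|O|/2}(P_O(Y) + P_O_perp(L)) until L stops changing."""
    Y_filled = P_O(np.nan_to_num(Y_obs), mask)   # L_1 = P_O(Y); also zeroes out the NaNs
    threshold = lam * mask.sum() / 2.0
    L = Y_filled
    for _ in range(max_iter):
        L_new = shrink_lambda(Y_filled + P_O_perp(L, mask), threshold)
        if np.linalg.norm(L_new - L) < tol * max(np.linalg.norm(L), 1.0):
            return L_new
        L = L_new
    return L
```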
Cross-validation
We still need to figure out what $\lambda$ should be, so we cross-validate. The implementation below is not perfect since it doesn’t simulate the missingness pattern of the full dataset exactly: I’m picking a random subset of coordinates to hold out as the test set and training on the rest, rather than removing everything after some time $t$ for each unit as I really should. Check out the note in the paper on cross-validating $\lambda$. Still, the implementation below should give you a good sense of how to do it (and a good start if you want to improve it). While you’re at it, you may want to use dask distributed to parallelize it.
Let’s run the cross-validation and check out the results:
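Here’s a rough sketch of how that could look (the $\lambda$ grid, holdout fraction, and number of folds below are arbitrary choices):

```python
def cv_mse(Y_obs, mask, lam, holdout_frac=0.1, n_folds=5, seed=0):
    """Hide a random subset of the observed cells, complete the matrix, and
    score the MSE on the held-out cells. Random cells rather than whole
    post-t0 blocks, so this only approximates the real missingness pattern."""
    rng = np.random.default_rng(seed)
    obs_idx = np.argwhere(mask)
    mses = []
    for _ in range(n_folds):
        held = obs_idx[rng.random(len(obs_idx)) < holdout_frac]
        train_mask = mask.copy()
        train_mask[held[:, 0], held[:, 1]] = False
        L_hat = matrix_complete(Y_obs, train_mask, lam)
        err = Y_obs[held[:, 0], held[:, 1]] - L_hat[held[:, 0], held[:, 1]]
        mses.append(np.mean(err ** 2))
    return np.mean(mses)

lambdas = np.arange(1, 21)
scores = [cv_mse(Y_obs, mask, lam) for lam in lambdas]
best_lambda = lambdas[int(np.argmin(scores))]
```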
Here are the results (darker means smaller MSE); it looks like 9 is the optimal $\lambda$.
Final run and results
Using 9 as our lambda, let’s run it once more and check out the results.
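Something like the following, reusing `matrix_complete` from above (the side-by-side heatmaps are just one way to eyeball the reconstruction):

```python
import matplotlib.pyplot as plt

# one last fit on all the observed data, with the cross-validated lambda
L_hat = matrix_complete(Y_obs, mask, lam=9)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].imshow(Y, aspect="auto")
axes[0].set_title("Y (actual)")
axes[1].imshow(L_hat, aspect="auto")
axes[1].set_title("L (completed)")
plt.show()
```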
Not bad at all! Looks like we lost some resolution there, but we only had 38 records, so it’s pretty decent. I bet that with a larger dataset and more controls it would do even better.
Next steps
We don’t include any covariates here, but the paper shows how you can do that. Athey and team have also made their code and test data available online, which deserves a round of applause; it is still such a rare thing in economics. You can go check out their code here. It also includes an R package, which will be a hell of a lot faster than my quick and dirty code.
The notebook with all my code can be found here. Happy matrix completing!