Last month, I did a post on how you could setup your HMM in pymc3. It was beautiful, it was simple. It was a little too easy. The inference button makes setting up the model a breeze. Just define the likelihoods and let pymc3 figure out the rest.
I have been reading Murphy’s thesis for work and decided it’d be a whole lot of fun to implement the EM algorithm for learning. So here we go. The code can be found online here. Murphy’s Machine Learning book was very useful when coding up this algorithm.
Since we are going to be getting into the weeds a bit, let’s define some terms. We’re going to use the notation from Murphy’s thesis. Time goes from $1 … T$.
- $y_{1:t} = (y_1,…,y_t)$: the observations up to present time, $t$.
- $X_t$: The hidden state at time $t$.
- $P(X_t \vert y_{i:t})$: Belief state at time $t$.
- $P(X_t \vert X_{t-1})$: State-transition function. We assume we have a first-order Markov with transition matrix $A$.
- $\pi$: $P(X_1)$ - the initial state
There are a few types of inferences as per Murphy:
- Filtering: Compute the belief state
- Smoothing: Compute $P(X_t\vert y_{1:T})$ offline, given all the evidence.
- Fixed lag smoothing: Slightly lagged filtering - $P(X_{t-l}\vert y_{1:t})$
- Prediction: Predict the future - $P(X_{t+l}\vert y_{1:t}$)
We’re just going to look at the first two: filtering and smoothing. Then use some of these functions to implement an EM algorithm for learning the parameters.
Generate data
We’ll use the utility classes from the last post to generate some data:
vals
are all we observe.
Filtering - The forwards algorithm
Check out Murphy for the full derivation but the forward algorithm boils down to this:
\[\begin{aligned} \mathbf{\alpha}_t &= P(X_t \vert y_{i:t})\\ \mathbf{\alpha}_t &\propto O_t A' \alpha_{t-1} \end{aligned}\]where $A$ is the Markov transition matrix that defines $P(X_t \vert X_{t-1})$ and $O_t(i, i) = P(y_t \vert X_t = i)$ is a diagonal matrix containing the conditional likelihoods at time $t$.
Note that $\alpha_t$ is recursively defined with the base case being:
\[\alpha_1 \propto O_1\pi\]And here it is in python:
Here’s the normalize function that is called in forward_pass
that we’ll use again. Pretty straightforward stuff:
Here’s what our filtered signal looks like.
The results are pretty good. You could make the problem a little harder by making the two lambdas closer together or throwing in some noise.
Smoothing - The forwards-backwards algorithm
I like how Murphy explains why uncertainty will be lower in smoothing:
To understand this intuitively, consider a detective trying to figure out who committed a crime. As he moves through the crime scene, his uncertainty is high until he finds the key clue; then he has an “aha” moment, his uncertainty is reduced, and all the previously confusing observations are, in hindsight, easy to explain.
Here we define two new terms:
\[\begin{aligned} \beta_t(j) &= P(y_{t+1:T}|x_t = j)\\ \beta_t(j) &= AO_{t+1}\beta_{t+1}\\ \\ \gamma_t(j) &= P(x_t = j | y_{1:T})\\ \gamma_t(j) &\propto \alpha_t(j) \beta_t(j)\\ \end{aligned}\]As with $\alpha$, $\beta$ is recursively defined with the base case:
\[\beta_T(i) = 1\]Let’s code this $\beta_t$ up in python:
and then we can put the two together to get $\gamma$:
We were already doing quite well but this improves the AUC even further.
EM (Baum-Welch) for parameter learning
All that seems fine but we gave the algorithm $A$, $\pi$, and even the parameters associated with the two hidden states. What we learned using pymc3 last time was all of the parameters using just the observations and a few assumptions.
We’ll do that again using the EM algorithm.
Two-slice distributions
We are going to code up the two-slice distribution since it’s about to come in handy. It is defined as follows:
\[\xi_{t-1,t|T}(i, j) = P(X_{t-1} = i, X_t = j | y_{1:T})\]and Murphy shows that it can be computed as follows:
\[\xi_{t-1,t|T}(i, j) = A \circ (\alpha_t (O_{t+1} \circ \beta_{t+1})^T)\](Note: The formula in Murphy’s thesis didn’t make sense to me so this is going off his book.)
So let’s do it:
E-step
Ugh. There is a tonne of latex here to write. So I’m going take the easy way out and just paste it from the book. Sorry Mr. Murphy if this breaks any laws. Let me know and I’ll remove it immediately. Also, I’m a big fan of your work.
This also means that the notation is a little different. $z$ is the hidden state and $\mathbf{x}$ is the observation.
Since we only have one process, let’s just drop the summation over $N$. In code, this is:
Wow. Such ease. Much elegant.
M-step
This step is even easier. I won’t bother with the math, since the code is quite obvious:
Running the EM algorithm
We just need to run the two step until we converge. EM often gets stuck at local minima so you need to be careful how you initialize the variables. Check out Murphy 17.5.2.3 for tips on how to tackle that.
Check out the notebook for how it does in learning $A$, $\pi$, and the parameters for the hidden state. It’s not perfect (and of course, no credible intervals like in the bayesian case) but it’s decent.
And here’s what the smoothed signal looks like:
Conclusion
I enjoyed that a lot. Check out the notebook here. Might do a Kalman Filter one next.