State space models (SSM) are a tonne of fun. I sneaked one into a post I did a while ago. In that post, I recreated an analysis using a state space model in which the hidden state, the true $\beta$s, followed a Gaussian Random Walk, and what we observed was the growth in GDP. In this post, I'm going to explore a generalised version of the model - the linear-Gaussian SSM (LG-SSM).

The notation I am following is from Chapter 18 of Murphy’s Machine Learning book.

You can check out the (messy) notebook here.

## What’s an LG-SSM

A state space model is like an HMM, which I wrote about in these two blog posts. Instead of the hidden states being discrete, they are now continuous. So the model is:

\[\begin{aligned} \mathbf{z}_t &= g(\mathbf{u}_t, \mathbf{z}_{t-1}, \mathbf{\epsilon}_t)\\ \mathbf{y}_t &= h(\mathbf{z}_t, \mathbf{u}_t, \mathbf{\delta}_t) \end{aligned}\]

$\mathbf{z}_t$ is our hidden state that evolves as a function, $g$, of:

- the previous state, $\mathbf{z}_{t-1}$,
- the input, $\mathbf{u}_t$, and
- some noise $\mathbf{\epsilon}_t$.

What we observe is $\mathbf{y}_t$. This is some function, $h$, of:

- the hidden state, $\mathbf{z}_t$,
- the input, $\mathbf{u}_t$, and
- some measurement error $\mathbf{\delta}_t$.

If we have a model where $g$ and $h$ are both linear functions and both of those error terms are Gaussian, we have a **linear-Gaussian SSM (LG-SSM)**. More explicitly, we have:

\[\begin{aligned} \mathbf{z}_t &= \mathbf{A}_t \mathbf{z}_{t-1} + \mathbf{B}_t \mathbf{u}_t + \mathbf{\epsilon}_t\\ \mathbf{y}_t &= \mathbf{C}_t \mathbf{z}_t + \mathbf{D}_t \mathbf{u}_t + \mathbf{\delta}_t \end{aligned}\]

and the system and observation noises are Gaussian:

\[\begin{aligned} \epsilon_t &\sim \mathcal{N}(0, \mathbf{Q}_t)\\ \delta_t &\sim \mathcal{N}(0, \mathbf{R}_t) \end{aligned}\]

In the growth regression model, $A_t$ was 1, $C_t$ was the GDP level, and $B_t$ and $D_t$ were 0.

## Let’s simulate some data

Note a few simplifications:

- We are just using scalars, but $\mathbf{z}_t$ can be multi-dimensional, in which case $Q$ and $A$ become appropriately sized square matrices. A local linear trend model is an example.
- Similarly, so can $\mathbf{u}_t$.
- Further, these parameters can change over time.
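The simulation boils down to looping the two LG-SSM equations forward in time. Here's a minimal scalar sketch; the parameter values below are illustrative stand-ins, not necessarily the ones used in the notebook:

```python
import numpy as np

def simulate_lgssm(T, A, B, C, D, Q, R, u, rng):
    """Simulate a scalar LG-SSM:
    z_t = A z_{t-1} + B u_t + eps_t,   eps_t ~ N(0, Q)
    y_t = C z_t     + D u_t + delta_t, delta_t ~ N(0, R)
    """
    z = np.zeros(T)
    y = np.zeros(T)
    z_prev = 0.0
    for t in range(T):
        z[t] = A * z_prev + B * u[t] + rng.normal(0.0, np.sqrt(Q))
        y[t] = C * z[t] + D * u[t] + rng.normal(0.0, np.sqrt(R))
        z_prev = z[t]
    return z, y

rng = np.random.default_rng(0)
T = 200
u = np.ones(T)  # constant input, for illustration
z, y = simulate_lgssm(T, A=0.9, B=0.5, C=2.0, D=0.0, Q=0.1, R=0.5, u=u, rng=rng)
```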

## Kalman filtering / smoothing

The nice thing about LG-SSM is that if you know the parameters of the system, you can do exact Bayesian filtering to recover the hidden state. A lot has been written about it so I won’t go into this in too much detail here. I’ll just leave the algorithm from Murphy’s book here:

You can check out the simple implementation in the notebook. Here are the filtered values we get for $\mathbf{z}_t$:
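For reference, the scalar predict/update recursion can be sketched like this. This is not the notebook's code, and the simulation parameters are illustrative:

```python
import numpy as np

def kalman_filter_scalar(y, u, A, B, C, D, Q, R, mu0=0.0, P0=1.0):
    """Kalman filter for a scalar LG-SSM; returns filtered means and variances."""
    T = len(y)
    mus, Ps = np.zeros(T), np.zeros(T)
    mu, P = mu0, P0
    for t in range(T):
        # Predict: push the previous posterior through the dynamics
        mu_pred = A * mu + B * u[t]
        P_pred = A * A * P + Q
        # Update: correct the prediction with the observation
        S = C * C * P_pred + R        # innovation variance
        K = P_pred * C / S            # Kalman gain
        mu = mu_pred + K * (y[t] - (C * mu_pred + D * u[t]))
        P = (1.0 - K * C) * P_pred
        mus[t], Ps[t] = mu, P
    return mus, Ps

# Simulate a simple system, then filter it back out
rng = np.random.default_rng(1)
T, A, C, Q, R = 100, 0.9, 1.0, 0.1, 0.5
z, y = np.zeros(T), np.zeros(T)
for t in range(T):
    z[t] = A * (z[t - 1] if t else 0.0) + rng.normal(0, np.sqrt(Q))
    y[t] = C * z[t] + rng.normal(0, np.sqrt(R))
mus, Ps = kalman_filter_scalar(y, np.zeros(T), A, 0.0, C, 0.0, Q, R)
```

With the true parameters, the filtered means track the hidden state much more tightly than the raw observations do.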

Kalman filters tend to be quite robust to getting the parameters a bit wrong, which is good since you don't always know them exactly. Here's the filtering with the following parameters:

And if you are smoothing (using data from all $T$ time periods to estimate the value at time $t$), then you do even better:
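The standard way to do this is the Rauch-Tung-Striebel smoother: a forward filter pass, then a backward pass that folds in the later observations. A scalar sketch (again not the notebook's implementation, with illustrative parameters and no inputs):

```python
import numpy as np

def kalman_smooth_scalar(y, A, C, Q, R, mu0=0.0, P0=1.0):
    """Forward Kalman filter + backward RTS smoother for a scalar LG-SSM."""
    T = len(y)
    mu_f, P_f = np.zeros(T), np.zeros(T)   # filtered moments
    mu_p, P_p = np.zeros(T), np.zeros(T)   # one-step-ahead predictions
    mu, P = mu0, P0
    for t in range(T):                     # forward (filtering) pass
        mu_p[t] = A * mu
        P_p[t] = A * A * P + Q
        K = P_p[t] * C / (C * C * P_p[t] + R)
        mu = mu_p[t] + K * (y[t] - C * mu_p[t])
        P = (1.0 - K * C) * P_p[t]
        mu_f[t], P_f[t] = mu, P
    mu_s, P_s = mu_f.copy(), P_f.copy()
    for t in range(T - 2, -1, -1):         # backward (smoothing) pass
        J = P_f[t] * A / P_p[t + 1]        # smoother gain
        mu_s[t] = mu_f[t] + J * (mu_s[t + 1] - mu_p[t + 1])
        P_s[t] = P_f[t] + J * J * (P_s[t + 1] - P_p[t + 1])
    return mu_f, P_f, mu_s, P_s

rng = np.random.default_rng(2)
A, C, Q, R, T = 0.9, 1.0, 0.1, 0.5, 100
z = np.zeros(T)
for t in range(1, T):
    z[t] = A * z[t - 1] + rng.normal(0, np.sqrt(Q))
y = C * z + rng.normal(0, np.sqrt(R), T)
mu_f, P_f, mu_s, P_s = kalman_smooth_scalar(y, A, C, Q, R)
```

The smoothed variances are never larger than the filtered ones, which is the "you do even better" in numbers.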

## PYMC3 to infer parameters

PYMC3 has some time-series methods implemented for you. If you squint, the hidden process $\mathbf{z}_t$ is pretty similar to an $AR1$ process. The only difference is that instead of:

\[z_{t + 1} = A z_t + \epsilon_t\]

we have:

\[z_{t + 1} = A z_t + B u_t + \epsilon_t\]

So let's just modify the $AR1$ implementation for this change:
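The change is just a shift in the mean of each innovation. To make that concrete without pulling in PyMC3, here is the log-likelihood the modified distribution would compute, in plain NumPy. Following `pm.AR1`, the first point is scored under the stationary $AR1$ marginal; ignoring the drift's effect on that marginal is a simplifying assumption here:

```python
import numpy as np
from scipy.stats import norm

def ar1_with_input_logp(z, u, A, B, sigma):
    """Log-likelihood of z under z_{t+1} = A z_t + B u_t + eps_t, eps_t ~ N(0, sigma^2).
    First observation uses the stationary AR1 marginal (requires |A| < 1)."""
    init_sd = sigma / np.sqrt(1.0 - A**2)
    logp = norm.logpdf(z[0], loc=0.0, scale=init_sd)
    # The only change vs plain AR1: the innovation mean gains a + B*u term
    innov_mean = A * z[:-1] + B * u[:-1]
    logp += norm.logpdf(z[1:], loc=innov_mean, scale=sigma).sum()
    return logp
```

In the PyMC3 version, the same `+ B * u` shows up in the `logp` of the custom distribution.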

That’s it. Easy. Now let’s fit the model:

That splitting of the $\sigma$ is a trick I picked up from the PyMC3 Discourse. The total variation needs to be distributed between the two levels. We'll introduce another parameter, $\alpha$, that determines the allocation of the variation to each.
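Numerically, the split might look like the sketch below. The fixed values stand in for what would be priors in the real model (e.g. a HalfNormal total scale and a Beta-distributed $\alpha$); the exact priors are in the notebook:

```python
import numpy as np

sigma_total = 1.3   # stand-in for a sampled total scale (e.g. HalfNormal)
alpha = 0.2         # stand-in for a sampled mixing fraction in (0, 1)

# Allocate the total variance between the system and observation levels
sigma_process = np.sqrt(alpha) * sigma_total
sigma_obs = np.sqrt(1.0 - alpha) * sigma_total
```

By construction, `sigma_process**2 + sigma_obs**2 == sigma_total**2`, so the sampler explores one overall scale plus one allocation parameter instead of two free scales.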

Those divergences seem to happen when $\sigma_1$ gets too small. We could try other parameterisations to shake those off. Overall, we seem to have recovered the parameters pretty well.

And here’s the posterior for the hidden state:

## PYMC3 with non-stationary parameters

An SSM where the parameters don't change over time is called stationary (or time-invariant). Here's one where $A$ is not static but rather changes as a cosine function.
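Simulating that is a small tweak to the earlier loop: $A$ becomes a vector indexed by time. The amplitude, frequency, and noise scales below are illustrative choices, not necessarily the notebook's:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 300
t = np.arange(T)
A_t = 0.8 * np.cos(2 * np.pi * t / 100)   # time-varying transition coefficient

z = np.zeros(T)
for i in range(1, T):
    # Same dynamics as before, but A changes at every step
    z[i] = A_t[i] * z[i - 1] + rng.normal(0, 0.3)
y = z + rng.normal(0, 0.5, T)             # noisy observations of the hidden state
```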

Here’s what the data look like:

### Fitting the models

The key thing we are changing is that we are modelling $A$ as a Gaussian Random Walk, so $A$ can change slowly over time.

### Results

Check out the notebook. Divergences abound. Maybe you can fix it. But I just want to show you what the posterior looks like:

And most interestingly, what we learn about $A$:

Not great but generally there.

## Conclusions

When the observation error in $y_t$ and the system error in $z_t$ are both large, these models do well. If not, you can probably use a simpler model and do just fine.

### The bloopers reel

Fitting these models is tricky. I was listening to Alex Andorra's podcast (good podcast, check it out), and they often talk about how tricky these models can be. I spent a lot of time debugging divergences and failed chains. One example: specifically telling the model that $A$ follows a cosine of some frequency. Or even giving it the frequency but saying $A$ follows a sine with some phase. You'd think these models would fit better; I couldn't figure out why they don't. I also tried using `pm.Bound` to restrict $A$ to be between -1 and 1. You'd think that would help it along. Nope - I guess the gradients go funny at the edges when you use `pm.Bound`.

In the model above, try making `C_` and `B_` Normals instead of Half Normals. You see something interesting: the posterior for these values is bimodal. And your `rhat` values are wild, since your chains might be exploring different modes. That makes sense: the hidden state can be inverted as long as you invert the coefficient `C_` as well. Obvious now in hindsight, but not something I had thought about before I ran the model.

At some stage, I'd love to do a post of just my hacky failures. All those attempts to fit fancy models that came to nothing. Also, I should talk about how good the PyMC3 Discourse is.