# Sid Ravinutala

## Data Scientist

• In the last two posts, we explored some features of the logit choice model. In the first, we looked at systematic taste variation and how that can be accounted for in the model. In the second, we explored one of the nice benefits of the IIA assumption: we provided a random subset of alternatives of varying size to each decision maker and were able to use that to estimate the parameters.

• In the last post, we talked about how this property of Independence from Irrelevant Alternatives (IIA) may not be realistic (see the red bus / blue bus example). But if you are comfortable with it and the proportional substitution that it implies, you get to use some nice tricks.

• I’ve been working my way through Kenneth Train’s “Discrete Choice Methods with Simulation” and playing around with the models and examples as I go. Kind of what I did with MacKay’s book. This post and the next have some key takeaways with code from chapter 3 - Logit.

• We use things without knowing how they work. Last time my fridge stopped working, I turned it off and on again to see if that fixed it. When it didn’t, I promptly called the “fridge guy”. If you don’t know how things work, you don’t know when and how they break, and you definitely don’t know how to fix them.

• I just wanted to put up a few animations of HMC and slice samplers that I have been playing around with.

• In this post, I just implement a Gibbs and a slice sampler for a non-totally-trivial distribution. Both of these are vanilla versions: no overrelaxation for Gibbs and no elliptical slice samplers, rectangular hyper-boxes etc. I am hoping you never use these IRL. It is a good intro though.

• Following David MacKay’s book along with his videos online has been a real joy. In lecture 11, as an example of an inference problem, he goes over many variations of the k-means algorithm. Let’s check these out.

• There are two ways of learning and building intuition. From the top down, like fast.ai believes, and from the bottom up, like Andrew Ng’s deep learning course on Coursera. I’m not sure what my preferred strategy is.

• Last month, I did a post on how you could set up your HMM in pymc3. It was beautiful, it was simple. It was a little too easy. The inference button makes setting up the model a breeze. Just define the likelihoods and let pymc3 figure out the rest.

• A colleague of mine came across an interesting problem on a project. The client wanted an alarm raised when the number of problem tickets coming in increased “substantially”, indicating some underlying failure. So there is some standard rate at which tickets are raised, and when something has failed or there is a serious problem, a tonne more tickets are raised. Sounds like a perfect problem for a Hidden Markov Model.

• If you want to measure the causal effect of a treatment what you need is a counterfactual. What would have happened to the units if they had not got the treatment? Unless your unit is Gwyneth Paltrow in Sliding Doors, you only observe one state of the world. So the key to causal inference is to reconstruct the untreated state of the world. Athey et al. in their paper show how matrix completion can be used to estimate this unobserved counterfactual world. You can treat the unobserved (untreated) states of the treated units as missing and use a penalized SVD to reconstruct these from the rest of the dataset. If you are familiar with the econometric literature on synthetic controls, fixed effects, or unconfoundedness you should definitely read the paper; it shows these as special cases of matrix completion with the missing data of a specific form. Actually, you should read the paper anyway. Most of it is quite approachable and it’s very insightful.

• David MacKay’s Information Theory, Inference, and Learning Algorithms, in addition to being very well written and insightful, has exercises that read like a book of puzzles. Here’s one I came across in chapter 2:

• Hat tip to @mkessler_DC for the clickbaitey title.

• I’ve been reading Efron & Hastie’s Computer Age Statistical Inference (CASI) in my downtime. Actually, I’m doing better than reading. I don’t know why I didn’t think of this earlier - the best way to truly understand the material is to have your favourite statistical package open and actually play around with the examples as you go.

• I have been slowly working my way through Efron & Hastie’s Computer Age Statistical Inference (CASI). An electronic copy is freely available, and so far it has been great, though at times I get lost in the derivations.

• This post gives the Fader and Hardie (2005) model the full Bayesian treatment. You can check out the notebook here.

• In chapter 2 of BDA3, the authors provide an example where they regularize the cancer rates in counties in the US using an empirical Bayesian model. In this post, I repeat the exercise using county level data on suicides using firearms and other means.

• Anyone else feel that US mass shootings have increased over the past few years? My wife thinks that it’s just the availability heuristic at play. Well, luckily there is data out there that we can use to test it. The analysis in this post uses the dataset from Mother Jones. I did some minor cleaning that you can see in the notebook.

• I did a quick intro to Gaussian processes a little while back. Check that out if you haven’t.

• This is an implementation of SGDR based on this paper by Loshchilov and Hutter. Cosine annealing, which handles the learning rate (LR) decay, is now built into PyTorch, but the restart schedule, and with it the decay rate update, is not (though PyTorch 0.4 came out yesterday and I haven’t played with it yet). The notebook that generates the figures in this post can be found here.

• This post is an intro to Gaussian Processes.

• I recently had to create a bunch of maps for work. I did a bunch in d3.js a while back for India for CEA’s office and some (in non-interactive form) were included in the Indian Economic Survey.

• I imagine most of you have some idea of Monte Carlo (MC) methods. Here we’ll try and quantify it a little bit.

• Check out part 1 and part 2. Let’s start off by writing the code for the Metropolis algorithm and comparing it to Simulated Annealing.

• A lot of this material is from Larry Wasserman’s All of Statistics. I love how the title makes such a bold claim and then quickly hedges by adding the subtitle “A Concise Course in Statistical Inference” (the italics are mine).

• If you didn’t see Part 1, check that out first.

• I was going to dive straight into it but thought I should go over Simulated Annealing (SA) first before connecting them. SA is a heuristic optimization algorithm to find the global minimum of some complex function $f(X)$ which may have a bunch of local ones. Note that $X$ can be a vector of length $N$: $X = [x_1, x_2, \ldots, x_N]$
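
To make the SA description above a little more concrete, here is a minimal sketch in Python. It is not the code from the post: the function name `simulated_annealing`, the Gaussian proposal, the geometric cooling schedule and all default parameter values are assumptions chosen for illustration. The idea is to propose a random move, always accept it if $f$ goes down, and accept an uphill move with probability $e^{-\Delta f / T}$, where the temperature $T$ shrinks over time.

```python
import math
import random


def simulated_annealing(f, x0, n_steps=5000, t_start=1.0, t_end=1e-3,
                        step_size=0.5, seed=0):
    """Minimise f over vectors of the same length as x0 (illustrative sketch)."""
    rng = random.Random(seed)
    x = list(x0)
    fx = f(x)
    best_x, best_fx = list(x), fx
    for k in range(n_steps):
        # Assumed schedule: geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (k / (n_steps - 1))
        # Assumed proposal: independent Gaussian jitter on every coordinate.
        candidate = [xi + rng.gauss(0.0, step_size) for xi in x]
        f_cand = f(candidate)
        delta = f_cand - fx
        # Downhill moves are always accepted; uphill moves with prob exp(-delta / t).
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            x, fx = candidate, f_cand
            if fx < best_fx:
                best_x, best_fx = list(x), fx
    return best_x, best_fx


def bumpy(x):
    """A hypothetical 1-D test function with several local minima."""
    return x[0] ** 2 + 10.0 * math.sin(3.0 * x[0])


best_x, best_fx = simulated_annealing(bumpy, [4.0])
print(best_x, best_fx)
```

Early on, when $T$ is high, the sampler wanders freely and can climb out of local minima; as $T$ drops it behaves more and more like greedy descent. Tracking the best point seen so far is a convenience of this sketch, not a required part of the algorithm.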