This post gives the Fader and Hardie (2005) model the full Bayesian treatment. You can check out the notebook here.
Here’s the first paragraph from the introduction introducing the problem:
Faced with a database containing information on the frequency and timing of transactions for a list of customers, it is natural to try to make forecasts about future purchasing. These projections often range from aggregate sales trajectories (e.g., for the next 52 weeks), to individual-level conditional expectations (i.e., the best guess about a particular customer’s future purchasing, given information about his past behavior). Many other related issues may arise from a customer-level database, but these are typical of the questions that a manager should initially try to address. This is particularly true for any firm with serious interest in tracking and managing “customer lifetime value” (CLV) on a systematic basis. There is a great deal of interest, among marketing practitioners and academics alike, in developing models to accomplish these tasks.
They construct a beta-geometric model for the number of repeat transactions made by each customer.
The model
All this is in the paper so I’ll go over it quickly. Here are the modeling assumptions:

While active, customer $i$’s transactions follow a Poisson process with rate $\lambda_i$, so the time between transactions is exponentially distributed:\
\[t_j - t_{j-1} \sim \text{Exponential}(\lambda_i)\] 
Each customer has her own $\lambda$, but it follows a gamma distribution\
\[\lambda \sim \text{Gamma}(r, \alpha)\] 
After any transaction (a key difference from the Pareto/NBD model, where dropout can happen at any time), customer $i$ can go inactive with probability $p_i$. So
\[P(\text{i is inactive after j transactions}) = p_i(1 - p_i)^{j-1}\] 
Each customer has her own $p$ but it follows a beta distribution\
\[p \sim Beta(a, b)\]
Likelihood
F&H derive the following likelihood (eq. 3 in their paper):
\[L(\lambda, p \mid X=x, T) = (1 - p)^x \lambda^x e^{-\lambda T} + \delta_{x>0}\, p(1-p)^{x-1} \lambda^x e^{-\lambda t_x}\]If you try to implement this, you’ll quickly run into numerical issues. So let’s clean this up a bit:
\[\begin{aligned} L(\lambda, p \mid X=x, T) &= (1 - p)^x \lambda^x e^{-\lambda T} + \delta_{x>0}\, p(1-p)^{x-1} \lambda^x e^{-\lambda t_x}\\ &= (1 - p)^x \lambda^x e^{-\lambda t_x} \left(e^{-\lambda(T - t_x)} + \delta_{x>0} \frac{p}{1-p}\right)\\ \end{aligned}\]Taking logs to calculate the log-likelihood:
\[l(\lambda, p \mid X=x, T) = x \log(1 - p) + x \log(\lambda) - \lambda t_x + \log\left(e^{-\lambda (T - t_x)} + \delta_{x>0} e^{\log\frac{p}{1-p}}\right)\]Now that last term can be written using ‘logsumexp’ in Theano to get around the numerical issues. Here’s an explanation for how it works. NumPy also has an implementation of this function.
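Here’s what that stabilized log-likelihood looks like as a NumPy sketch (the notebook writes the same expression with Theano tensors so PyMC3 can differentiate it; the function name here is mine):

```python
import numpy as np
from scipy.special import logsumexp

def bg_nbd_loglik(lam, p, x, t_x, T):
    """log L(lambda, p | X=x, T) for a single customer, per the derivation above."""
    # log(p / (1-p)) for the dropout term; -inf switches it off when x == 0,
    # which plays the role of the delta_{x>0} indicator inside logsumexp
    drop = np.log(p) - np.log1p(-p) if x > 0 else -np.inf
    return (x * np.log1p(-p) + x * np.log(lam) - lam * t_x
            + logsumexp([-lam * (T - t_x), drop]))
```

`np.log1p(-p)` is used instead of `np.log(1 - p)` for the same numerical-stability reasons.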
PyMC3 model
You can get the data from here. I couldn’t find their test set online (for weeks 39–78). Let me know if you find it.
Load the data
Setup and run model
We need to write our own custom density using the equation above. Since this is PyMC3, we’re working with Theano tensors.
and let’s run it:
Hyperparameters
Remember that each customer $i$ has her own $p_i$ and $\lambda_i$. So let’s look at what the hyperparameters look like:
The mean value for each of these hyperparameters is what Fader and Hardie get but we have the entire distribution.
Model Checks
You should run the Gelman-Rubin and Geweke diagnostics at the least to make sure the model has converged. I won’t do it here but it’s in the notebook.
Posterior predictive checks
We can be pretty confident in this model, but it’s a good habit to do some posterior predictive checks, so let’s do them anyway.
Usually we’d just run pm.sample_ppc
and get some posterior predictive samples, but that won’t work for us since we have a custom likelihood. Fader and Hardie derive $E(X(t)\vert\lambda,p)$ in equation 7. We can just use that formula to calculate the posterior predictives and see how well the model matches the observed data.
Note that we factor out $\frac{1}{p}$, which allows us to use np.expm1 to avoid any overflow/underflow numerical issues. Theano and PyMC3 also provide this function in case you need it while constructing a model.
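With $\frac{1}{p}$ factored out, the conditional expectation $E(X(t)\vert\lambda,p) = \frac{1}{p}\left(1 - e^{-\lambda p t}\right)$ can be computed like this (the function name is mine):

```python
import numpy as np

def expected_transactions(lam, p, t):
    # (1/p) * (1 - exp(-lam * p * t)), written with expm1 for stability:
    # expm1(z) = exp(z) - 1 computed accurately for small z
    return -np.expm1(-lam * p * t) / p
```

As a sanity check, for large $t$ this approaches $\frac{1}{p}$ (the expected total number of transactions before dropout), and for small $t$ it behaves like $\lambda t$.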
Let’s see what this looks like. You can run this code multiple times to get a different random set of 16 customers and convince yourself that the model looks good:
This gives us the following plots.
Looks pretty good to me, though there seems to be some shrinkage for the large values. The data points (the line) are in the high-probability region of the distribution of $E(X(t)\vert\lambda, p)$.
Why Bayesian
The mean values for $r, \alpha, a, b$ are around what Fader and Hardie find in their paper. So why bother setting up a full Bayesian model? The main advantage is that we now have a distribution, so we can add credible intervals when reporting the expected number of transactions for a customer. Knowing the entire distribution also allows a business to design more effective targeting.
The cost is speed. Fader and Hardie use Excel’s optimizer to maximize the likelihood; in Python that would take you seconds, and you have your pick of the latest optimizers. Running MCMC to draw samples from the posterior is slow. One way around this is to use variational inference to get approximations (check out the end of the notebook).
There are also other ways to get around this problem. You could use a “coreset”. Or you could assume the hyperparameters don’t change, do a forward pass with new $x$ and $t_x$, and rerun the full model only once every few months.