
Model Checks, Model Selection, Multivariate Distributions

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import preliz as pz
from tqdm.auto import tqdm
import pymc as pm
plt.rcParams["figure.figsize"] = (15,3)
plt.style.use('ggplot')
np.random.seed(1337) # for consistency

Exercise 1: Negative Binomial Distribution Derivation

In this exercise you will reproduce the derivation of the negative binomial distribution.

a)

The negative binomial distribution gives the probability distribution

p(k|r,p)=\binom{k+r-1}{k}(1-p)^{k}\,p^{r}

over how many failures k are needed before r successes are reached in a binomial experiment, with expectation and variance

E[k|r,p]=\frac{r(1-p)}{p}, \quad Var[k|r,p]=\frac{r(1-p)}{p^{2}}.

Show that by setting E[k|r,p] ≡ λ the probability distribution becomes

p(k|r,\lambda)=\frac{(r+k-1)!}{(r-1)!\,(r+\lambda)^{k}}\,\frac{\lambda^{k}}{k!}\,\frac{1}{(1+\frac{\lambda}{r})^{r}},

with Var[k|r,\lambda]=\lambda(1+\frac{\lambda}{r}).

Hint: Use the definition of the binomial coefficient via factorials.

No solution is provided; check for yourself whether you arrive at the correct result (see lecture slides).
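As a starting point (not the full derivation): solving E[k|r,p] = λ for p gives

p = \frac{r}{r+\lambda}, \qquad 1-p = \frac{\lambda}{r+\lambda},

which can be substituted into p(k|r,p) together with the factorial form of the binomial coefficient, \binom{k+r-1}{k} = \frac{(k+r-1)!}{k!\,(r-1)!}.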

b)

Show that

\lim_{r\rightarrow\infty}p(k|r,\lambda)=\text{Pois}(k|\lambda)

Hint: See lecture slides.

The Poisson distribution is

\text{Pois}(k|\lambda) = \frac{\lambda^k \, e^{-\lambda}}{k!}

For convergence to the Poisson distribution as r → ∞, the following properties need to be shown (a quick numerical check follows the list):

  • \lim_{r\to\infty}\left(1+\frac{\lambda}{r}\right)^r = e^{\lambda} (a known property of the exponential function, see e.g. Wikipedia)

  • \frac{(r+k-1)!}{(r-1)!} = \frac{1 \cdot 2 \cdots (r-1) \cdot r \cdot (r+1) \cdots (r+k-1)}{1 \cdot 2 \cdots (r-1)} = r \cdot (r+1) \cdots (r+k-1)

  • \lim_{r\to\infty} \frac{r \cdot (r+1) \cdots (r+k-1)}{(r+\lambda)^k} = 1, since numerator and denominator are both polynomials of degree k in r with leading term r^k
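The limit can also be checked numerically. Below is a minimal sketch (with an assumed example value λ = 4), using the np and stats imports from the top of the notebook; note that scipy's nbinom is parameterized by the number of successes n = r and the success probability p, which equals r/(r+λ) here (see part a).

lam = 4.0
k = np.arange(15)
for r in [2, 10, 100, 10_000]:
    p = r / (r + lam)                  # success probability from E[k|r,p] = λ
    nb = stats.nbinom.pmf(k, r, p)     # negative binomial pmf over failures k
    po = stats.poisson.pmf(k, lam)     # Poisson pmf with the same mean
    print(f"r={r:>6}: max |NegBin - Pois| = {np.abs(nb - po).max():.5f}")

The maximal pointwise difference shrinks towards zero as r grows.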

c) Is it a conjugate family?

The Poisson likelihood forms a conjugate family with a Gamma prior distribution. Does the negative binomial likelihood also form a conjugate family with a Gamma prior?

Negative binomial likelihood (we use α instead of r because r is already used in the prior); as a function of λ, its kernel is:

k \sim \text{NegBin}(\alpha, \lambda) \propto \frac{\lambda^k}{\left(1+\frac{\lambda}{\alpha}\right)^{k+\alpha}}

Gamma prior:

\lambda \sim \text{Gamma}(s, r) \propto \lambda^{s-1} \, e^{-r\lambda}

Resulting posterior:

p(\lambda|k,\alpha) \propto p(k|\lambda,\alpha) \, p(\lambda) \propto \frac{\lambda^k}{\left(1+\frac{\lambda}{\alpha}\right)^{k+\alpha}} \, \lambda^{s-1} \, e^{-r\lambda} = \frac{\lambda^{s+k-1} \, e^{-r\lambda}}{\left(1+\frac{\lambda}{\alpha}\right)^{k+\alpha}}

This is not the kernel of a Gamma distribution, and consequently the negative binomial likelihood and the Gamma prior are not conjugate. We therefore need to resort to MCMC (or another numerical method) to get samples from the posterior.
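Since λ is only one-dimensional here, a cheap alternative to MCMC is to normalize the posterior kernel numerically on a grid. A minimal sketch with assumed example values (one observed count k = 7, fixed dispersion α = 5, Gamma prior with s = 2, r = 0.5):

k, alpha = 7, 5.0                      # observed count and fixed dispersion (assumed)
s, r = 2.0, 0.5                        # Gamma prior parameters (assumed)
lam = np.linspace(1e-6, 40, 2000)      # grid over λ
kernel = lam**(s + k - 1) * np.exp(-r * lam) / (1 + lam / alpha)**(k + alpha)
post = kernel / np.trapezoid(kernel, lam)   # normalize numerically (np.trapz on NumPy < 2.0)
print("posterior mean of λ ≈", np.trapezoid(lam * post, lam))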

Exercise 2: More Toilet Paper Data

After your preliminary A/B study (previous exercise sheet), you collected a whole year of weekly data (toilet paper weight in kg) in order to give more precise predictions to the facility management so that they can plan their (reduced) budget better. The data are stored in toilet_paper_data.npy (you may load them with np.load()).

y_obs = np.load( "toilet_paper_data.npy" )
y_obs
array([165.5, 138.7, 139.7, 133.1, 156.4, 118.4, 166.9, 136.9, 149.8, 143. , 163.5, 121.3, 142.1, 141.4, 159.6, 132.8, 143.9, 135.5, 146.5, 153. , 132.8, 159.7, 156.8, 152. , 156.8, 137.8, 144.5, 134.8, 142.8, 152.4, 137.7, 141.2, 137.8, 135.9, 137.9, 145.8, 132.6, 148.8, 165.9, 154.9, 220. , 135.3, 137. , 166.3, 146.6, 138.4, 148.3, 171.2, 147.4, 153.4, 149.6, 141.8, 132.3])

a) Simulate the posterior distribution

Like last time, you use a Normal likelihood y ~ N(μ, σ²) and assume weak empirical Bayes priors for μ and σ: μ ~ N(\overline{y}, \hat{\sigma}) and σ ~ Exp(1/\hat{\sigma}), where \overline{y} is the empirical mean and \hat{\sigma} the empirical standard deviation of your data y.

Simulate the posterior distribution P(μ, σ|y) with PyMC and visualize the marginal distributions P(μ|y) and P(σ|y). What are the 90% highest density intervals for μ and σ?

Empirical mean and standard deviation:

ybar = np.mean(y_obs)
sigmahat = np.std(y_obs, ddof=1)

Simulation:

with pm.Model() as tp_model_normal:
    μ = pm.Normal( 'μ', mu=ybar, sigma=sigmahat )
    σ = pm.Exponential( 'σ', lam=1/sigmahat )
    y = pm.Normal( 'y', mu=μ, sigma=σ, observed=y_obs )
    trace_normal = pm.sample( 1000 )

Marginal distributions and HDIs:

pm.plot_posterior( trace_normal, hdi_prob=0.9 )
[Figure: marginal posteriors of μ and σ with 90% HDIs]
pm.summary( trace_normal, hdi_prob=0.9 )

With 90% plausibility, μ is between roughly 143 and 150 and σ is between roughly 13 and 18 (values may change slightly when running the simulation with a different seed). After a full year of data, there is quite some (aleatoric) uncertainty remaining!

b) Visualize the Joint Distribution

Use pm.plot_pair() to also visualize the full joint distribution in a contour plot. Does the posterior look more or less (multivariate) normally distributed? Use np.cov() to compute the covariance matrix of the multivariate normal distribution. Are there strong correlations between μ and σ in the posterior? Hint: np.cov() requires a numpy array with each variable (samples for μ and samples for σ) as a column.

pm.plot_pair( trace_normal, kind="kde", figsize=(5,5), marginals=True )
[Figure: joint posterior of μ and σ (KDE contour) with marginal distributions]

Looks normally distributed! Note that the variances are not equal; you would actually see an ellipse if you turned on an equal aspect ratio.

data = np.hstack( [trace_normal.posterior.μ.values.flatten().reshape(-1,1), trace_normal.posterior.σ.values.flatten().reshape(-1,1)] )
C = np.cov( data.T )
C
array([[ 4.5756726 , -0.17370283],
       [-0.17370283,  2.51392661]])

The covariance matrix is almost diagonal! That means there are almost no correlations between μ and σ in the posterior either!

Correlation matrix:

stdevs = np.sqrt(np.diag(C))
C / np.outer(stdevs, stdevs)
array([[ 1.       , -0.0512157],
       [-0.0512157,  1.       ]])

c) Compute samples from the predictive distribution

Compute samples from the predictive distribution using PyMC and use them to compute the root mean squared error (RMSE) and mean absolute error (MAE) averaged over your data and your predictive distribution as done in the lecture.
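For reference (assuming the lecture's definitions), with S predictive draws \tilde{y}_i^{(s)} for each of the N observations y_i:

RMSE = \sqrt{\frac{1}{S N}\sum_{s=1}^{S}\sum_{i=1}^{N}\left(\tilde{y}_i^{(s)} - y_i\right)^2}, \qquad MAE = \frac{1}{S N}\sum_{s=1}^{S}\sum_{i=1}^{N}\left|\tilde{y}_i^{(s)} - y_i\right|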

with tp_model_normal:
    ppc_normal = pm.sample_posterior_predictive(trace_normal)

Reshape predictions for broadcasting with NumPy:

ypred_normal = ppc_normal.posterior_predictive.y.values.reshape(-1,53)
ypred_normal.shape
(4000, 53)

Compute differences:

eps_normal = ypred_normal - y_obs

RMSE (in kg):

np.sqrt( np.mean( eps_normal**2 ) )
np.float64(22.001847895618823)

MAE (in kg):

np.mean( np.abs( eps_normal ) )
np.float64(16.756911889753187)

RMSE and MAE are quite different; since the RMSE penalizes large deviations quadratically, this could indicate the presence of outliers in the data.

d) Perform a Posterior Predictive Check

You are not sure whether using the normal distribution as likelihood (or sampling probability) was appropriate. To this end, perform a posterior predictive check as done in the lecture. Are the observed data within the range of typical predictions by the model? If not, what tendencies do you observe? What might be the reason for the discrepancy?

pm.plot_ppc( ppc_normal )
[Figure: posterior predictive check for the normal model]

The predictions (red) closely follow a normal distribution. However, the center of the observed data distribution is more to the left! There is an outlier present that can be quickly found (e.g. with a strip plot from seaborn):

sns.stripplot( y_obs )
[Figure: strip plot of y_obs showing the outlier around 220 kg]

This outlier pulls the estimates a bit towards the right. It has not much influence on the estimated mean, but quite some on the estimated standard deviation:

np.std( y_obs, ddof=1 ), np.std( y_obs[y_obs<200], ddof=1 )
(np.float64(15.44934205108168), np.float64(11.683511702453055))

Exercise 3: Robust Estimation with Student’s T Distribution

(This exercise is a continuation of Exercise 2.)

After asking back, the facility management tells you that there was a large conference with many external people in the building on one day (where you have the outlier in your data). You consider two different alternatives:

  1. Remove the outlier and re-use the same normal model.

  2. Keep the outlier and use a model that is capable of taking outliers into account.

Bayesians are often willing to be subjective in their choice of model, but prefer to stay objective with their data. Modifications of the data are always dangerous, since they are more difficult to document and reproduce than changes to the model.

a) Visualize the Student’s t distribution

You choose a Student's t distribution as likelihood, since it is known to be more robust against outliers than the normal distribution. In addition to the mean μ and standard deviation σ, the Student's t distribution also has a parameter ν. In statistical t-tests, ν typically has a non-trivial interpretation as the number of degrees of freedom. Here we take a simpler path and use ν to relate the Student's t distribution to the normal distribution.

Use PreliZ visualization functionality to compare different Student's t distributions with the normal distribution. In particular, set μ = 0 and σ = 1 and use ν ∈ {2, 3, 5, 10, 20}. For what limit of ν is the Student's t distribution equal to the normal distribution? In what way does the Student's t distribution differ from the normal distribution? In particular, also observe the tails of the different distributions.

Hint: Use the functions pz.StudentT(mu, sigma, nu).plot_pdf() and pz.Normal(mu, sigma).plot_pdf() and make sure the figure has an appropriate size (e.g. figsize=(8,6)) and region of interest (e.g. plt.xlim(-4,4)) so that the differences are clearly visible.

plt.figure( figsize=(8,6) )
for nu in [2,3,5,10,20]:
    pz.StudentT( mu=0, sigma=1, nu=nu ).plot_pdf()
pz.Normal( mu=0, sigma=1 ).plot_pdf( color="black" )
plt.xlim([-4,4])
[Figure: Student's t PDFs for ν ∈ {2, 3, 5, 10, 20} together with the normal PDF (black)]

The Student's t distribution converges towards the normal distribution for large values of ν (in theory for ν → ∞). For smaller values of ν, the Student's t distribution has fatter tails than the normal distribution.
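To quantify the fatter tails, a quick check (not required by the exercise) of the probability mass outside ±3, using scipy's stats.t and stats.norm:

for nu in [2, 3, 5, 10, 20]:
    print(f"ν={nu:>2}: P(|x|>3) = {2 * stats.t.sf(3, df=nu):.4f}")   # two-sided tail mass
print(f"normal: P(|x|>3) = {2 * stats.norm.sf(3):.4f}")

For ν = 2 roughly 10% of the mass lies beyond ±3, compared to about 0.3% for the normal distribution.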

b) Modeling with Student’s t distribution

Repeat your modelling efforts from Exercise 2 using a Student's t distribution as likelihood instead of a normal distribution. To this end, use a Gamma(s=10, r=1.1) prior distribution for ν (which you should visualize beforehand). What are the 90% highest density intervals for μ, σ and ν? Judging from ν, do you expect the resulting Student's t likelihood to have much fatter tails than the previously used normal likelihood?

Prior for ν:

pz.Gamma(10,1.1).plot_pdf()
[Figure: Gamma(10, 1.1) prior PDF for ν]

Simulation:

with pm.Model() as tp_model_studentt:
    μ = pm.Normal( 'μ', mu=ybar, sigma=sigmahat )
    σ = pm.Exponential( 'σ', lam=1/sigmahat )
    ν = pm.Gamma( 'ν', alpha=10, beta=1.1 )
    y = pm.StudentT( 'y', mu=μ, sigma=σ, nu=ν, observed=y_obs )
    trace_studentt = pm.sample( 1000 )
pm.plot_posterior( trace_studentt, figsize=(15,3), hdi_prob=0.9 )
[Figure: marginal posteriors of μ, σ and ν with 90% HDIs]

There is slightly less uncertainty in the mean μ and considerably less uncertainty in σ (which was, by definition, heavily influenced by the outlier)! With 90% plausibility, ν is between 4 and 11, a range in which the Student's t distribution still has considerably fatter tails than the normal distribution:

plt.figure( figsize=(10,5) )
for nu in [4,6,8,11]:
    pz.StudentT( mu=0, sigma=1, nu=nu ).plot_pdf()
pz.Normal( mu=0, sigma=1 ).plot_pdf( color="black" )
plt.xlim([-4,4])
[Figure: Student's t PDFs for ν ∈ {4, 6, 8, 11} together with the normal PDF (black)]

The fatter tails allow the model to accommodate the outlier there, while the normal distribution has to bring the outlier closer to its center.

c) Posterior predictive checks

Repeat the posterior predictive check that you did in Exercise 2 for your new model and compare it to the old one with the normal likelihood. How do they differ?

with tp_model_studentt:
    ppc_studentt = pm.sample_posterior_predictive(trace_studentt)
fig, ax = plt.subplots(1, 2, sharey=True )
plt.sca( ax[0] )
pm.plot_ppc( ppc_normal, ax=ax[0] )
plt.xlim(80, 230)
plt.title("Normal likelihood")
plt.sca( ax[1] )
pm.plot_ppc( ppc_studentt, ax=ax[1] )
plt.xlim(80, 230)
plt.title("Student's t likelihood")
[Figure: posterior predictive checks side by side, normal vs. Student's t likelihood]

The Student's t likelihood seems to make predictions quite a bit closer to the data distribution!

d) Compute RMSE and MAE

Compute RMSE and MAE again. What are the differences to Exercise 2, where a normally distributed likelihood was assumed? Were you expecting smaller or larger values compared to Exercise 2? What could be the reason that RMSE and MAE are not as much smaller as you would wish them to be?

ypred_studentt = ppc_studentt.posterior_predictive.y.values.reshape(-1,53)
eps_studentt = ypred_studentt - y_obs

RMSE:

np.sqrt( np.mean( eps_studentt**2 ) )
np.float64(20.834522351330023)

MAE:

np.mean( np.abs( eps_studentt ) )
np.float64(15.390502135330104)

RMSE and MAE are now of almost equal size. However, they are only slightly smaller than before. Choosing a more complex distribution (3 instead of 2 parameters) comes with more uncertainty in the parameters and consequently more predictive uncertainty for the same amount of data. Even though we have reduced the bias of the model (the 'wrong' normal likelihood assumption), we have increased its variance by introducing an additional parameter.

e) Compare models

Compare both models (normal vs Student's t likelihood) in terms of expected log predictive density (ELPD). Can you clearly favour one model over the other?

with tp_model_normal:
    pm.compute_log_likelihood(trace_normal)
loo_normal = pm.loo( trace_normal )
(pm.loo() issues a warning here about the reliability of its estimate; given the outlier in the data, this warning is justified!)

with tp_model_studentt:
    pm.compute_log_likelihood(trace_studentt)
loo_studentt = pm.loo( trace_studentt )
df_comp_loo = pm.compare({'Normal Likelihood': loo_normal, 'Student\'s t Likelihood': loo_studentt})
df_comp_loo
pm.plot_compare(df_comp_loo);
[Figure: ELPD model comparison plot]

In terms of ELPD, the model with the Student's t likelihood is not significantly better (by the one-standard-error rule) than the model with the normal likelihood. This would probably change with more data; it is due to the additional uncertainty (e.g. in the estimation of ν).

Exercise 4: Railway Switch Maintenance

For resource planning (personnel and pre-stored material), SBB wants to assess how often their railway switches turn up defective and have to be maintained. To this end, they collected weekly counts of defective railway switches over 10 years (fictitious data, stored in railway_switch_data.npy).

In this exercise you want to assess whether it is better to use a Poisson or a negative binomial likelihood for your Bayesian model. Since you know that models with more parameters typically pay a price in terms of additional uncertainty, you proceed very carefully.

y_obs = np.load( "railway_switch_data.npy" )

a) Run posterior simulations

Run posterior simulations for two models:

  1. A model with Poisson likelihood and a Normal(\overline{y}, \hat{\sigma}) prior on λ, where \overline{y} is the empirical data mean and \hat{\sigma} the empirical standard deviation.

  2. A model with negative binomial likelihood, with the same prior on λ and a Gamma(2, 0.1) prior on r.

Quickly visualize the priors before running the simulations. Do they make sense to you? Visualize the marginal posterior distributions of the involved parameters and assess their uncertainty. Hint: PyMC does not use exactly the same parameter names for the prior distributions as used here and in the lecture. Make sure you find the correct mappings.

Poisson likelihood model:

pz.Normal( mu=np.mean(y_obs), sigma=np.std(y_obs) ).plot_pdf()
[Figure: Normal prior PDF for λ]
with pm.Model() as poisson_model:
    lbd = pm.Normal( 'lbd', mu=np.mean(y_obs), sigma=np.std(y_obs) )
    y = pm.Poisson( 'y', mu=lbd, observed=y_obs )
    trace_poisson = pm.sample( 1000, chains=4 )
pm.plot_posterior( trace_poisson )
[Figure: marginal posterior of lbd]

Negative binomial likelihood model:

pz.Gamma( alpha=2, beta=0.1 ).plot_pdf()
[Figure: Gamma(2, 0.1) prior PDF for r]
with pm.Model() as negbin_model:
    lbd = pm.Normal( 'lbd', mu=np.mean(y_obs), sigma=np.std(y_obs) )
    r = pm.Gamma( 'alpha', alpha=2, beta=0.1 )
    y = pm.NegativeBinomial( 'y', mu=lbd, alpha=r, observed=y_obs )
    trace_negbin = pm.sample( 1000, chains=4 )
pm.plot_posterior( trace_negbin )
[Figure: marginal posteriors of lbd and alpha]
pm.plot_pair( trace_negbin, kind="kde", figsize=(6,6) )
[Figure: joint posterior of lbd and alpha (KDE contour)]

Again the joint posterior distribution looks normally distributed!

b) Perform a posterior predictive check

Perform a posterior predictive check for both models and plot them side by side. Which model do you prefer after the posterior predictive check?

Posterior predictions:

with poisson_model:
    ppc_poisson = pm.sample_posterior_predictive(trace_poisson)

with negbin_model:
    ppc_negbin = pm.sample_posterior_predictive(trace_negbin)

Posterior predictive checks:

fig, ax = plt.subplots(1, 2, sharey=True )
plt.sca( ax[0] )
pm.plot_ppc( ppc_poisson, ax=ax[0] )
plt.title("Poisson likelihood")
plt.sca( ax[1] )
pm.plot_ppc( ppc_negbin, ax=ax[1] )
plt.title("Negative binomial likelihood")
[Figure: posterior predictive checks side by side, Poisson vs. negative binomial likelihood]

The negative binomial likelihood seems much better suited to model the data!

c) Compute Bayes factors

Compute the Bayes factor between the marginal likelihoods of the two models. Which model explains the data better according to your belief? Also try to compute the marginal likelihoods p(d|\mathcal{M}_{Pois}) and p(d|\mathcal{M}_{NegBin}) themselves. Would you have expected them to be so extremely small?

Hint: Use SMC as introduced in the lecture and the companion notebook.

with pm.Model() as poisson_model_smc:
    lbd = pm.Normal( 'lbd', mu=np.mean(y_obs), sigma=np.std(y_obs) )
    y = pm.Poisson( 'y', mu=lbd, observed=y_obs )
    trace_poisson_smc = pm.sample_smc( 2000, chains=4 )
with pm.Model() as negbin_model_smc:
    lbd = pm.Normal( 'lbd', mu=np.mean(y_obs), sigma=np.std(y_obs) )
    r = pm.Gamma( 'alpha', alpha=2, beta=0.1 )
    y = pm.NegativeBinomial( 'y', mu=lbd, alpha=r, observed=y_obs )
    trace_negbin_smc = pm.sample_smc( 1000, chains=4 )
lpd1 = trace_poisson_smc.sample_stats.log_marginal_likelihood.mean().values
lpd2 = trace_negbin_smc.sample_stats.log_marginal_likelihood.mean().values
BF = np.exp( lpd2 - lpd1 )
lpd1, lpd2, BF
(array(-1946.6099806), array(-1757.12056159), np.float64(1.968833791064198e+82))

It is beyond any discussion that the negative binomial model is better suited (BF ≈ 2·10^{82})!

The marginal probabilities are extremely small!

np.exp(lpd1)
np.float64(0.0)
np.exp(lpd2)
np.float64(0.0)

This is the reason why we work with log probabilities: to at least be able to evaluate the Bayes factor. The marginal likelihoods are so small because they are the product of the (individually small) probabilities of all data points under the model:

p(d|\mathcal{M}_i) = \prod_{k=1}^N p(y_k|\mathcal{M}_i)

Imagine each data point had on average a probability of 0.8 under the model (which would be high). Then the resulting marginal likelihood would be

0.8**len(y_obs)
4.342032859091343e-52

Extremely small values are to be expected!

d) Compare models

Compare the models in terms of ELPD. Is there a clear winner when involving the one-standard-error rule?

with poisson_model:
    pm.compute_log_likelihood(trace_poisson)
loo_poisson = pm.loo( trace_poisson )
with negbin_model:
    pm.compute_log_likelihood(trace_negbin)
loo_negbin = pm.loo( trace_negbin )
df_comp_loo = pm.compare({'Poisson Likelihood': loo_poisson, 'Negative Binomial Likelihood': loo_negbin})
df_comp_loo
pm.plot_compare(df_comp_loo);
[Figure: ELPD model comparison plot]

Clearly, the negative binomial model is to be recommended! This time we have enough data to reduce the uncertainty introduced by the additional parameter.

e) Compute predictive RMSE and MAE

Compute predictive RMSE and MAE for both models and compare them.

ypred_poisson = ppc_poisson.posterior_predictive.y.values.reshape(-1,530)
ypred_negbin = ppc_negbin.posterior_predictive.y.values.reshape(-1,530)
eps_poisson = ypred_poisson - y_obs
eps_negbin = ypred_negbin - y_obs

RMSE:

rmse_poisson = np.sqrt( np.mean( eps_poisson**2 ) )
rmse_negbin = np.sqrt( np.mean( eps_negbin**2 ) )
rmse_poisson, rmse_negbin
(np.float64(7.9643295267913645), np.float64(9.658274558782754))

MAE:

mae_poisson = np.mean( np.abs( eps_poisson ) )
mae_negbin = np.mean( np.abs( eps_negbin ) )
mae_poisson, mae_negbin
(np.float64(6.296092924528302), np.float64(7.595762735849057))

Both negative binomial RMSE and MAE are actually a bit larger! The negative binomial predictive distribution is wider, because it accounts for the overdispersion in the data; this increases the average distance between predictive draws and observations, even though the model is clearly preferred by the posterior predictive check and the ELPD comparison.
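One way to see this is the standard decomposition of the expected squared error of a predictive draw \tilde{y} against a fixed observation y:

E[(\tilde{y} - y)^2] = \text{Var}[\tilde{y}] + (E[\tilde{y}] - y)^2

The wider negative binomial predictive distribution increases the first term, which the RMSE (and, similarly, the MAE) picks up.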

Exercise 5: Estimating Churn Proportions

You work at a national health insurance company. Recently, significantly more customers than usual have terminated their insurance policies with your company and changed to another insurer. In insurance jargon, the number of customers terminating their contract is called churn. The management has ordered an investigation that was carried out by the sales department: they tried to contact 200 customers that had left and asked them about the reason why they had terminated their contract. They could actually reach 150 customers, and out of these, 112 gave a suitable answer. Finally, the sales department has put the different churn reasons into three main categories:

  1. premium ('Prämie') too high (k_1 = 82 cases)

  2. insufficient coverage (k_2 = 24 cases)

  3. unsuitable insurance models (k_3 = 6 cases)

As a data scientist, you are aware that this is a restricted sample and decide to use a Bayesian model with multinomial likelihood and flat Dirichlet prior to model the true underlying proportions π₁, π₂ and π₃ of your entire insurant population, including uncertainty.

a) Simulate the posterior

Simulate the posterior with PyMC and visualize the marginal posterior distributions of π₁, π₂ and π₃. What 90% HDIs can you communicate back to the management? Formulate a sentence.

observed_counts = [82, 24, 6]

rng = np.random.default_rng(123)

with pm.Model() as churn_model:
    pi = pm.Dirichlet( 'pi', a=[1,1,1] )
    y = pm.Multinomial( 'y', n=np.sum(observed_counts), p=pi, observed=observed_counts )
    trace = pm.sample(1000, random_seed=rng)
pm.plot_posterior( trace, figsize=(15,2), hdi_prob=0.9 );
[Figure: marginal posteriors of π₁, π₂ and π₃]

I believe with 90% plausibility that for roughly 65-79% of the churned customers the premium was too high, for roughly 15-28% the coverage was insufficient, and roughly 2-10% found our insurance models unsuitable.

b) Use the Dirichlet-multinomial conjugacy

Use the Dirichlet-multinomial conjugacy and PreliZ to visualize your posterior on a probability simplex.
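Recall the conjugate update: with a \pi \sim \text{Dir}(\alpha) prior and multinomially distributed counts k, the posterior is

\pi | k \sim \text{Dir}(\alpha + k),

here \text{Dir}([1,1,1] + [82,24,6]) = \text{Dir}([83,25,7]).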

alpha = np.array([1,1,1])
pz.Dirichlet( alpha + observed_counts ).plot_pdf( marginals=False )
[Figure: Dirichlet(alpha=[83, 25, 7]) posterior on the probability simplex]

c) Switch to a different prior

Your superior is unsure about your results. She says: “It’s very typical that people switch contracts because of the premium. I would have used a Dirichlet ([10,1,1]) prior that reflects this.”

Use PreliZ to visualize this prior on a simplex as well as marginalized (to look at π₁ specifically), and compute the prior mean and standard deviation for π₁. Finally, re-run your simulation and show that your results are still pretty similar, i.e. that your posterior is quite insensitive to this choice of prior.

pz.Dirichlet( [10,1,1] ).plot_pdf( marginals=False )
[Figure: Dirichlet(alpha=[10, 1, 1]) prior on the probability simplex]
pz.Dirichlet( [10,1,1] ).plot_pdf()
[Figure: marginal PDFs of the Dirichlet(10, 1, 1) prior]
pz.Dirichlet( [10,1,1] ).summary()
Dirichlet(mean=array([0.83333333, 0.08333333, 0.08333333]), std=array([0.10336228, 0.07665552, 0.07665552]))
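These values match the known Dirichlet moments with \alpha_0 = \sum_i \alpha_i = 12:

E[\pi_i] = \frac{\alpha_i}{\alpha_0}, \qquad \text{Var}[\pi_i] = \frac{\alpha_i(\alpha_0-\alpha_i)}{\alpha_0^2(\alpha_0+1)},

e.g. E[\pi_1] = 10/12 \approx 0.83 and \text{std}[\pi_1] = \sqrt{10 \cdot 2/(12^2 \cdot 13)} \approx 0.10.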
with pm.Model() as churn_model:
    pi = pm.Dirichlet( 'pi', a=[10,1,1] )
    y = pm.Multinomial( 'y', n=np.sum(observed_counts), p=pi, observed=observed_counts )
    trace = pm.sample(1000)
pm.plot_posterior( trace, figsize=(15,2), hdi_prob=0.9 );
[Figure: marginal posteriors of π₁, π₂ and π₃ under the Dirichlet(10, 1, 1) prior]

The 90% HDIs change only by a few percentage points! This will not have a big impact on my communication.

Exercise 6: The Multivariate Normal Distribution

Even though the multivariate normal distribution plays a big role in Bayesian inference, it will not play a major role in the first part of this lecture, since the bambi library hides this complexity from us. However, the multivariate normal distribution will become more important when you work with Gaussian processes in the second half.

The following three exercises are meant to help you get to know the multivariate normal distribution a little better.

a) The multivariate normal distribution

Load the credit rating data (from credit_data.csv in the exercise materials) and select 'Income' as x and 'Rating' as y. Compute (classical) estimators for \mu=\begin{pmatrix}\mu_{x}\\ \mu_{y}\end{pmatrix} and the covariance matrix \Sigma. Visualize the resulting normal distribution N(\hat{\mu},\hat{\Sigma}) using a contour plot and overplot it with the data. Ponder (qualitatively) whether the normal distribution is a good model for these data.

Hints: You may use np.cov() to compute the covariance and code in the notebook ‘week5_multivariate_distributions.ipynb’ to produce the contour plot.

credit_data = pd.read_csv("data/credit_data.csv")
credit_data.head()

Compute means and covariance:

mu = credit_data[["Income", "Rating"]].mean().values
mu
array([ 45.218885, 354.94 ])
cov = np.cov( credit_data[["Income", "Rating"]].values.T, ddof=1 )
cov
array([[ 1242.15879093,  4315.49294045],
       [ 4315.49294045, 23939.56030075]])

Plot the contours and the data (the equal aspect ratio was removed to make the plot a bit more appealing):

# from notebook "week5_multivariate_distributions.ipynb"
def plot_mvnormal( mu, cov, xrange=[-3,3], yrange=[-3,3], ax=None ):
    
    # setup grid
    x = np.linspace(xrange[0], xrange[1], 1000)
    y = np.linspace(yrange[0], yrange[1], 1000)
    X, Y = np.meshgrid(x, y)

    # multivariate normal
    Z = stats.multivariate_normal( mean=mu, cov=cov ).pdf( np.dstack((X, Y)) )

    # create plot
    if ax is not None:
        plt.sca(ax)

    CS = plt.contour(X, Y, Z, levels=20)
    #plt.gca().set_aspect("equal");
xmin = 0
xmax = credit_data.Income.max() * 1.1
ymin = 0
ymax = credit_data.Rating.max() * 1.1
plot_mvnormal( mu, cov, xrange=[xmin, xmax], yrange=[ymin, ymax] )
credit_data.plot.scatter(x="Income", y="Rating", c="red", ax=plt.gca())
[Figure: fitted bivariate normal contours with the Income/Rating data overplotted]

b)

Show that if x_{1}\sim N(\mu_{1},\sigma_{1}^{2}) and x_{2}\sim N(\mu_{2},\sigma_{2}^{2}) are independent, then

x_{1}+x_{2}\sim N(\mu_{1}+\mu_{2},\sigma_{1}^{2}+\sigma_{2}^{2})

Hints: Compute E[x_{1}+x_{2}] and Var[x_{1}+x_{2}]; use linearity for the first, and Var[x_{1}+x_{2}]=Var[x_{1}]+Var[x_{2}] for independent x_{1} and x_{2} for the second.

E[x_1 + x_2] = E[x_1] + E[x_2] = \mu_1 + \mu_2

\text{Var}[x_1 + x_2] = \text{Var}[x_1] + \text{Var}[x_2] = \sigma_1^2 + \sigma_2^2

\Longrightarrow x_1 + x_2 \sim N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)

(That the sum of two independent normal random variables is again normally distributed follows e.g. by convolving the densities; the computation above identifies its parameters.)
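A quick Monte Carlo sanity check (a sketch with assumed example parameters μ₁ = 1, σ₁ = 2, μ₂ = -0.5, σ₂ = 1.5):

rng_check = np.random.default_rng(0)
x1 = rng_check.normal(1.0, 2.0, size=200_000)
x2 = rng_check.normal(-0.5, 1.5, size=200_000)
s = x1 + x2
print(s.mean(), s.var())   # expect ≈ 0.5 and ≈ 4 + 2.25 = 6.25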

c)

Let X\sim N(\mu,\Sigma) be a random vector. Show that AX+b\sim N(A\mu+b,\,A\Sigma A^{T}).

Hints: Compute E[AX+b] and use linearity. Compute Cov(AX+b), using Cov(AX+b)=Cov(AX) and \Sigma=Cov(X)=E[(X-E[X])(X-E[X])^{T}]=E[XX^{T}]-E[X]E[X]^{T}.

E[AX + b] = A\,E[X] + b = A\mu + b

\begin{align*}
\text{Cov}[AX + b] &= \text{Cov}[AX]\\
&= E[(AX)(AX)^T] - E[AX]\,E[AX]^T\\
&= A\,E[XX^T]A^T - A\,E[X]\,E[X]^T A^T\\
&= A\left(E[XX^T] - E[X]\,E[X]^T\right)A^T\\
&= A\,\Sigma\,A^T
\end{align*}
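Again a quick Monte Carlo sanity check, with assumed example values for μ, Σ, A and b:

rng_check = np.random.default_rng(1)
mu_x = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([1.0, -1.0])
X = rng_check.multivariate_normal(mu_x, Sigma, size=200_000)
Y = X @ A.T + b                 # affine transform applied to the samples
print(Y.mean(axis=0))           # expect ≈ Aμ + b = [3, 2]
print(np.cov(Y.T))              # expect ≈ A Σ Aᵀ
print(A @ Sigma @ A.T)          # = [[8, 7.5], [7.5, 9]]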