Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Bayesian Linear Regression

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import preliz as pz
import pymc as pm
import arviz as az
import bambi as bmb
plt.rcParams["figure.figsize"] = (8,3)
plt.style.use('ggplot')
np.random.seed(1337) # for consistency

Exercise 1: Modeling Credit Data with Simple Linear Regression

In this exercise, you will perform the basic steps of Bayesian linear regression. To this end, you will work with the credit dataset that you can load with Pandas from credit_data.csv.

credit_data = pd.read_csv("credit_data.csv")
credit_data.head()
Loading...

a)

You would like to predict the balance (current credit card debt in USD) of a person using only the limit set by the bank (in USD) as a predictor. Write down the structure of a linear regression model that achieves this using the mathematical notation from the lecture (priors, likelihood, etc.). Thereby you do not have to specify numeric values for the prior distribution parameters (but you may think about their sizes if you want).

β0N(θ0,τ02)β1N(θ1,τ12)σExp(1/l)(or something similar, as long as its only over positive values)yβ0,β1,σN(β0+β1x,σ2)\begin{align*} \beta_0 &\sim& N(\theta_0, \tau_0^2)\\ \beta_1 &\sim& N(\theta_1, \tau_1^2)\\ \sigma &\sim& \text{Exp}(1/l) \quad (\text{or something similar, as long as its only over positive values})\\ y|\beta_0, \beta_1, \sigma &\sim& N(\beta_0 + \beta_1 \, x, \sigma^2) \end{align*}

b) Fit a linear regression model with bambi

Fit a linear regression model with bambi, letting it automatically select default priors for you. Extract the model that it created both through backend.model (mathematical notation) and through graph() (graphical model). Does the model align with your own model in a)? If no, what are the differences?

credit_model1 = bmb.Model("Balance ~ Limit", data=credit_data, family="gaussian")
credit_trace1 = credit_model1.fit(draws=2000, tune=2000)
Loading...
Loading...
credit_model1.backend.model
Loading...
credit_model1.graph() # may lead to warnings or errors depending on what versions you use / what packages you have installed
Loading...

Bambi uses a half Student’s tt distribution and I proposed an exponential distribution above. Both only go over positive values and since they are weak there should not be a big difference in the resulting posterior.

c)

Return a summary of the posterior regression coefficient distributions using pm.summary() and plot them visually using pm.plot_posterior() (for both using 95% credible intervals). Write down the mean regression model as a formula. How can the coefficients be interpreted in terms of the effect of the associated predictors on credit card balance?

pm.summary( credit_trace1, hdi_prob=0.95 )
Loading...
pm.plot_posterior( credit_trace1, hdi_prob=0.95 )
array([<Axes: title={'center': 'sigma'}>, <Axes: title={'center': 'Intercept'}>, <Axes: title={'center': 'Limit'}>], dtype=object)
<Figure size 2760x345 with 3 Axes>

Mean model:

Balance=293+0.17Limit\text{Balance} = -293 + 0.17 \cdot \text{Limit}

Interpretation: For every USD by that the limit is increased, the customer has 17 cents more debts. However the model does probably not hold for small balances and limits: for a limit of 0, the balance would be -293 USD (intercept). This is probably due to non-linearity.

d)

Create a plot with data and regression line using bmb.interpret.plot_predictions(). Use a 95% HDI for the regression line uncertainty. Using the plot, give an interpretation why the recovered intercept is negative. Also, visually judge aleatoric and epistemic uncertainty.

Hint: Whether bmb.interpret.plot_predictions() plots an HDI of the posterior regression line distribution or of the predictive distribution can be controlled with the pps parameter.

credit_data.plot.scatter( x="Limit", y="Balance" )
bmb.interpret.plot_predictions( credit_model1, credit_trace1, "Limit", prob=0.95, ax=plt.gca() )
(<Figure size 800x300 with 1 Axes>, array([<Axes: xlabel='Limit', ylabel='Balance'>], dtype=object))
<Figure size 800x300 with 1 Axes>

The non-linearity in the beginning is clearly visible, leading to a compensation of the model with a negative intercept.

The epistemic uncertainty is quite small in comparison with the aleatoric certainty (unexplained variance in the data after fitting the model).

e)

How good is your model? Compute the Bayesian predictive versions of RMSE, MAE and R2R^{2} score. On this occasion, make sure that you understand the Bayesian interpretation of R2R^{2} proposed in the lecture.

credit_model1.predict(credit_trace1, kind="response")
ypred = az.extract( credit_trace1.posterior_predictive ).Balance.values.T
ytrue = credit_data.Balance.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(330.7474098511541)

Given the credit card limit, we are able to model the balance by ± 331 USD.

MAE:

np.mean( np.abs( ypred - ytrue ) )
np.float64(260.4780826723066)

MAE is quite a bit smaller, indicating the presence of points further away from the line than expected.

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.659406 r2_std 0.018883 dtype: float64

The model is doing reasonably well for just one predictor.

f)

You try to predict the balance for a customer with a limit of 3000 USD. Compute the predictive distribution and summarize and visualize it (use a 90% HDI). Does the recovered distribution make sense to you? Give also a summary of its aleatoric and epistemic uncertainty. Hint: For epistemic and aleatoric uncertainty use predict() on your model and pass your data point to get the predictive distribution conditioned on it. The epistemic uncertainty σe2\sigma_{e}^{2} is the variance of the predicted means μ\mu and the predictive uncertainty σp2\sigma_{p}^{2} the variance of the predictions for Balance. You may recover the aleatoric uncertainty using σp2=σe2+σa2\sigma_{p}^{2}=\sigma_{e}^{2}+\sigma_{a}^{2}.

Compute predictive distribution:

pred = credit_model1.predict( credit_trace1, kind="response", data=pd.DataFrame({'Limit': [3000]}), inplace=False )
pm.summary( pred.posterior_predictive, hdi_prob=0.9 )
Loading...
pm.plot_posterior( pred.posterior_predictive, hdi_prob=0.9 )
<Axes: title={'center': 'Balance\n0'}>
<Figure size 800x300 with 1 Axes>

With a probability of 90%, the balance will be between -153 and 618 USD. This is clearly an artefact of a ‘wrong’ likelihood model (normal), since balance (debts) cannot be negative.

var_e = pred.posterior.mu.var().values
var_p = pred.posterior_predictive.Balance.var().values
var_a = var_p - var_e
np.sqrt( var_a ), np.sqrt( var_e ), np.sqrt( var_p )
(np.float64(232.36122763552714), np.float64(14.903739211518685), np.float64(232.8387028626775))

Ratios:

var_e/var_p, var_a/var_p
(np.float64(0.004097133761204795), np.float64(0.9959028662387952))

Most of the uncertainty is of aleatoric nature! Collecting more data will not make the model better.

Exercise 2: Alternative Likelihoods

You are a bit frustrated by the negative balances predicted by your model in Exercise 1. You have the brilliant idea to change the likelihood of your model to produce only positive values. Instead of yβ0,β1,σN(β0+β1x,σ2)y|\beta_{0}, \beta_{1}, \sigma\sim N(\beta_{0}+\beta_{1}x,\sigma^{2}), you choose the likelihood

yTruncatedNormal(μ=β0+β1x,σ2,lower=0,upper=2000),y \sim TruncatedNormal(\mu = \beta_{0} + \beta_{1}x, \sigma^{2}, \text{lower}=0, \text{upper}=2000),

since the maximum observed balance in your data is 2000.

a)

Since the TruncatedNormal likelihood is not available in Bambi, implement your model in PyMC (slight modification of code used in lecture) and run the posterior simulation.

x = credit_data.Limit
y_obs = credit_data.Balance

with pm.Model() as credit_model2:
    # priors
    beta0 = pm.Normal('beta0', mu=credit_data.Balance.mean(), sigma=10 )
    beta1 = pm.Normal('beta1', mu=0, sigma=10 )
    sigma = pm.Exponential('sigma', lam=1/10 )

    # likelihood
    y = pm.TruncatedNormal('y', mu=beta0+beta1*x, lower=0, upper=2000, sigma=sigma, observed=y_obs )

    # simulate posterior
    credit_trace2 = pm.sample( 2000 )
Loading...
Loading...

b)

Compute the mean values for β0\beta_{0} and β1\beta_{1} and overplot a scatterplot of the observed data with the mean regression line. Interpret your results. Would you use this model in favour of the model in Exercise 1?

mean_beta0 = credit_trace2.posterior.beta0.mean().values
mean_beta1 = credit_trace2.posterior.beta1.mean().values
xrange = np.linspace( credit_data.Limit.min(), credit_data.Limit.max(), 1000 )

credit_data.plot.scatter( x="Limit", y="Balance" )
plt.plot( xrange, mean_beta0 + mean_beta1 * xrange )
<Figure size 800x300 with 1 Axes>

Even though the model does not produce negative values anymore, it’s bias is now too strong! The visible trend cannot be fitted. This model should definitely not be used.

Exercise 3: Multiple Linear Regression on Insurance Data

In this exercise, we are going to look at a sample insurance dataset that you can find in insurance.csv. The goal is to use the set of predictors provided in the data to predict the yearly charges in USD for clients to get an estimate of the overall reserve budget.

a)

Load the dataset with Pandas and get an overview over the nature of the different predictors that are provided. For a first analysis, we will use the predictors age, bmi and smoker for an estimation of the total charges. To this end, perform an exploratory data analysis and plot age and bmi against charges in two separate plots. Since smoker is a categorical variable, use it to color your observations (e.g. in seaborn with the hue parameter).

insurance_data = pd.read_csv("insurance.csv")
insurance_data.head()
Loading...
sns.scatterplot( data=insurance_data, x="age", y="charges", hue="smoker" )
<Axes: xlabel='age', ylabel='charges'>
<Figure size 800x300 with 1 Axes>
sns.scatterplot( data=insurance_data, x="bmi", y="charges", hue="smoker" )
<Axes: xlabel='bmi', ylabel='charges'>
<Figure size 800x300 with 1 Axes>

Smoker is clearly a very important variable!

b)

Fit a multiple Bayesian linear regression model with Bambi that predicts charges from age, bmi and smoker. Summarize the 95% HDIs for all involved parameters and give an interpretation for the role of each parameter in the prediction. Are all parameters significant in the Bayesian sense? Compute Bayesian RMSE and R2R^{2} to quantify the performance of the model.

insurance_model1 = bmb.Model("charges ~ age + bmi + smoker", data=insurance_data, family="gaussian")
insurance_trace1 = insurance_model1.fit(draws=2000, tune=2000)
pm.summary( insurance_trace1, hdi_prob=0.95 )
Loading...
Loading...
Loading...

All parameters are significant at the 95% level (no HDI includes zero). The model can be interpreted as follows: Each year of age results in a mean of 260 USD of additional charges, a BMI increase of 1 results in a mean of additional 322 USD of charges and smokers have on average 23’800 USD more insurance charges than non-smokers!

insurance_model1.predict(insurance_trace1, kind="response")
ypred = az.extract( insurance_trace1.posterior_predictive ).charges.values.T
ytrue = insurance_data.charges.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(8621.337788653796)

Our predictions will typically be away from the ground truth by ± 8600 USD!

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.663960 r2_std 0.010121 dtype: float64

There is still a lot of variance to be explained!

c)

Fit a model with all the available predictors. Do all predictors significantly contribute to the predictions of the model? Compute Bayesian RMSE and R2R^{2} to quantify the performance of the model. Which model would you rather use, the previous model including only age, bmi and smoker as predictors or the more complex current one?

insurance_data.head()
Loading...
insurance_model2 = bmb.Model("charges ~ age + sex + bmi + children + smoker + region", data=insurance_data, family="gaussian")
insurance_trace2 = insurance_model2.fit(draws=2000, tune=2000)
pm.summary( insurance_trace2, hdi_prob=0.95 )
Loading...
Loading...
Loading...

Sex and region[northwest] do not contribute significantly at the 95% level (HDIs include zero).

Performance metrics:

insurance_model2.predict(insurance_trace2, kind="response")
ypred = az.extract( insurance_trace2.posterior_predictive ).charges.values.T
ytrue = insurance_data.charges.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(8578.626184138278)

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.666538 r2_std 0.010132 dtype: float64

The improvement in terms of metrics is marginal. Rather use the more simple model from b) for this data!

d)

Could your more complex model get better with more data? Compute epistemic and aleatoric variance of the predictive distribution on your dataset using predict() and setting data = insurance_data.

Hint: The rest is like in Exercise 1.

Compute model predictions:

pred = insurance_model2.predict( insurance_trace2, kind="response", data=insurance_data, inplace=False )

Compute different uncertainties (see lecture):

var_e = pred.posterior.mu.var().values
np.sqrt( var_e )
np.float64(10499.037071488348)
var_p = pred.posterior_predictive.charges.var().values
np.sqrt( var_p )
np.float64(12125.139005403647)
var_a = var_p - var_e
np.sqrt( var_a )
np.float64(6065.411484134803)
var_e / var_p, var_a / var_p
(np.float64(0.749765557528311), np.float64(0.25023444247168897))

There is a significant part of epistemic uncertainty! With more data we might thus also be able to use a model with more predictors (emphasis on might).

e)

It is often said that there is an aggravating effect when an individual is both overweight and a smoker, with even bigger impacts than just the sum of the two on personal health. You want to verify this on the given dataset by adding an interaction term between bmi and smoker. Fit a model and report and interpret your results with pm.summary(). Did your prediction performance in terms of R2R^{2} increase?

insurance_model3 = bmb.Model("charges ~ age + bmi + smoker + bmi:smoker", data=insurance_data, family="gaussian")
insurance_trace3 = insurance_model3.fit(draws=2000, tune=2000)
pm.summary( insurance_trace3, hdi_prob=0.95 )
Loading...
Loading...
Loading...

The interaction term is significant (and large), let’s keep it if the predicitive performance increases as well:

insurance_model3.predict(insurance_trace3, kind="response")
ypred = az.extract( insurance_trace3.posterior_predictive ).charges.values.T
ytrue = insurance_data.charges.values
pm.r2_score( ytrue, ypred )
r2 0.752790 r2_std 0.008084 dtype: float64

A very significant improvement by almost 9% in explained variance! This is a clear indicator that this interaction term cannot be neglected when doing predictions.

f)

Go back to your initial plots in a). Is evidence for the significant interaction between overweight and smoking visible in one of the plots?

sns.scatterplot( data=insurance_data, x="bmi", y="charges", hue="smoker" )
<Axes: xlabel='bmi', ylabel='charges'>
<Figure size 800x300 with 1 Axes>

The dependence of charges on BMI is much steeper if the client is a smoker!

Exercise 4: Multiple Linear Regression on Advertisement Data

In this exercise you will perform multiple linear regression on the advertisement dataset introduced by the famous book Introduction to Statistical Learning. It contains data of an advertisement campaign with the budget used for TV, radio and newspaper ads in 1000 USD against the number of units that were sold (in thousands).

a)

Load the dataset with Pandas and get an overview over the different predictors that are provided. Create some exploratory plots to guess which variables will probably be related to the sales variable.

adv_data = pd.read_csv("advertising.csv")
adv_data.head()
Loading...
sns.scatterplot( data=adv_data, x="TV", y="sales" )
<Axes: xlabel='TV', ylabel='sales'>
<Figure size 800x300 with 1 Axes>
sns.scatterplot( data=adv_data, x="radio", y="sales" )
<Axes: xlabel='radio', ylabel='sales'>
<Figure size 800x300 with 1 Axes>
sns.scatterplot( data=adv_data, x="newspaper", y="sales" )
<Axes: xlabel='newspaper', ylabel='sales'>
<Figure size 800x300 with 1 Axes>

Expect contributions by radio and TV, probably not by newspaper.

b)

Fit a multiple linear regression model including all the available predictors. Which contribute significantly to the model? What is your interpretation of the coefficients?

adv_model1 = bmb.Model("sales ~ TV + radio + newspaper", data=adv_data, family="gaussian")
adv_trace1 = adv_model1.fit(draws=2000, tune=2000)
pm.summary( adv_trace1, hdi_prob=0.95 )
Loading...
Loading...
Loading...

The 95% HDI for newspaper includes zero, thus newspaper does not contribute significantly at the 95% level. For each 1000 USD spent on TV ads, 46 more units are sold (43-49 in 95% HDI), for each 1000 USD spent on radio ads, 189 more units are sold (172-205 in 95% HDI).

c)

Compute predictive RMSE and R2R^{2} score. Is this a good model? Give an interpretation in particular of RMSE.

adv_model1.predict(adv_trace1, kind="response")
ypred = az.extract( adv_trace1.posterior_predictive ).sales.values.T
ytrue = adv_data.sales.values

RMSE:

rmse = np.sqrt( np.mean( (ypred - ytrue)**2 ) )
rmse
np.float64(2.392592625451304)
rmse / np.mean(adv_data.sales)
np.float64(0.17062525408816573)

Our predictions will typically be away from the true value by ± 2400 sold units, this is about 20% of the units being typically sold. For the advertisement business I wouldn’t expect much more..

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.826892 r2_std 0.015734 dtype: float64

d)

Refit your model without newspaper. Do RMSE and R2R^{2} change significantly?

adv_model2 = bmb.Model("sales ~ TV + radio", data=adv_data, family="gaussian")
adv_trace2 = adv_model2.fit(draws=2000, tune=2000)
pm.summary( adv_trace2, hdi_prob=0.95 )
Loading...
Loading...
Loading...
adv_model2.predict(adv_trace2, kind="response")
ypred = az.extract( adv_trace2.posterior_predictive ).sales.values.T
ytrue = adv_data.sales.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(2.386114969682464)

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.827571 r2_std 0.015798 dtype: float64

No significant changes in RMSE or R2R^2.

e)

There might be synergy or redundancy between TV and radio advertisement budget: either viewers who get the ad through both channels are even more inclined to buy the product (synergy) or it is enough that they hear it through one channel and the other one is an unnecessary repetition (redundancy), or something in-between of course. Introduce an interaction between TV and radio and interpret your results. Is the interaction term significant? Can we observe a synergy or a redundancy? How does the predictive performance change?

adv_model3 = bmb.Model("sales ~ TV + radio + TV:radio", data=adv_data, family="gaussian")
adv_trace3 = adv_model3.fit(draws=2000, tune=2000)
pm.summary( adv_trace3, hdi_prob=0.95 )
Loading...
Loading...
Loading...

The interaction term is significant and positive - meaning that if TV and radio are high, the amount of sales is even higher (synergy).

Predictive performance:

adv_model3.predict(adv_trace3, kind="response")
ypred = az.extract( adv_trace3.posterior_predictive ).sales.values.T
ytrue = adv_data.sales.values

RMSE:

rmse = np.sqrt( np.mean( (ypred - ytrue)**2 ) )
rmse
np.float64(1.3383383616612556)
rmse / adv_data.sales.mean()
np.float64(0.09544220799866326)

Our predictions will now typically be away from the true value by only ± 1300 sold units! this is about 10% of the units being typically sold! This is good..

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.938312 r2_std 0.006006 dtype: float64

A large part of the variance is now explained.

f)

You might have noticed in the exploratory analysis performed in part a) that you are rather in a heteroscedastic than a homescedastic setting, where variance is not constant anymore but depends on the value of the predictor. To alleviate this, model the logarithm of sales instead of just sales. Show using scatter plots that this is a reasonable approach and compare the performance of your resulting model (including interactions) with your model from d).

adv_data['logsales'] = np.log( adv_data.sales )
adv_data.head()
Loading...
sns.scatterplot( data=adv_data, x="TV", y="logsales", )
<Axes: xlabel='TV', ylabel='logsales'>
<Figure size 800x300 with 1 Axes>
sns.scatterplot( data=adv_data, x="radio", y="logsales" )
<Axes: xlabel='radio', ylabel='logsales'>
<Figure size 800x300 with 1 Axes>
sns.scatterplot( data=adv_data, x="newspaper", y="logsales" )
<Axes: xlabel='newspaper', ylabel='logsales'>
<Figure size 800x300 with 1 Axes>

The variance is now more constant along the predictors, however the residuals look a bit more skewed now. Let’s check how this works!

adv_model4 = bmb.Model("logsales ~ TV + radio + TV:radio", data=adv_data, family="gaussian")
adv_trace4 = adv_model4.fit(draws=2000, tune=2000)
pm.summary( adv_trace4, hdi_prob=0.95 )
Loading...
Loading...
Loading...

Predictive performance:

adv_model4.predict(adv_trace4, kind="response")
ypred = np.exp( az.extract( adv_trace4.posterior_predictive ).logsales.values.T )
ytrue = adv_data.sales.values

RMSE:

rmse = np.sqrt( np.mean( (ypred - ytrue)**2 ) )
rmse
np.float64(3.2639485089464157)
rmse / adv_data.sales.mean()
np.float64(0.23276509245472743)

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.796211 r2_std 0.017985 dtype: float64

The model performs much worse. Probably because we violated the assumption of normally distributed residuals.

Exercise 5: Robust Linear Regression

In this exercise, you will model body fat from BMI (data in bodyfat.csv) as done in the lecture, however including the real outliers in the dataset.

a)

Load the dataset with Pandas, create a scatter plot of BMI against BodyFat and convince your self of the presence of outliers that might influence the coefficients of linear models.

bodyfat_data = pd.read_csv("bodyfat.csv")
bodyfat_data.head()
Loading...
sns.scatterplot( bodyfat_data, x="BMI", y="BodyFat" )
<Axes: xlabel='BMI', ylabel='BodyFat'>
<Figure size 800x300 with 1 Axes>

b)

Fit a simple linear regression model with normal likelihood with Bambi and assess its predictive performance in terms of RMSE and R2R^{2}. In addition, plot the regression model with bmb.interpret.plot_predictions().

bodyfat_model1 = bmb.Model("BodyFat ~ BMI", data=bodyfat_data, family="gaussian")
bodyfat_trace1 = bodyfat_model1.fit(draws=2000, tune=2000)
pm.summary( bodyfat_trace1, hdi_prob=0.95 )
Loading...
Loading...
Loading...

Plot:

bodyfat_data.plot.scatter( x="BMI", y="BodyFat" )
bmb.interpret.plot_predictions( bodyfat_model1, bodyfat_trace1, "BMI", prob=0.95, ax=plt.gca() )
(<Figure size 800x300 with 1 Axes>, array([<Axes: xlabel='BMI', ylabel='BodyFat'>], dtype=object))
<Figure size 800x300 with 1 Axes>

Predictive performance:

bodyfat_model1.predict(bodyfat_trace1, kind="response")
ypred = az.extract( bodyfat_trace1.posterior_predictive ).BodyFat.values.T
ytrue = bodyfat_data.BodyFat.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(8.151465963173502)

Our predictions will now typically be away from the true value by around ± 8%.

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.513199 r2_std 0.027314 dtype: float64

There are better R2R^2 values in the world..

c)

As an alternative, fit a robust model that uses a Student’s t distribution as likelihood instead of a normal distribution (see exercises in week 5 for insights into Student’s t distribution). You may do this with Bambi by choosing family="t" instead of family="gaussian". Do the non-robust and the robust model differ significantly? Compare RMSE and R2R^{2} and plot both models in the same plot with bmb.interpret.plot_predictions(). Which model would you use at the end?

bodyfat_model2 = bmb.Model("BodyFat ~ BMI", data=bodyfat_data, family="t")
bodyfat_trace2 = bodyfat_model2.fit(draws=2000, tune=2000)
pm.summary( bodyfat_trace2, hdi_prob=0.95 )
Loading...
Loading...
Loading...

Plot:

bodyfat_data.plot.scatter( x="BMI", y="BodyFat" )
bmb.interpret.plot_predictions( bodyfat_model1, bodyfat_trace1, "BMI", prob=0.95, ax=plt.gca() )
bmb.interpret.plot_predictions( bodyfat_model2, bodyfat_trace2, "BMI", prob=0.95, ax=plt.gca() )
(<Figure size 800x300 with 1 Axes>, array([<Axes: xlabel='BMI', ylabel='BodyFat'>], dtype=object))
<Figure size 800x300 with 1 Axes>

Predictive performance:

bodyfat_model2.predict(bodyfat_trace2, kind="response")
ypred = az.extract( bodyfat_trace2.posterior_predictive ).BodyFat.values.T
ytrue = bodyfat_data.BodyFat.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(8.182406471636726)

Our predictions will now typically be away from the true value by around ±8\pm 8%.

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.524749 r2_std 0.027579 dtype: float64

Only very slight improvement! Probably better prefer the simpler model.

d)

As an addition model selection tool, you consider computing ELPDs for both models and to compare them (see week 5). Compute the log-likelihood for both model traces with pm.compute_log_likelihood() (in the context of the underlying PyMC model that can be accessed through backend.model) and then use pm.compare() and pm.plot_compare(). Which do you trust more in this case, RMSE and R2R^{2} or ELPD?

Compute log likelihoods:

with bodyfat_model1.backend.model:
    pm.compute_log_likelihood( bodyfat_trace1 )

with bodyfat_model2.backend.model:
    pm.compute_log_likelihood( bodyfat_trace2 )
Loading...
Loading...
Loading...
Loading...

Compare ELPDs:

compare_dict = {
    'Normal model': bodyfat_trace1, 
    'Robust model': bodyfat_trace2 
}
comp = pm.compare( compare_dict )
comp
Loading...
pm.plot_compare( comp )
<Axes: title={'center': 'Model comparison\nhigher is better'}, xlabel='elpd_loo (log)', ylabel='ranked models'>
<Figure size 600x200 with 1 Axes>

There seems to a big difference in terms of predictive performance. However both loo estimates produced warnings and our predictive performance estimates say otherwise. Personally, I would trust more in RMSE and R2R^2 here.

e)

Just for fun: Fit a robust multiple linear regression model with all variables (except Density that is used to compute BodyFat) and check whether it is an improvement on the models in b) and c) in terms of RMSE and R2R^{2}. Check the posterior distribution of ν\nu. Would you at the end rather use a normal or a Student’s t likelihood?

bodyfat_data.head()
Loading...
bodyfat_model3 = bmb.Model(
    "BodyFat ~ Age + Weight + Height + Neck + Chest + Abdomen + Hip + Thigh + Knee + Ankle + Biceps + Forearm + Wrist + BMI", 
    data=bodyfat_data, 
    family="t"
)
bodyfat_trace3 = bodyfat_model3.fit(draws=2000, tune=2000)
pm.summary( bodyfat_trace3, hdi_prob=0.95 )
Loading...
Loading...
Loading...

The uncertainty on ν\nu is still quite large! Since it is large (mean of 33), probably a normal likelihood could be preferred. The following variables do not seem to be significant at to 95% level: Neck, Chest, Thigh, Knee, Ankle, Biceps, Forearm. It might make sense to refit a model without them.

Predictions:

bodyfat_model3.predict(bodyfat_trace3, kind="response")
ypred = az.extract( bodyfat_trace3.posterior_predictive ).BodyFat.values.T
ytrue = bodyfat_data.BodyFat.values

RMSE:

np.sqrt( np.mean( (ypred - ytrue)**2 ) )
np.float64(6.068379228018641)

Our predictions will now typically be away from the true value by around ±6\pm 6%, this is 25% better!

R2R^2:

pm.r2_score( ytrue, ypred )
r2 0.662684 r2_std 0.023387 dtype: float64

Quite a bit more variance explained!

Exercise 6: Leaking Satellite Fuel Tank

You work for the European Space Agency in the maintenance team for a satellite. Next to scientific data, the satellite sends back a host of maintenance data, among other things the fuel level of its thrusters that are needed to keep it on the designed orbit. Since sending satellite data down to Earth is expensive (antenna network distributed over the Earth) and a large part of the bandwidth is reserved for scientific data and redundancy, you get only one fuel level measurement per day.

Three days ago you have discovered that the fuel tank level has decreased by a larger amount than usual. You suspect that the satellite has been hit by space debris that has produced a hole in the fuel tank. The goal of this exercise is to compute the remaining time in the satellite mission, before the thrusters fail by lack of fuel and the satellite will leave its carefully calibrated orbit.

a)

Load the dataset from fuel_data.csv. It contains the time tt in days (starting from zero), the tank level in milligrams and the suspected state of the fuel tank (nominal/leaking). Plot the tank level against the time tt and verify that the point in time where the tank level decreases more quickly is visible (at t=19t=19).

fuel_data = pd.read_csv("fuel_data.csv")
fuel_data
Loading...

Plot:

sns.lineplot( data=fuel_data, x="t", y="tank_level", c="black" )
sns.scatterplot( data=fuel_data, x="t", y="tank_level", hue="state", s=50 )
<Axes: xlabel='t', ylabel='tank_level'>
<Figure size 800x300 with 1 Axes>

b)

First you fit a model to the nominal data points with the yet intact fuel tank to guess the regular rate rr of fuel consumption in mg. To this end, fit a linear regression model to t[0,19]t \in [0, 19] and guess a 95% HDI of rr.

Because there is prior information available from your engineering team, you decide to choose your own priors instead of letting Bambi choose weak default priors:

  • Bambi chooses for intercept not the value at t=0t=0, but the value in the center of the data distribution. In this case, the center falls between 9 and 10, so you choose as intercept y0=12(tank_level[9]+tank_level[10])y_{0}=\frac{1}{2}(\text{tank\_level}[9] + \text{tank\_level}[10]). The engineers in your project team say that there is an uncertainty of σ=0.2\sigma=0.2 mg in the tank level.

  • For fuel rate, your engineering team says that the thrusters consume at around 40μg40\mu g per day, possibly also something between 20μg20\mu g and 60μg60\mu g. Use PreliZ to devise an appropriate normal prior. Hint: Make sure that the rate is negative (decrease).

Pass your priors to bambi and fit a model to the first 20 data points (t19t \le 19). Visualize your model predictions with bmb.interpret.plot_predictions() with pps=True and estimate a 95% HDI for the regular fuel consumption rate rr.

Prior for intercept:

y0 = (fuel_data.tank_level[9]+fuel_data.tank_level[10])/2
pz.Normal(y0, 0.2).plot_pdf()
<Axes: >
<Figure size 800x300 with 1 Axes>

Prior for regular fuel decrease:

pz.Normal(-0.04, 0.01).plot_pdf()
<Axes: >
<Figure size 800x300 with 1 Axes>

Fit model:

priors = {
    "Intercept": bmb.Prior("Normal", mu=y0, sigma=0.2),
    "t": bmb.Prior("Normal", mu=-0.04, sigma=0.01), 
}
normal_tank_model = bmb.Model("tank_level ~ t", data=fuel_data[fuel_data.t <= 19], priors=priors, family="gaussian")
normal_tank_trace = normal_tank_model.fit(draws=2000, tune=2000)
Loading...
Loading...
fuel_data.plot( x="t", y="tank_level", marker="o", c="blue" )
bmb.interpret.plot_predictions( normal_tank_model, normal_tank_trace, "t", prob=0.95, pps=True, ax=plt.gca() )
(<Figure size 800x300 with 1 Axes>, array([<Axes: xlabel='t', ylabel='tank_level'>], dtype=object))
<Figure size 800x300 with 1 Axes>
pm.summary( normal_tank_trace, hdi_prob=0.95 )
Loading...

Typical fuel consumption between 37 and 47 μg\mu g per day (95% HDI, results may vary slightly for different simulations).

c)

Now fit a model to the data points where you suspect the leak (t19t \ge 19). This time let Bambi choose default priors for you. Visualize your model predictions again with bmb.interpret.plot_predictions() with pps=True and estimate a 95% HDI for the leaking fuel consumption rate rr^{*}.

fuel_data[fuel_data.t>=19]
Loading...
leaking_tank_model = bmb.Model("tank_level ~ t", data=fuel_data[fuel_data.t>=19], family="gaussian")
leaking_tank_trace = leaking_tank_model.fit(draws=2000, tune=2000)
Loading...
Loading...
fuel_data.plot( x="t", y="tank_level", marker="o", c="blue" )
bmb.interpret.plot_predictions( leaking_tank_model, leaking_tank_trace, "t", prob=0.95, ax=plt.gca() )
(<Figure size 800x300 with 1 Axes>, array([<Axes: xlabel='t', ylabel='tank_level'>], dtype=object))
<Figure size 800x300 with 1 Axes>
pm.summary( leaking_tank_trace )
Loading...

fuel loss between 228 and 697 μ\mug per day! (results may vary from simulation to simulation, will get better with more data)

d)

Mission control asks you how long they can expect the satellite mission to continue, i.e. how many days the mission can last until the fuel tank is empty. Use the 95% HDI on rr^{*} to give an estimate.

t0 = fuel_data.tank_level[22] / np.abs( pm.hdi( leaking_tank_trace.posterior.t, 0.95 ) )
t0
Loading...

Between 63 and 232 days with a belief of 95%, however this is probably quite uncertain (especially if you re-run the simulation). You’ll know more with more data!

Let us plot the distribution to better understand our uncertainty:

t0_values = -fuel_data.tank_level[22] / leaking_tank_trace.posterior.t.values.flatten()
plt.hist( t0_values.clip(0, 300), bins=20 );
<Figure size 800x300 with 1 Axes>

(a few strong outliers - interesting exercise: find a good prior for tt to make them disappear)