
Bayesian Linear Regression

Simple linear regression

We start with a single predictor $x$ and a continuous response $y$.

The deterministic regression line is

$$\mu(x) = \mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x,$$

with intercept $\beta_0$ and slope $\beta_1$.

To make this into a probabilistic model, we add a noise term:

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad i = 1,\dots,n.$$

Equivalently, the likelihood for each observation is

$$Y_i \mid \beta_0,\beta_1,\sigma,x_i \sim \mathcal{N}(\beta_0 + \beta_1 x_i,\; \sigma^2).$$

In vector–matrix form, with design matrix

$$X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \boldsymbol{\beta} = \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix},$$

the model is

$$\mathbf{y} \mid \boldsymbol{\beta}, \sigma \sim \mathcal{N}\big(X\boldsymbol{\beta}, \sigma^2 I_n\big).$$

Likelihood and ordinary least squares (OLS)

Under the normal likelihood

$$p(\mathbf{y} \mid \boldsymbol{\beta},\sigma,X) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 \right),$$

the log-likelihood (up to additive constants) is

$$\log p(\mathbf{y} \mid \boldsymbol{\beta},\sigma,X) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2 + \text{const}.$$

Maximizing this log-likelihood w.r.t. $\boldsymbol{\beta}$ (for fixed $\sigma$) is equivalent to minimizing the sum of squared residuals:

$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = \arg\min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \big(y_i - (\beta_0 + \beta_1 x_i)\big)^2.$$

In matrix form, the OLS solution has a closed form:

$$\hat{\boldsymbol{\beta}}_{\text{OLS}} = (X^\top X)^{-1} X^\top \mathbf{y},$$

provided $X^\top X$ is invertible.

OLS is the maximum likelihood estimator for $\boldsymbol{\beta}$ under the normal linear model with constant variance $\sigma^2$.
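
As a quick sanity check, here is a minimal NumPy sketch of the OLS fit (the data values are made up for illustration); np.linalg.lstsq solves the least-squares problem without explicitly inverting $X^\top X$:

import numpy as np

# illustrative data (not from the text)
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])

X = np.column_stack((np.ones_like(x), x))          # design matrix with intercept column
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # minimizes ||y - X beta||^2
print(beta_hat)                                    # [beta0_hat, beta1_hat]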

Bayesian linear regression

In the Bayesian view, the parameters $(\beta_0,\beta_1,\sigma)$ are random variables with a prior distribution.

A common prior choice (conceptually) is:

  • Regression coefficients:

    $$\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{\mu}_0,\; \Sigma_0),$$
  • Noise scale:

    $$\sigma \sim \text{some prior on } (0,\infty) \quad \text{(e.g. half-normal, exponential, or half-Cauchy)}.$$

Given the data $(\mathbf{y}, X)$, the posterior over parameters is

$$p(\boldsymbol{\beta},\sigma \mid \mathbf{y}, X) = \frac{ p(\mathbf{y} \mid \boldsymbol{\beta},\sigma,X)\, p(\boldsymbol{\beta},\sigma) }{ p(\mathbf{y} \mid X) }.$$

Using Bayes' theorem explicitly,

$$p(\boldsymbol{\beta},\sigma \mid \mathbf{y}, X) \propto p(\mathbf{y} \mid \boldsymbol{\beta},\sigma,X)\, p(\boldsymbol{\beta},\sigma).$$

For conjugate priors (normal–inverse-gamma), the posterior has an analytic form; in practice we often let PyMC/Bambi do the MCMC sampling from $p(\boldsymbol{\beta},\sigma \mid \mathbf{y},X)$.
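
A minimal PyMC sketch of this model (assuming NumPy arrays x and y hold the data; the particular priors are illustrative, not canonical):

import pymc as pm

with pm.Model() as linreg:
    beta0 = pm.Normal("beta0", mu=0, sigma=10)    # prior on intercept
    beta1 = pm.Normal("beta1", mu=0, sigma=10)    # prior on slope
    sigma = pm.HalfNormal("sigma", sigma=1)       # prior on noise scale
    mu = beta0 + beta1 * x                        # regression line
    pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()                           # MCMC draws from the posterior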

Posterior predictive distribution for regression

For a new input $x_{\text{new}}$, the model implies

$$Y_{\text{new}} \mid \beta_0,\beta_1,\sigma,x_{\text{new}} \sim \mathcal{N}(\beta_0 + \beta_1 x_{\text{new}}, \sigma^2).$$

The posterior predictive distribution, which averages over uncertainty in $(\beta_0,\beta_1,\sigma)$, is

$$p(y_{\text{new}} \mid x_{\text{new}}, \mathbf{y}, X) = \int p(y_{\text{new}} \mid x_{\text{new}}, \beta_0,\beta_1,\sigma)\, p(\beta_0,\beta_1,\sigma \mid \mathbf{y}, X)\, d\beta_0\, d\beta_1\, d\sigma.$$

In practice, with posterior samples $\{(\beta_0^{(s)},\beta_1^{(s)},\sigma^{(s)})\}_{s=1}^{S}$, we approximate this by simulation (see the sketch after this list):

  1. For each posterior draw $s$, sample

    $$y_{\text{new}}^{(s)} \sim \mathcal{N}\big(\beta_0^{(s)} + \beta_1^{(s)} x_{\text{new}}, (\sigma^{(s)})^2\big).$$
  2. The sample $\{y_{\text{new}}^{(s)}\}_{s=1}^{S}$ approximates $p(y_{\text{new}} \mid x_{\text{new}}, \mathbf{y}, X)$.
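
A minimal NumPy sketch of this simulation, assuming beta0_s, beta1_s, and sigma_s are 1-D arrays holding the S posterior draws and x_new is a scalar:

import numpy as np

rng = np.random.default_rng(0)
# one predictive draw per posterior draw; vectorized over the S draws
y_new_s = rng.normal(loc=beta0_s + beta1_s * x_new, scale=sigma_s)
# the array y_new_s approximates p(y_new | x_new, y, X)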

The posterior predictive distribution combines:

  • Aleatoric uncertainty: the inherent noise $\varepsilon$ in $y$,

  • Epistemic uncertainty: uncertainty about $(\beta_0,\beta_1,\sigma)$ due to finite data.

Epistemic vs aleatoric uncertainty in regression

In the linear model

$$Y = \beta_0 + \beta_1 X + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0,\sigma^2),$$

we distinguish:

  • Aleatoric uncertainty: variability in $Y$ even if $(\beta_0,\beta_1,\sigma)$ were known exactly. In this model, for a fixed $x$, this is just $\operatorname{Var}(Y \mid X=x,\beta_0,\beta_1,\sigma) = \sigma^2$.

  • Epistemic uncertainty: additional variability in predictions due to uncertainty in $(\beta_0,\beta_1,\sigma)$ as captured by the posterior $p(\beta_0,\beta_1,\sigma \mid \mathbf{y},X)$.

For the posterior predictive variance at a given $x_{\text{new}}$, the law of total variance gives

$$\operatorname{Var}(Y_{\text{new}} \mid x_{\text{new}},\mathbf{y},X) = \mathbb{E}_{\theta \mid \mathbf{y},X} \big[ \operatorname{Var}(Y_{\text{new}} \mid x_{\text{new}},\theta) \big] + \operatorname{Var}_{\theta \mid \mathbf{y},X} \big[ \mathbb{E}(Y_{\text{new}} \mid x_{\text{new}},\theta) \big],$$

where $\theta = (\beta_0,\beta_1,\sigma)$.

  • The first term corresponds to the aleatoric variance (the expected $\sigma^2$).

  • The second term corresponds to the epistemic variance (spread of the regression line due to parameter uncertainty).

With many data points covering a range of $x$, the epistemic term often becomes small near the data but remains large when extrapolating far away from observed $x$ values.
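
This decomposition is easy to compute from posterior draws. A sketch, again assuming arrays beta0_s, beta1_s, sigma_s of S posterior draws and a scalar x_new:

import numpy as np

mu_new_s = beta0_s + beta1_s * x_new   # E[Y_new | x_new, theta^(s)] per draw
aleatoric = np.mean(sigma_s**2)        # E_theta[ Var(Y_new | x_new, theta) ]
epistemic = np.var(mu_new_s)           # Var_theta[ E(Y_new | x_new, theta) ]
total_var = aleatoric + epistemic      # law of total variance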

Gaussian linear regression

import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, IntSlider, FloatSlider, Checkbox

plt.rcParams["figure.figsize"] = (4, 4)

Linear regression is perhaps the most frequently used technique in applied statistics for modelling the relationship between a set of predictor variables $\left\lbrace \mathbf{x}_i \right\rbrace_{i=1}^N$ and a response variable $\left\lbrace y_i \right\rbrace_{i=1}^N$. More formally, let $\mathbf{X} \in \mathbb{R}^{N \times D}$ be a design matrix and let $\mathbf{y} \in \mathbb{R}^N$ be the response variables collected into a single vector; then the linear regression model is given by

$$\mathbf{y} = \mathbf{X}\mathbf{w} + \boldsymbol{\epsilon},$$

where $\mathbf{w} \in \mathbb{R}^D$ is the vector of regression weights and $\boldsymbol{\epsilon} \in \mathbb{R}^N$ is the observation noise vector.

Assuming isotropic Gaussian noise and imposing a multivariate Gaussian prior $\mathbf{w} \sim \mathcal{N}\left(\mathbf{m}, \mathbf{S}\right)$ gives rise to the following joint distribution:

$$p(\mathbf{y}, \mathbf{w}) = p\left(\mathbf{y}|\mathbf{w}\right)p\left(\mathbf{w}\right) = \mathcal{N}\left(\mathbf{y}\big|\mathbf{Xw}, \sigma^2\mathbf{I}\right)\mathcal{N}\left(\mathbf{w}\big|\mathbf{m}, \mathbf{S}\right)$$

In this notebook, we will use the following simple model as a running example:

$$y_i = ax_i + b + \epsilon_i = \begin{bmatrix}x_i&1\end{bmatrix}\begin{bmatrix}a\\b\end{bmatrix} + \epsilon_i$$

That is, the parameters are $\mathbf{w} = \left[a, b\right]^\top$, where $a$ and $b$ are the slope and intercept of the line, respectively. Furthermore, we will assume $\mathbf{m} = \mathbf{0}$, $\mathbf{S} = \mathbf{I}$, and $\sigma^2 = 2$.

We will use the following dataset with $N = 8$ data points:

# hyperparameters of the prior
m = np.zeros((2, 1))
S = 1.0*np.identity(2)

# noise variance
sigma2 = 2

# data
N = 8
x = np.array([1.764, 0.4, 0.979, 2.241, 1.868, -0.977,  0.95, -0.151])[:, None]
y = np.array([-0.464, 2.024, 3.191, 2.812, 6.512, -3.022, 1.99, 0.009])[:, None]

# plot
plt.plot(x, y, 'k.', label='Data', markersize=12)
plt.xlabel('Predictor Variable x')
plt.ylabel('Response y')
plt.legend()
plt.xlim((-3, 3))
plt.ylim((-6, 6))
plt.grid()
[Figure: scatter plot of the N = 8 data points]

The prior, likelihood and posterior

This section illustrates the basic building blocks of the linear model.

Reference: Rasmussen and Williams, Section 2.1, for the derivation of these functions (http://www.gaussianprocess.org/gpml/).

The multivariate Gaussian density function

The multivariate Gaussian density function, implemented using NumPy:

$$\mathcal{N}(x \mid \mu, \Sigma) = (2\pi)^{-\frac{D}{2}} | \Sigma |^{-\frac{1}{2}} \exp\left[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right]$$
gaussian_linear_regression.py
def mvn_pdf(x, mu, Sigma, log=True):
    """ Returns the density of a multivariate Gaussian distribution
    with mean mu and covariace Sigma at point x

    Arguments:
    x     -- (Dx1) evaluation point
    mu    -- (Dx1) mean vector
    Sigma -- (DxD) covariance matrix
    log   -- (bool) if true, return log density. If false, return the density. (default=True)

    Returns:
    (scalar) density
    """
    D = len(mu)
    Sigma = Sigma + 1e-10 * np.identity(D)
    L = np.linalg.cholesky(Sigma)
    v = np.linalg.solve(L, x-mu)
    const_term = -0.5 * D * np.log(2 * np.pi)
    det_term = -0.5 * 2 * np.sum(np.log(np.diag(L)))
    quad_term = -0.5 * np.sum(v ** 2)
    if log:
        return const_term + det_term + quad_term
    else:
        return np.exp(const_term + det_term + quad_term)
  • Computes the (log-)density of a multivariate Gaussian $\mathcal{N}(x \mid \mu, \Sigma)$ using a numerically stable Cholesky decomposition of the covariance matrix.

  • Adds a small diagonal jitter to $\Sigma$ to ensure positive definiteness and avoid numerical issues during factorization.

  • Uses the Cholesky factor $L$ to efficiently solve $v = L^{-1}(x - \mu)$, which is required to compute the quadratic form $(x - \mu)^\top \Sigma^{-1} (x - \mu) = \|v\|^2$.

  • Combines the constant term, the log-determinant term (computed from $\log \det \Sigma = 2 \sum_i \log L_{ii}$), and the quadratic term to return either the log-density or the density.

from scipy.stats import multivariate_normal
import gaussian_linear_regression as glr

x_test = np.array([1,1])
print("Scipy multivariate_normal model")
pdf = multivariate_normal.pdf(x_test, mean=m.flatten(), cov=S)
print("The pdf of [1,1] is", pdf)
logpdf = multivariate_normal.logpdf(x_test, mean=m.flatten(), cov=S)
print("The logpdf of [1,1] is", logpdf)

x_test = np.array([1,1]).reshape(2,1)
print("\nOwn mvn_pdf")
pdf = glr.mvn_pdf(x_test, mu=m, Sigma=S, log=False)
print("The pdf of [1,1] is", pdf)
logpdf = glr.mvn_pdf(x_test, mu=m, Sigma=S, log=True)
print("The logpdf of [1,1] is", logpdf)
Scipy multivariate_normal model
The pdf of [1,1] is 0.05854983152431917
The logpdf of [1,1] is -2.8378770664093453

Own mvn_pdf
The pdf of [1,1] is 0.05854983152431917
The logpdf of [1,1] is -2.8378770664093453

The log-density functions

The log-densities of the (i) prior, (ii) likelihood, and (iii) posterior (likelihood times prior), implemented using the mvn_pdf function.

gaussian_linear_regression.py
def log_prior(a, b, m, S):
    """ returns the log density of the prior at the points (a,b) given m and S

    Arguments:
    a    -- (scalar) slope parameter
    b    -- (scalar) intercept parameter
    m    -- (2x1) The prior mean
    S    -- (2x2) The prior covariance

    Returns
    (scalar) log density for the pair (a,b)

    """
    return mvn_pdf(np.array([a,b])[:, None], m, S, log=True)

def log_likelihood(x, y, a, b, sigma2):
    """ returns the log-likelihood for the data (x,y) given the parameters (a,b)

    Arguments:
    x    -- (Nx1) vector of inputs
    y    -- (Nx1) vector of responses
    a    -- slope parameter
    b    -- intercept parameter
    sigma2 -- (scalar) noise variance

    Returns:
    (scalar) log likelihood of (x,y)
    """
    log_npdf = lambda x, m, v: -0.5 * np.log(2* np.pi * v) - 0.5 * (x - m)**2/v
    return np.sum(log_npdf(y, predict(x,a,b), sigma2))


def log_posterior(x, y, a, b, m, S, sigma2):
    """Returns the log posterior at (a,b), given the data (x,y) and the prior parameters (m, S).

    Arguments:
    x    -- (Nx1) vector of inputs
    y    -- (Nx1) vector of responses
    a    -- slope parameter
    b    -- intercept parameter
    m    -- (2x1) prior mean of (a, b)
    S    -- (2x2) prior covariance of (a, b)
    sigma2 -- (scalar) noise variance

    Returns:
    (scalar) log posterior at (a, b)
    """
    return log_prior(a, b, m, S) + log_likelihood(x, y, a, b, sigma2)


The Visualization

Visualization of the prior density, the likelihood and the posterior as functions of $a$ and $b$.

gaussian_linear_regression.py
def plot_prior_density(m, S, a, b):
    A_array, B_array = np.meshgrid(a, b)
    Z = np.array([log_prior(ai, bi, m, S) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
    Z = Z.reshape((len(a), len(b)))
    plt.contour(a, b, log_to_density(Z), cmap='plasma')
    plt.xlabel('slope')
    plt.ylabel('intercept')
    plt.gca().set_aspect('equal', adjustable='box')

def plot_likelihood(x, y, sigma2, a, b):
    A_array, B_array = np.meshgrid(a, b)
    Z = np.array([log_likelihood(x, y, ai, bi, sigma2) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
    Z = Z.reshape((len(a), len(b)))
    plt.contour(a, b, log_to_density(Z), 10, cmap='plasma')
    plt.xlabel('slope')
    plt.ylabel('intercept')
    plt.gca().set_aspect('equal', adjustable='box')

def plot_posterior_density(x, y, m, S, sigma2, a, b):
    A_array, B_array = np.meshgrid(a, b)
    Z = np.array([log_posterior(x, y, ai, bi, m, S, sigma2) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
    Z = Z.reshape((len(a), len(b)))
    plt.contour(a, b, log_to_density(Z), 10, cmap='plasma')
    plt.xlabel('slope')
    plt.ylabel('intercept')
    plt.gca().set_aspect('equal', adjustable='box')

def log_to_density(Z):
    Z = Z - np.max(Z)
    Z = np.exp(Z)
    return Z/np.sum(Z)

def plot_posterior_credibility_region(x, y, m, S, sigma2, xp):
    mu, Sigma = compute_posterior(x, y, m, S, sigma2)
    mu_f, var_f = compute_f_posterior(xp, mu, Sigma)

    std_f = np.sqrt(var_f)
    mu_f = mu_f.reshape(-1)
    std_f = std_f.reshape(-1)

    plt.figure(figsize=(12, 4))
    plt.plot(x, y, 'k.', markersize=12, label='observations')
    plt.plot(xp, mu_f, 'b', label='mean prediction')
    plt.fill_between(xp, mu_f - 2 * std_f, mu_f + 2 * std_f, alpha=0.2, color='r', label='Posterior credibility region')
    plt.legend()
    plt.grid()

def plot_predictive_region(x, y, m, S, sigma2, xp):
    mu, Sigma = compute_posterior(x, y, m, S, sigma2)
    mu_f, var_f = compute_f_posterior(xp, mu, Sigma)
    mu_f = mu_f.reshape(-1)

    var_y = var_f + sigma2
    std_y = np.sqrt(var_y)
    std_y = std_y.reshape(-1)

    plt.plot(x, y, 'k.', markersize=12, label='observations')
    plt.plot(xp, mu_f, 'b', label='mean prediction')
    plt.fill_between(
        xp,
        mu_f - 2 * std_y,
        mu_f + 2 * std_y,
        alpha=0.2,
        color='r',
        label='Predictive 95% region'
    )
    plt.legend()
    plt.grid()

def rmse(mu_f, y):
    return np.sqrt(np.mean((y - mu_f)**2))

def mae(mu_f, y):
    return np.mean(np.abs(y - mu_f))

def r2(mu_f, y):
    return 1 - np.mean((y - mu_f)**2) / np.var(y)
import gaussian_linear_regression as glr

a = np.linspace(-3,3,100)
b = np.linspace(-3,3,100)

# plot
plt.figure(figsize=(18,6))
plt.subplot(1, 3, 1)
glr.plot_prior_density(m, S, a, b)
plt.title('Prior')
plt.grid()

plt.subplot(1, 3, 2)
glr.plot_likelihood(x, y, sigma2, a, b)
plt.title('Likelihood')
plt.grid()

plt.subplot(1, 3, 3)
glr.plot_posterior_density(x, y, m, S, sigma2, a, b)
plt.title('Posterior')
plt.grid()
[Figure: contour plots of the prior, likelihood, and posterior densities]

Analytical expression of the posterior distribution

Derivation of the posterior $p(\mathbf{w}\big|\mathbf{y}) = \mathcal{N}\left(\mathbf{w}\big|\mu, \Sigma\right)$ yields:

$$\begin{align} \Sigma &= \left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{X} + \mathbf{S}^{-1}\right)^{-1}\\ \mu &= \Sigma\left(\frac{1}{\sigma^2}\mathbf{X}^T\mathbf{y} + \mathbf{S}^{-1}\mathbf{m}\right) \end{align}$$

Reference: Rasmussen and Williams, Section 2.1.1, for the derivation of these functions (http://www.gaussianprocess.org/gpml/).

The Implementation

Implementation of the function for computing the analytical posterior distribution.

gaussian_linear_regression.py
def compute_posterior(x, y, m, S, sigma2):
    """ return the posterior mean and covariance of w given (x,y)
    and hyperparameters m, S and sigma2

    Arguments:
    x      -- (Nx1) vector of inputs
    y      -- (Nx1) vector of responses
    m      -- (Dx1) prior mean
    S      -- (DxD) prior covariance
    sigma2 -- (scalar) noise variance

    Returns:
    mu     -- (Dx1) posterior mean
    Sigma  -- (DxD) posterior covariance

    """
    import scipy as sc
    Sinv = np.linalg.inv(S)
    X = design_matrix(x)
    Sigmainv = Sinv + np.dot(X.T, X)/sigma2
    L = sc.linalg.cho_factor(Sigmainv)
    scaled_mu = np.divide(np.dot(X.T, y), sigma2).reshape(-1, 1) + np.linalg.solve(S, m)
    mu = sc.linalg.cho_solve(L, scaled_mu)
    Sigma = sc.linalg.cho_solve(L, np.identity(len(m)))

    assert mu.shape == m.shape
    assert S.shape == Sigma.shape

    return mu, Sigma

def design_matrix(x):
    """ returns the design matrix for a vector of input values x

    Arguments:
    x    -- (Nx1) vector of inputs

    Returns:
    (Nx2) design matrix

    """
    X = np.column_stack((x, np.ones(len(x))))
    return X
  • Computes the inverse of the posterior covariance matrix $\Sigma_{\text{post}}^{-1} = S^{-1} + \frac{1}{\sigma^2} X^\top X$ using the inverse of the prior covariance Sinv = np.linalg.inv(S) and the design matrix X = design_matrix(x).

  • Forms the Cholesky factorization of the inverse posterior covariance (the precision matrix) using sc.linalg.cho_factor(Sigmainv), enabling stable and efficient solutions for both the posterior mean and covariance without explicitly inverting matrices.

  • Computes the scaled right-hand side $\frac{1}{\sigma^2} X^\top y + S^{-1} m$ using np.dot(X.T, y)/sigma2 + np.linalg.solve(S, m); the call to np.linalg.solve computes $S^{-1}m$ more accurately than explicitly forming $S^{-1}$.

  • Solves the linear system $\Sigma_{\text{post}}^{-1}\,\mu_{\text{post}} = \frac{1}{\sigma^2}X^\top y + S^{-1} m$ via sc.linalg.cho_solve(L, scaled_mu), yielding the posterior mean vector mu.

  • Computes the posterior covariance matrix $\Sigma_{\text{post}} = (\Sigma_{\text{post}}^{-1})^{-1}$ by solving cho_solve(L, I) where I = np.identity(len(m)); this avoids explicit matrix inversion and improves numerical stability.

The Application

The posterior mean and covariance for the data set $(\mathbf{x}, \mathbf{y})$ are calculated using the function above. The plot of the resulting density matches the numerically computed posterior from the previous section.

import gaussian_linear_regression as glr

a = np.linspace(-3,3,100)
b = np.linspace(-3,3,100)

mu, Sigma = glr.compute_posterior(x, y, m, S, sigma2)

A_array, B_array = np.meshgrid(a, b)
AB = np.column_stack((A_array.ravel(), B_array.ravel()))
from scipy.stats import multivariate_normal as mvn
vals = np.array([mvn.logpdf(ab, mu.ravel(), Sigma) for ab in AB]).reshape((len(a), len(b)))

# plot
plt.figure(figsize=(12,6))
plt.subplot(1, 2, 1)
plt.contour(a, b, glr.log_to_density(vals), 10, cmap='plasma')
plt.title('Analytical posterior density')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.gca().set_aspect('equal', adjustable='box')
plt.grid()

plt.subplot(1, 2, 2)
glr.plot_posterior_density(x,y,m,S, sigma2, a, b)
plt.title('Numerical posterior density')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.gca().set_aspect('equal', adjustable='box')
plt.grid()
[Figure: analytical vs. numerical posterior density contours]

Combining all of the above

All the functions implemented above are used to compute the posterior distribution $p\left(\mathbf{w}|\mathbf{y}\right)$ for various sample sizes $N$. The cell below analyses the data set using the first $n$ data points, where the columns visualise (i) the data and sampled regression lines, (ii) the prior, (iii) the likelihood, and (iv) the posterior.

gaussian_linear_regression.py
def generate_mvn_samples(mu, Sigma, M):
    """ return samples from a multivariate normal distribution N(mu, Sigma)

    Arguments:
    mu      -- (Dx1) mean vector
    Sigma   -- (DxD) covariance matrix
    M       -- (scalar) number of samples

    Returns:
    (DxM) matrix, where each column corresponds to a sample
    """

    jitter = 1e-8
    D = len(mu)
    L = np.linalg.cholesky(Sigma + jitter*np.identity(D))
    zs = np.random.normal(0, 1, size=(D, M))
    fs = mu + np.dot(L, zs)
    return fs

def compute_f_posterior(x, mu, Sigma):
    """ compute the posterior distribution of f(x) wrt. posterior distribution N(mu, Sigma)

    Arguments:
    x      -- (Nx1) vector of inputs
    mu     -- (2x1) mean vector
    Sigma  -- (2x2) covariance matrix

    Returns:
    mu_f   -- (Nx1) vector of pointwise posterior means at x
    var_f  -- (Nx1) vector of pointwise posterior variances at x

    """

    X = np.column_stack((x, np.ones(len(x))))
    mu_f = np.dot(X, mu)
    var_f = np.diag(np.dot(np.dot(X, Sigma), X.T))[:, None]

    return mu_f, var_f

  • generate_mvn_samples samples from a multivariate normal $\mathcal{N}(\mu, \Sigma)$ by computing a Cholesky factor L = np.linalg.cholesky(Sigma + jitter*np.identity(D)), where a small jitter term ensures that $\Sigma$ is numerically positive definite.

  • Standard normal samples zs are drawn with np.random.normal(0, 1, size=(D, M)), giving a $D \times M$ matrix of i.i.d. $\mathcal{N}(0, 1)$ variables; multiplying by L linearly transforms them to have covariance $\Sigma$.

  • The final samples are constructed as fs = mu + np.dot(L, zs), implementing $f = \mu + L z$; this uses NumPy broadcasting to add the $D$-dimensional mean vector mu to each of the $M$ sample columns.

  • In compute_f_posterior, the design matrix X = np.column_stack((x, np.ones(len(x)))) encodes a linear model $f(x) = w_1 x + w_0$, with the first column for the input $x$ and the second column for the intercept term.

  • Given a posterior over weights $\mathbf{w} \sim \mathcal{N}(\mu, \Sigma)$, the induced posterior over function values is $f(x) \sim \mathcal{N}(X\mu,\; X \Sigma X^\top)$; this is implemented by mu_f = np.dot(X, mu) for the mean.

  • The pointwise posterior variances $\mathrm{Var}[f(x_i)]$ are the diagonal elements of $X \Sigma X^\top$; these are extracted via np.diag(np.dot(np.dot(X, Sigma), X.T))[:, None], with [:, None] reshaping the vector into an $(N \times 1)$ column.

import gaussian_linear_regression as glr

# New data set with more data points
np.random.seed(21340)
N_2 = 20
x_2 = np.array([1.764, 0.4, 0.979, 2.241, 1.868, -0.977,  0.95, -0.151, -0.103, 0.411, 0.144, 1.454, 0.761, 0.122,
              0.444, 0.334, 1.494, -0.205,  0.313, -0.854])[:, None]
y_2 = np.array([-0.464, 2.024, 3.191, 2.812, 6.512, -3.022, 1.99, 0.009, 2.513, 3.194, 0.935, 3.216, 0.386, -2.118,
               0.674, 1.222, 4.481, 1.893, 0.422, -1.209])[:, None]
xp = np.linspace(-3, 3, 50)[:, None]

a = np.linspace(-3,3,100)
b = np.linspace(-3,3,100)

def plot_BLR(n=2):
    # compute posterior & generate samples
    mu, Sigma = glr.compute_posterior(x_2[:n, :], y_2[:n, :], m, S, sigma2)
    ahat, bhat = glr.generate_mvn_samples(mu, Sigma, M=30)

    # plot
    plt.figure(figsize=(20, 10))

    plt.subplot2grid((1, 4), (0, 0))
    plt.plot(x_2[:n, :], y_2[:n, :], 'k.', markersize=12)
    plt.ylabel('N = %d' % n, fontweight='bold')
    for (ai, bi) in zip(ahat, bhat):
        plt.plot(xp, glr.predict(xp, ai, bi), 'b-', alpha=0.25)
    mu_f, var_f = glr.compute_f_posterior(xp, mu, Sigma)
    std_f = np.sqrt(var_f)
    plt.plot(xp, mu_f, 'r')
    plt.plot(xp, mu_f + 2*std_f, 'r--')
    plt.plot(xp, mu_f - 2*std_f, 'r--')
    plt.ylim((-6, 6))
    plt.title('Data points', fontweight='bold')
    plt.gca().set_box_aspect(1)
    plt.grid()

    plt.subplot2grid((1, 4), (0, 1))
    glr.plot_prior_density(m, S, a, b)
    plt.title('Prior', fontweight='bold')
    plt.gca().set_aspect('equal', adjustable='box')
    plt.grid()

    plt.subplot2grid((1, 4), (0, 2))
    glr.plot_likelihood(x_2[:n, :], y_2[:n, :], sigma2, a, b)
    plt.title('Likelihood', fontweight='bold')
    plt.gca().set_aspect('equal', adjustable='box')
    plt.grid()

    plt.subplot2grid((1, 4), (0, 3))
    glr.plot_posterior_density(x_2[:n, :], y_2[:n, :], m, S, sigma2, a, b)
    plt.plot(ahat, bhat, 'b.', markersize=10)
    plt.title('Posterior', fontweight='bold')
    plt.gca().set_aspect('equal', adjustable='box')
    plt.grid()

    plt.show()

plot_BLR(n=2)
[Figure: data with posterior sample lines, prior, likelihood, and posterior contours for N = 2]

The Quiz

1. How do the blue lines in the first column relate to the blue dots in the fourth column?

Each blue line in the first column is a regression line of the form $y = \text{slope} \cdot x + \text{intercept}$, whose parameters correspond to one blue dot in the fourth column: a (slope, intercept) pair drawn from the posterior contour.

2. What happens as the number of samples increases?

Prior contours stay the same as the number of samples increases.

Likelihood contours become dramatically narrower as the number of samples increases.

Posterior contours also become smaller, like the likelihood contours, though the effect is less pronounced.

Posterior draws: the range of the posterior draws becomes narrower, fitting the data increasingly tightly as the number of samples increases.

The reason for this behaviour is that more data points carry more information, which reduces uncertainty in the regression and makes the likelihood, the posterior, and the posterior draws narrower.

3. Why does the prior have so little effect in the last row?

The amount of data overwhelms the prior: in the last case, there is enough data that the likelihood dominates the prior, making the prior's effect on the posterior draws negligible.

The prior is not informative: if the prior provides little useful information about the distribution of the parameters, it is unlikely to have a significant effect on the posterior. This is true here, as the prior has zero mean and unit variance, which provides little prior knowledge about the underlying data structure once many data points are available.

House Price Example

# House Price Dataset

x = np.array([50,55,59,61,79,81,88,90,91,97,99,105,107,110,111,112,116,117,121,123,124,125,135,141, 142,144,145,149, 150,151])
y = np.array([0.36,0.37,0.28,0.29,0.3, 0.5,0.58,0.61,0.62,0.78,0.77,0.83,0.78,0.84,0.91,0.95,1.05,0.99,0.97,0.93,0.81, 0.9,1.1, 0.98, 0.88, 1.05, 1.02, 1.1, 1.08, 1.12])
plt.figure()
plt.scatter(x,y, label='Data points', color='k')
plt.ylabel("Price", fontsize=16)
plt.xlabel("Area [$m^2$]", fontsize=16)
plt.xlim((40, 170))
plt.ylim((0, 1.3))
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
plt.text(70, 0.35, "?", fontsize=30, color='blue', ha='center')
plt.text(160, 1.05, "?", fontsize=30, color='blue', ha='center')
plt.grid()
plt.legend(loc='lower right', fontsize=16)
plt.tight_layout()
[Figure: house price data with two query points marked "?"]
# Prior hyperparameters and noise variance
m = np.zeros((2, 1))
S = np.eye(2)*[0.1,0.1]
sigma2 = 0.3
print(S)
[[0.1 0. ]
 [0.  0.1]]
import gaussian_linear_regression as glr

a_array = np.linspace(-0.001,0.02,200)
b_array = np.linspace(-1,1,200)

A_array, B_array = np.meshgrid(a_array, b_array)
Z = np.array([glr.log_prior(ai, bi, m, S) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
Z = Z.reshape((len(a_array), len(b_array)))

plt.figure(figsize=(4,4))
plt.contour(a_array, b_array, glr.log_to_density(Z), cmap='plasma')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.title('Prior')
plt.grid()
[Figure: prior density contours]
import gaussian_linear_regression as glr

A_array, B_array = np.meshgrid(a_array, b_array)
Z = np.array([glr.log_likelihood(x, y, ai, bi, sigma2) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
Z = Z.reshape((len(a_array), len(b_array)))

plt.figure(figsize=(4,4))
plt.contour(a_array, b_array, glr.log_to_density(Z), 10, cmap='plasma')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.title('Likelihood')
plt.grid()
[Figure: likelihood contours]
import gaussian_linear_regression as glr

A_array, B_array = np.meshgrid(a_array, b_array)
Z = np.array([glr.log_posterior(x, y, ai, bi, m, S, sigma2) for (ai, bi) in zip(A_array.ravel(), B_array.ravel())])
Z = Z.reshape((len(a_array), len(b_array)))

plt.figure(figsize=(4,4))
plt.contour(a_array, b_array, glr.log_to_density(Z), 10, cmap='plasma')
plt.xlabel('slope')
plt.ylabel('intercept')
plt.title('Posterior')
plt.grid()
[Figure: posterior density contours]
import gaussian_linear_regression as glr

mu, Sigma = glr.compute_posterior(x[:,None], y[:, None], m, S, sigma2)

xp = np.linspace(40, 200, 50)
mu_f, var_f = glr.compute_f_posterior(xp, mu, Sigma)
std_f = np.sqrt(var_f)
mu_f = mu_f.reshape(-1)
std_f = std_f.reshape(-1)

plt.figure(figsize=(12,4))
plt.plot(x, y, 'k.', markersize=12);
plt.plot(xp, mu_f, 'b', label='mean prediction')
plt.fill_between(xp, mu_f - 2*std_f, mu_f + 2*std_f, alpha=0.2, color='r', label='Posterior credibility region')
plt.xlabel('Area [$m^2$]')
plt.ylabel('Price')
plt.legend()
plt.grid()
[Figure: posterior mean and credibility region over the house price data]
var_y = var_f + sigma2

std_y = np.sqrt(var_y)
mu_f = mu_f.reshape(-1)
std_y = std_y.reshape(-1)

plt.figure(figsize=(12, 4))
plt.plot(x, y, 'k.', markersize=12, label='observations')
plt.plot(xp, mu_f, 'b', label='predictive mean')
plt.fill_between(
    xp,
    mu_f - 2 * std_y,
    mu_f + 2 * std_y,
    alpha=0.2,
    color='r',
    label='Predictive 95% region'
)
plt.xlabel('Area [$m^2$]')
plt.ylabel('Price')
plt.legend()
plt.grid()
[Figure: predictive region over the house price data]

Exercises

Problem 1.1 : Modeling Credit data with Simple Linear Regression

In this exercise, you will perform the basic steps of Bayesian linear regression. To this end, you will work with the Credit dataset that you can load with Pandas from credit_data.csv.

a) Data exploration

Load the dataset and display the first five rows. Identify the predictor ($x$) and the response ($y$) variables.

import pandas as pd

credit_df = pd.read_csv('data/credit_data.csv')
credit_df.head(5)
  • $x$ = Limit

  • $y$ = Balance

b) Data visualization

Plot the data points of Balance vs. Limit. Describe the relationship you observe. Does a linear model seem appropriate?

plt.scatter(credit_df['Limit'], credit_df['Balance'])
plt.xlabel('Limit')
plt.ylabel('Balance')
[Figure: scatter plot of Balance vs. Limit]

The relationship appears roughly linear, except for points close to zero, which seem rather arbitrarily scattered.

c) Model definition

Consider the linear model $y = ax + b + \varepsilon,\ \ \varepsilon \sim \mathcal{N}(0,\sigma^2)$. Explain the meaning of the parameters $a$, $b$, and $\sigma^2$.

  • $a$ is the slope: the change in Balance per unit of Limit

  • $b$ is the intercept: the expected Balance when Limit is zero

  • $\sigma^2$ is the noise variance: the unexplained variation in the data

d) Choosing priors

Plot the prior distribution using the functions derived in the lectures. Assume a Gaussian prior for $\mathbf{w} = [a,b]^T$: $p(\mathbf{w}) = \mathcal{N}(m,\Sigma_p)$, $\Sigma_p = 10^6 I$. What does this prior represent? How would the model change if $\Sigma_p$ were smaller, such as $\Sigma_p = 10^2 I$? How would that affect the posterior and the predictive region?

a = np.linspace(-2000, 2000, 100)
b = np.linspace(-2000, 2000, 100)

plt.figure(figsize=(10, 4))

m = np.zeros((2, 1))
S = 1e6 * np.identity(2)
plt.subplot(1, 2, 1)
glr.plot_prior_density(m, S, a, b)
plt.title(f"Prior with $\Sigma_p = 10^6 I =$ {S}")

m = np.zeros((2, 1))
S = 1e2 * np.identity(2)
plt.subplot(1, 2, 2)
glr.plot_prior_density(m, S, a, b)
plt.title(f"Prior with $\Sigma_p = 10^2 I =$ {S}")
[Figure: prior densities for the two choices of $\Sigma_p$]

Using smaller diagonal values in $\Sigma_p$ means there is less prior variance in the parameters. Less prior variance in turn means the posterior over $a$ and $b$ is more concentrated and pulled more strongly towards the prior mean, since there is less prior uncertainty.

e) Computing the posterior

Plot the likelihood and the posterior distribution using the functions derived in the lectures. Compute the posterior mean and covariance of $\mathbf{w}$. Compare and interpret the shapes of the prior, likelihood, and posterior contour plots.

x = credit_df['Limit']
y = credit_df['Balance']

m = np.zeros((2, 1))
S = 1e6 * np.identity(2) # std = 1'000
sigma2 = 1e5 # std of noise ≈ 317

# from scatter plot we expect slope 1250/10'000 ≈ 0.125:
a = np.linspace(0.12, 0.20, 100)
# from scatter plot we expect intercept between 0 and -500,
# from there we fine-tuned the numbers for a nicer plot:
b = np.linspace(-200, -400, 100)

plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
glr.plot_likelihood(x, y, sigma2, a, b)
plt.title(f"Likelihood")
plt.grid()
plt.gca().set_aspect('auto')

plt.subplot(1, 2, 2)
glr.plot_posterior_density(x, y, m, S, sigma2, a, b)
plt.title(f"Posterior")
plt.grid()
plt.gca().set_aspect('auto')
[Figure: likelihood and posterior contours]
f) Posterior credibility distribution

Using the posterior parameters, calculate the posterior predictive mean and posterior variance, $\mu_f(x)$ and $\mathrm{Var}[f(x)]$, and plot these together with the scatter plot. Explain what each component (mean and variance) represents. Explain why the posterior credibility region widens toward the edges of the data.

gaussian_linear_regression.py
def plot_posterior_credibility_region(x, y, m, S, sigma2, xp):
    mu, Sigma = compute_posterior(x, y, m, S, sigma2)
    mu_f, var_f = compute_f_posterior(xp, mu, Sigma)

    std_f = np.sqrt(var_f)
    mu_f = mu_f.reshape(-1)
    std_f = std_f.reshape(-1)

    plt.figure(figsize=(12, 4))
    plt.plot(x, y, 'k.', markersize=12, label='observations')
    plt.plot(xp, mu_f, 'b', label='mean prediction')
    plt.fill_between(xp, mu_f - 2 * std_f, mu_f + 2 * std_f, alpha=0.2, color='r', label='Posterior credibility region')
    plt.legend()
    plt.grid()
import gaussian_linear_regression as glr

x = credit_df['Limit']
y = credit_df['Balance']

m = np.zeros((2, 1))
S = 1e6 * np.identity(2) # std = 1'000
sigma2 = 1e5 # std of noise ≈ 317

xp = np.linspace(0, 14_000, 100)
glr.plot_posterior_credibility_region(x, y, m, S, sigma2, xp)
plt.xlabel('Limit')
plt.ylabel('Balance')
[Figure: posterior credibility region for the Credit data]
  • $\mu_f(x)$ is the most likely linear regression function (the mean of all possible linear functions according to our model)

  • $\mathrm{Var}[f(x)]$ is the expected uncertainty at every value of $x$

g) Predictive region

By computing the predictive distribution, plot the predictive region along with the scatter plot. Experiment with different values of $\sigma^2$ (e.g. $10^3, 10^4, 10^5, 10^6$). How does changing $\sigma^2$ affect the posterior credibility region and predictive region? Which value seems most reasonable for this dataset?

gaussian_linear_regression.py
def plot_predictive_region(x, y, m, S, sigma2, xp):
    mu, Sigma = compute_posterior(x, y, m, S, sigma2)
    mu_f, var_f = compute_f_posterior(xp, mu, Sigma)
    mu_f = mu_f.reshape(-1)

    var_y = var_f + sigma2
    std_y = np.sqrt(var_y)
    std_y = std_y.reshape(-1)

    plt.plot(x, y, 'k.', markersize=12, label='observations')
    plt.plot(xp, mu_f, 'b', label='mean prediction')
    plt.fill_between(
        xp,
        mu_f - 2 * std_y,
        mu_f + 2 * std_y,
        alpha=0.2,
        color='r',
        label='Predictive 95% region'
    )
    plt.legend()
    plt.grid()
import gaussian_linear_regression as glr

x = credit_df['Limit']
y = credit_df['Balance']

m = np.zeros((2, 1))
S = 1e6 * np.identity(2) # std = 1'000
sigmas2 = [1e3, 1e4, 1e5, 1e6] # variance of noise

plt.figure(figsize=(12,22))

xp = np.linspace(0, 14_000, 100)

for i, sigma2 in enumerate(sigmas2):
    plt.subplot(len(sigmas2), 1, i+1)

    glr.plot_predictive_region(x, y, m, S, sigma2, xp)

    plt.xlabel('Limit')
    plt.ylabel('Balance')
    plt.title(f"Noise $\sigma^2 = {sigma2:.0e}$")
[Figure: predictive regions for the four noise levels]

Using $\sigma^2 = 10^5$ looks best; it explains most of the variance and is neither too broad nor too narrow.

h) Model performance

Compute and interpret the following metrics for your model: RMSE, MAE, and $R^2$. What do these values indicate about model accuracy?

gaussian_linear_regression.py
def rmse(mu_f, y):
    return np.sqrt(np.mean((y - mu_f)**2))

def mae(mu_f, y):
    return np.mean(np.abs(y - mu_f))

def r2(mu_f, y):
    return 1 - np.mean((y - mu_f)**2) / np.var(y)
import gaussian_linear_regression as glr

x = credit_df['Limit']
y = credit_df['Balance']

m = np.zeros((2, 1))
S = 1e6 * np.identity(2) # std = 1'000
sigma2 = 1e5 # std of noise ≈ 316

mu, Sigma = glr.compute_posterior(x, y, m, S, sigma2)
mu_f, var_f = glr.compute_f_posterior(x, mu, Sigma)
mu_f = mu_f.reshape(-1)

print(f"Residual Mean Squared Error (RMSE): {glr.rmse(mu_f, y):.2f}")
print(f"Mean Absolute Error (MAE): {glr.mae(mu_f, y):.2f}")
print(f"R^2: {glr.r2(mu_f, y):.2f}")
Root Mean Squared Error (RMSE): 233.01
Mean Absolute Error (MAE): 177.27
R^2: 0.74
i) Model comparison

Suppose we add a polynomial term $x^2$ to the model. How might this affect the results? Would Bayesian regression still prevent overfitting?

A polynomial model is more powerful, but risks overfitting. Choosing a suitable Bayesian prior can act as a regularization technique to prevent overfitting.

j) Reflection

Summarize how the prior, likelihood, and posterior interact. How do $\Sigma_p$ and $\sigma^2$ influence the model predictions and uncertainty? Why is it important to interpret both the point estimates and the uncertainty bands when communicating regression results?

The prior describes our belief before seeing any data (our parameter initialization), the likelihood describes the probability of observing the given data under our current belief (i.e. parameters), and the posterior is our updated belief (parameters) after we have seen the data.

$\Sigma_p$ ($S$) describes the uncertainty of our prior belief: its diagonal values give the variance for each parameter (higher variance = higher std. = higher uncertainty).

Lastly, $\sigma^2$ describes the noise in our data (i.e. aspects our model cannot explain). Together, these determine the overall predictive uncertainty.

Bayesian linear regression with Bambi

Bambi is a high-level interface for Bayesian regression models built on top of PyMC. It uses an R-style formula interface.

For simple linear regression of body fat on BMI, one might specify

import bambi as bmb

model = bmb.Model("BodyFat ~ BMI", data=df)
results = model.fit()

Conceptually, this is equivalent to specifying the normal linear model

$$\text{BodyFat}_i \mid \beta_0,\beta_1,\sigma,\text{BMI}_i \sim \mathcal{N}(\beta_0 + \beta_1 \,\text{BMI}_i,\; \sigma^2),$$

with weakly informative priors chosen automatically for $\beta_0$, $\beta_1$, and $\sigma$.

Bambi:

  • Builds the corresponding PyMC model and runs MCMC,

  • Returns posterior samples in an InferenceData object,

  • Provides tools for posterior summaries, HDIs, and predictive intervals.

More complex models (multiple predictors, categorical variables, interactions, non-normal likelihoods) can be specified by extending the formula or choosing a different family (likelihood).

Posterior HDIs for regression coefficients

From the posterior samples of regression coefficients, we can compute highest density intervals (HDIs).

For a coefficient $\beta_j$ and a desired probability mass $p$ (e.g. 0.90 or 0.94), the $p$-HDI is an interval $[a,b]$ such that

  • $P(\beta_j \in [a,b] \mid \mathbf{y},X) = p$,

  • the posterior density inside the interval is higher than outside.

Practical interpretation:

  • If a $94\%$ HDI for $\beta_1$ does not include zero, we say that zero is not a plausible value for $\beta_1$ under the model and prior. In informal language: the effect is "significant at the 94% level".

  • If the HDI does include zero, then values on both sides (positive and negative effect) are compatible with the data and prior; we cannot confidently say the effect is non-zero.

Bambi and ArviZ compute HDIs directly from the posterior samples.
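
For example, with the results object returned by model.fit() above, a sketch of the standard ArviZ calls (hdi_prob is the probability mass $p$):

import arviz as az

print(az.summary(results, hdi_prob=0.94))  # posterior means, sds, and 94% HDIs
print(az.hdi(results, hdi_prob=0.94))      # the HDIs alone, per parameter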

Bayesian $R^2$ for regression

In classical linear regression, the coefficient of determination $R^2$ measures the fraction of variance explained by the model.

In a Bayesian setting, we can define a Bayesian $R^2$ (Gelman et al.), based on posterior samples.

For each posterior draw $s = 1,\dots,S$:

  1. Compute the vector of predicted means (fitted values)

    $$\boldsymbol{\mu}^{(s)} = \big(\mu_1^{(s)}, \dots, \mu_n^{(s)}\big)^\top, \quad \mu_i^{(s)} = \mathbb{E}[Y_i \mid x_i,\theta^{(s)}].$$
  2. Compute residuals

    $$\mathbf{e}^{(s)} = \mathbf{y} - \boldsymbol{\mu}^{(s)}.$$
  3. Compute the explained variance and residual variance for draw $s$:

    $$V_{\text{fit}}^{(s)} = \operatorname{Var}_i\big(\mu_i^{(s)}\big), \qquad V_{\text{res}}^{(s)} = \operatorname{Var}_i\big(e_i^{(s)}\big),$$

    where the variance is over data points $i = 1,\dots,n$.

Define the per-draw $R^2$ as

$$R^{2,(s)} = \frac{V_{\text{fit}}^{(s)}}{V_{\text{fit}}^{(s)} + V_{\text{res}}^{(s)}}.$$

The Bayesian $R^2$ is then summarized by the posterior distribution of $\{R^{2,(s)}\}_{s=1}^{S}$, e.g. by its mean and an HDI.

This approach acknowledges:

  • Uncertainty in the regression function (epistemic),

  • Uncertainty in the noise (aleatoric).
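
A sketch of the computation from raw draws, assuming mu_s is an $(S \times n)$ NumPy array of fitted values (one row per posterior draw) and y is the observed response vector:

import numpy as np

e_s = y[None, :] - mu_s                  # residuals for each draw
v_fit = np.var(mu_s, axis=1)             # explained variance per draw
v_res = np.var(e_s, axis=1)              # residual variance per draw
r2_s = v_fit / (v_fit + v_res)           # per-draw Bayesian R^2
print(r2_s.mean())                       # posterior mean of R^2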

Multiple linear regression

With $p$ predictors, the linear regression model generalizes to

$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \quad \varepsilon_i \sim \mathcal{N}(0,\sigma^2).$$

In matrix notation:

$$\mathbf{y} \mid \boldsymbol{\beta},\sigma,X \sim \mathcal{N}(X \boldsymbol{\beta}, \sigma^2 I_n),$$

where now $X$ is an $n \times (p+1)$ matrix (including the intercept column of ones).

Bayesian multiple linear regression:

  • Place a prior on the coefficient vector $\boldsymbol{\beta}$ and on $\sigma$,

  • Use MCMC (via PyMC/Bambi) to obtain posterior samples from $p(\boldsymbol{\beta},\sigma \mid \mathbf{y},X)$,

  • Summarize and interpret parameter and predictive uncertainty.

Adding predictors generally:

  • Increases the potential explained variance,

  • Increases epistemic uncertainty (more parameters to estimate),

  • Risks overfitting if not regularized or if model is too flexible for the data size.

Categorical predictors and interactions

Categorical predictors (e.g. Student yes/no, or Ethnicity with several categories) are incorporated via dummy variables.

Example: binary predictor Student (0 = no, 1 = yes):

$$\mathbb{E}[Y \mid \text{Limit},\text{Student}] = \beta_0 + \beta_1 \,\text{Limit} + \beta_2 \,\text{Student}.$$
  • $\beta_2$ shifts the mean response up or down for students compared to non-students.

Interactions between predictors allow the effect of one predictor to depend on another.

For two continuous predictors $x_1$ (TV budget) and $x_2$ (radio budget), a model with interaction is

$$\mathbb{E}[Y \mid x_1,x_2] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 (x_1 x_2).$$
  • $\beta_3$ captures synergy (if positive) or redundancy (if negative) between channels.

In Bambi’s formula syntax:

  • Main effects: Sales ~ TV + radio

  • With interaction: Sales ~ TV * radio

where TV * radio expands to TV + radio + TV:radio.
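
A sketch of the two equivalent specifications (assuming a DataFrame ads with Sales, TV, and radio columns):

import bambi as bmb

m_full = bmb.Model("Sales ~ TV * radio", data=ads)             # shorthand
m_long = bmb.Model("Sales ~ TV + radio + TV:radio", data=ads)  # explicit expansion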

Robust linear regression with Student's $t$ likelihood

The normal likelihood is sensitive to outliers: large residuals are heavily penalized, which can pull the regression line towards outlying points.

To reduce sensitivity to outliers, we can model residuals with a Student's $t$ distribution:

$$Y_i \mid \mu_i,\sigma,\nu \sim t_{\nu}(\mu_i, \sigma), \qquad \mu_i = \beta_0 + \beta_1 x_i + \dots,$$

where

  • $\nu > 0$ is the degrees of freedom parameter (shape),

  • smaller $\nu$ yields heavier tails than the normal distribution.

As $\nu \to \infty$, the Student's $t$ distribution converges to the normal:

$$t_{\nu}(\mu,\sigma) \;\xrightarrow[\nu \to \infty]{}\; \mathcal{N}(\mu,\sigma^2).$$

A Bayesian robust regression model might use:

  • Normal priors for coefficients $\boldsymbol{\beta}$,

  • A prior on $\sigma > 0$ (half-normal, exponential, etc.),

  • A prior on $\nu$ (e.g. $\nu \sim \operatorname{Gamma}(s,r)$).

Compared to the normal-likelihood model, the $t$-likelihood:

  • Gives less weight to extreme residuals,

  • Produces regression lines less distorted by outliers,

  • Often yields broader predictive intervals, reflecting uncertainty about outliers.
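
A PyMC sketch of such a robust model (arrays x and y assumed; the particular priors are illustrative):

import pymc as pm

with pm.Model() as robust:
    beta0 = pm.Normal("beta0", mu=0, sigma=10)
    beta1 = pm.Normal("beta1", mu=0, sigma=10)
    sigma = pm.HalfNormal("sigma", sigma=1)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)     # degrees of freedom
    mu = beta0 + beta1 * x
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
    idata = pm.sample()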

Polynomial regression

To capture simple nonlinear relationships between a single predictor $x$ and response $y$, we can use polynomial regression.

For a degree-3 polynomial:

$$\mathbb{E}[Y \mid X = x] = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3.$$

This is still linear in the parameters $\boldsymbol{\beta}$, so it fits into the linear regression framework by defining transformed predictors:

$$z_{i1} = x_i,\quad z_{i2} = x_i^2,\quad z_{i3} = x_i^3.$$

However:

  • Polynomial terms can be strongly correlated (e.g. $x$ and $x^2$), causing numerical issues.

  • High-degree polynomials can overfit heavily, especially with few data points.

Bambi’s poly(x, d) construct can create orthogonal polynomials, which reduce the correlations between polynomial terms and improve sampling stability.
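
A sketch contrasting raw and orthogonal polynomial terms in Bambi's formula syntax (assuming a DataFrame df with columns x and y, and that the formula backend supports the I() transform for embedding arithmetic):

import bambi as bmb

raw = bmb.Model("y ~ x + I(x**2) + I(x**3)", data=df)  # raw, correlated terms
orth = bmb.Model("y ~ poly(x, 3)", data=df)            # orthogonal polynomial basis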

Overfitting and epistemic uncertainty in polynomial models

With few data points and a highly flexible model (e.g. high-degree polynomial), the model can fit the training data very closely, but:

  • The regression function becomes wiggly and unstable,

  • Posterior predictive intervals can become very wide away from data points,

  • Epistemic uncertainty increases, reflecting that many different polynomials are compatible with the data.

Bayesian inference:

  • Makes overfitting visible in the posterior distribution of functions and predictive intervals,

  • Encourages the use of priors (shrinkage) to regularize complex models,

  • Allows the use of model comparison (e.g. Bayes factors, ELPD) to decide whether increased flexibility is justified.

Shrinkage via normal priors and L2 regularisation

Consider a linear regression model with predictors $X$ (possibly including polynomial terms) and coefficients $\boldsymbol{\beta}$. Suppose we place an independent normal prior on each coefficient:

$$\beta_j \sim \mathcal{N}(0,\tau^2), \quad j = 1,\dots,p,$$

and a normal likelihood

$$\mathbf{y} \mid \boldsymbol{\beta},\sigma,X \sim \mathcal{N}(X\boldsymbol{\beta},\sigma^2 I_n).$$

Ignoring constants, the log posterior is

$$\log p(\boldsymbol{\beta} \mid \mathbf{y},X,\sigma) = \log p(\mathbf{y} \mid \boldsymbol{\beta},\sigma,X) + \log p(\boldsymbol{\beta}) + \text{const}.$$

Up to additive constants, this becomes

$$\log p(\boldsymbol{\beta} \mid \mathbf{y},X,\sigma) = -\frac{1}{2\sigma^2} \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 -\frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2 + \text{const}.$$

The maximum a posteriori (MAP) estimate $\hat{\boldsymbol{\beta}}_{\text{MAP}}$ maximizes this log posterior, i.e. minimizes

$$\frac{1}{2\sigma^2} \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \frac{1}{2\tau^2} \|\boldsymbol{\beta}\|_2^2.$$

Up to a constant factor, this is equivalent to ridge regression (L2 regularisation):

$$\hat{\boldsymbol{\beta}}_{\text{ridge}} = \arg\min_{\boldsymbol{\beta}} \Big\{ \|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_2^2 \Big\},$$

with penalty parameter $\lambda = \sigma^2 / \tau^2$.

Thus:

  • L2 regularisation corresponds to a normal prior on coefficients centered at zero.

  • Stronger prior shrinkage (smaller $\tau$) corresponds to larger $\lambda$ and stronger penalization of large coefficients.
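
A small numerical sketch of this equivalence: the posterior mean with $m = 0$, $S = \tau^2 I$ (formula from the analytical-posterior section) coincides with the ridge solution for $\lambda = \sigma^2/\tau^2$ (the data here is synthetic, for illustration only):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

sigma2, tau2 = 0.25, 4.0
lam = sigma2 / tau2

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
Sigma_inv = X.T @ X / sigma2 + np.eye(3) / tau2          # posterior precision
beta_map = np.linalg.solve(Sigma_inv, X.T @ y / sigma2)  # posterior mean = MAP
print(np.allclose(beta_ridge, beta_map))                 # True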

L1 regularisation and Laplace prior

Analogously, L1 regularisation (lasso) corresponds to a Laplace (double-exponential) prior on the coefficients.

The Laplace prior for a scalar coefficient $\beta$ with scale $b > 0$ has density

$$p(\beta) = \frac{1}{2b} \exp\!\left(-\frac{|\beta|}{b}\right).$$

Ignoring constants, the log prior is

$$\log p(\beta) = -\frac{|\beta|}{b} + \text{const},$$

so for a vector of coefficients $\boldsymbol{\beta}$ with independent Laplace priors, the negative log prior adds a term proportional to $\|\boldsymbol{\beta}\|_1$.

The MAP estimator then minimizes

$$\|\mathbf{y} - X\boldsymbol{\beta}\|_2^2 + \lambda \|\boldsymbol{\beta}\|_1,$$

which is the lasso objective.
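
A sketch of the Laplace-prior MAP as a direct optimisation (synthetic data; the scale b and noise variance sigma2 are assumptions; a derivative-free method is used because of the kink in $|\beta|$):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, 0.0, -0.8]) + rng.normal(scale=0.5, size=50)

def neg_log_post(beta, sigma2=0.25, b=0.2):
    # negative log posterior: squared-error term plus L1 term from the Laplace prior
    resid = y - X @ beta
    return resid @ resid / (2 * sigma2) + np.abs(beta).sum() / b

beta_map = minimize(neg_log_post, np.zeros(3), method="Powell").x
print(beta_map)   # the near-zero true coefficient is shrunk strongly towards zero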

In summary:

  • Normal prior on $\boldsymbol{\beta}$ $\Rightarrow$ ridge (L2) penalty at MAP.

  • Laplace prior on $\boldsymbol{\beta}$ $\Rightarrow$ lasso (L1) penalty at MAP.

Regularisation can therefore be understood as a natural consequence of using informative priors in Bayesian models.

Model selection versus regularisation

When dealing with flexible models (e.g. polynomials of various degrees), there are two complementary strategies:

  1. Model selection
    Compare several models (e.g. linear, quadratic, cubic) using:

    • Bayes factors (marginal likelihoods),

    • ELPD / LOO,

    • Posterior predictive checks,

    • Predictive RMSE/MAE.

    Then choose the model that best balances fit and complexity.

  2. Regularisation (shrinkage)
    Use a richer model (e.g. higher-degree polynomial) but place strong priors that shrink higher-order coefficients towards zero. This keeps the model flexible but penalizes unnecessary complexity.

Often, a combination of both is used:

  • Start with a simple model,

  • Add complexity if strongly supported by posterior predictive checks and model comparison,

  • Use priors to regularize complex models and avoid overfitting.

In Bayesian regression, regularisation is not an ad-hoc fix but an integral part of the modelling philosophy through the choice of priors.