
Model Checks, Model Selection, Multivariate Distributions

Posterior predictive checks

Bayesian models are generative: once we have a posterior $p(\theta \mid y)$, we can simulate new data sets from the model.

For observed data $y$ and future (or replicated) data $y_{\text{rep}}$, the posterior predictive distribution is

$$
p(y_{\text{rep}} \mid y) = \int p(y_{\text{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta.
$$

A posterior predictive check compares:

  • the observed data $y$, and

  • many replicated datasets $y_{\text{rep}}^{(1)},\dots,y_{\text{rep}}^{(S)}$ drawn from $p(y_{\text{rep}} \mid y)$.

If the simulated data are, on average, very different from the observed data (for example, in mean, variance, tails, or shape), then the likelihood model $p(y \mid \theta)$ is likely misspecified.
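As a minimal sketch of these mechanics (all names and numbers here are illustrative; in practice the posterior draws would come from the fitted model), one can draw replicated data sets from the posterior predictive and compare a test statistic such as the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: observed counts and posterior draws of a Poisson rate
# lambda under a conjugate Gamma(1, 1) prior (stand-in for real MCMC output).
y_obs = rng.poisson(10, size=100)
lambda_draws = rng.gamma(1 + y_obs.sum(), 1 / (1 + len(y_obs)), size=2000)

# One replicated data set per posterior draw: y_rep^(s) ~ p(y_rep | theta^(s)).
y_rep = rng.poisson(lambda_draws[:, None], size=(len(lambda_draws), len(y_obs)))

# Compare a test statistic (here the variance) between replicates and data.
T_rep = y_rep.var(axis=1)
T_obs = y_obs.var()
ppp = (T_rep >= T_obs).mean()   # posterior predictive p-value for the variance
print(f"observed var = {T_obs:.2f}, ppp = {ppp:.2f}")
```

A posterior predictive p-value near 0 or 1 signals that the observed statistic sits in the tails of what the model can reproduce.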

Important points from the slides:

  • Priors “fade away” as we collect more data; the likelihood stays.

  • If the likelihood is structurally wrong, it will stay wrong no matter how much data we collect.

  • The goal is not to find a perfect model (none exists), but a model whose predictions are compatible with the data.

Poisson likelihood and its limitations for count data

For count data $Y \in \{0,1,2,\dots\}$, a common starting point is the Poisson distribution:

$$
P(Y = k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0,1,2,\dots,\; \lambda > 0.
$$

Moments:

  • Mean: $\mathbb{E}[Y \mid \lambda] = \lambda$

  • Variance: $\operatorname{Var}(Y \mid \lambda) = \lambda$

The key limitation:

  • The Poisson forces mean and variance to be equal.

  • Many real count data sets show overdispersion:

    $\operatorname{Var}(Y) > \mathbb{E}[Y].$

Posterior predictive checks can reveal this: simulated data from the Poisson model may have too low variance compared to the observed data.

The negative binomial distribution

The negative binomial distribution is a flexible alternative to the Poisson for count data with overdispersion.

Classical interpretation:

  • Consider repeated Bernoulli trials with success probability $p$.

  • Let $K$ be the number of failures observed before the $r$-th success.

Then $K$ follows a negative binomial distribution with parameters $(r, p)$:

$$
P(K = k \mid r, p) = \binom{k + r - 1}{k}\, (1-p)^k\, p^r, \quad k = 0,1,2,\dots,\; r > 0,\; 0 < p < 1.
$$

Moments in this parameterization:

  • Mean: $\mathbb{E}[K \mid r,p] = r\, \dfrac{1-p}{p}$

  • Variance: $\operatorname{Var}(K \mid r,p) = r\, \dfrac{1-p}{p^2}$.

We can reparameterize in terms of a mean $\lambda$ and an overdispersion parameter $r$.

Set

$$
\lambda = r\, \frac{1-p}{p}, \quad \text{so that} \quad p = \frac{r}{r + \lambda}.
$$

Substituting into the pmf yields an equivalent negative binomial distribution with parameters $(r, \lambda)$,

$$
P(K = k \mid r, \lambda) = \binom{k + r - 1}{k} \left( \frac{\lambda}{\lambda + r} \right)^{k} \left( \frac{r}{\lambda + r} \right)^{r}.
$$

In this form,

  • Mean: $\mathbb{E}[K \mid r,\lambda] = \lambda$

  • Variance: $\operatorname{Var}(K \mid r,\lambda) = \lambda + \dfrac{\lambda^2}{r}$.

Thus, for finite $r$, we have overdispersion:

$$
\operatorname{Var}(K \mid r,\lambda) > \mathbb{E}[K \mid r,\lambda].
$$

In the limit $r \to \infty$, the variance approaches the mean and the negative binomial distribution converges to a Poisson distribution with rate $\lambda$.

This makes the negative binomial a natural generalization of the Poisson that allows the variance to be separately controlled via $r$.
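A quick numerical check of this reparameterization, using scipy.stats.nbinom (whose `n` and `p` arguments play the roles of $r$ and $p = r/(r+\lambda)$ above):

```python
from scipy.stats import nbinom

lam, r = 10.0, 2.5          # target mean and dispersion parameter
p = r / (r + lam)           # p = r / (r + lambda)

dist = nbinom(n=r, p=p)     # scipy's (n, p) matches the (r, p) form above
print(dist.mean())          # ~ lam              (= 10.0)
print(dist.var())           # ~ lam + lam**2 / r (= 50.0)
```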

Posterior predictive checks: Poisson vs negative binomial

In the maternity-ward example (beds occupied per night):

  • A Poisson model might underestimate the variance of bed counts.

  • A negative binomial model allows the variance to exceed the mean and can better match the data.

Posterior predictive checks reveal that:

  • The Poisson model’s replicated data tend to have too little dispersion.

  • The negative binomial model’s replicated data better match both the mean and the spread of observed counts.

However:

  • Adding an extra parameter (the dispersion $r$ or $\alpha$) increases epistemic uncertainty.

  • There is a trade-off: more flexible models fit data better but are harder to estimate with limited data.
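A sketch of how such a comparison could be set up in PyMC and ArviZ; the data and priors below are illustrative stand-ins, not the actual maternity-ward data:

```python
import numpy as np
import pymc as pm
import arviz as az

# Synthetic stand-in for the observed counts of beds occupied per night.
beds = np.random.default_rng(1).negative_binomial(5, 0.3, size=120)

with pm.Model() as poisson_model:
    lam = pm.Exponential("lam", 0.1)
    pm.Poisson("y", mu=lam, observed=beds)
    idata_pois = pm.sample()
    idata_pois.extend(pm.sample_posterior_predictive(idata_pois))

with pm.Model() as negbin_model:
    lam = pm.Exponential("lam", 0.1)
    alpha = pm.Exponential("alpha", 0.1)   # dispersion (the r above)
    pm.NegativeBinomial("y", mu=lam, alpha=alpha, observed=beds)
    idata_nb = pm.sample()
    idata_nb.extend(pm.sample_posterior_predictive(idata_nb))

# Graphical posterior predictive checks: the Poisson replicates are typically
# too narrow, while the negative binomial replicates match the spread better.
az.plot_ppc(idata_pois)
az.plot_ppc(idata_nb)
```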

Model selection: general ideas

Model selection asks: which model family explains the data best?

Typical desiderata:

  • Good fit to the observed data.

  • Good predictive performance on unseen data.

  • Avoiding unnecessary complexity (overfitting).

  • Reasonable interpretability.

In a Bayesian framework, model selection can be approached in several ways:

  1. Posterior model probabilities via marginal likelihoods and Bayes factors.

  2. Expected log-predictive density (ELPD), typically via cross-validation.

  3. Predictive-error metrics (RMSE, MAE) generalized to the Bayesian setting.

All three approaches are complementary and emphasize different aspects of model quality.

Approach 1: Bayesian model comparison via marginal likelihood

Treat each model $M_i$ as a hypothesis. For each model:

  • Parameters: $\theta_i$,

  • Prior: $p(\theta_i \mid M_i)$,

  • Likelihood: $p(y \mid \theta_i, M_i)$.

The marginal likelihood (or model evidence) is

$$
p(y \mid M_i) = \int p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.
$$

Given prior model probabilities $P(M_i)$, the posterior model probability is

$$
P(M_i \mid y) = \frac{p(y \mid M_i)\, P(M_i)}{\sum_j p(y \mid M_j)\, P(M_j)}.
$$

For two models $M_0$ and $M_1$:

  • Prior odds: $\dfrac{P(M_1)}{P(M_0)}$;

  • Posterior odds: $\dfrac{P(M_1 \mid y)}{P(M_0 \mid y)} = \operatorname{BF}_{10}\, \dfrac{P(M_1)}{P(M_0)}$;

  • Bayes factor: $\operatorname{BF}_{10} = \dfrac{p(y \mid M_1)}{p(y \mid M_0)}$.

Bayes factors quantify how much the data shift our odds between models.
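As a small worked illustration (not from the slides): for a binomial observation with a conjugate Beta prior, the marginal likelihood is available in closed form, so a Bayes factor against a fixed-probability model can be computed directly. All numbers below are made up.

```python
from math import comb
import numpy as np
from scipy.special import betaln

def log_marginal_binomial(y, n, a, b):
    """log p(y | M) for y ~ Binomial(n, theta) with theta ~ Beta(a, b)."""
    return np.log(comb(n, y)) + betaln(y + a, n - y + b) - betaln(a, b)

y, n = 7, 10
# M0: theta fixed at 0.5 (point hypothesis), so the marginal likelihood is
# just the binomial pmf at theta = 0.5.
log_m0 = np.log(comb(n, y)) + y * np.log(0.5) + (n - y) * np.log(0.5)
# M1: theta ~ Beta(1, 1) (uniform prior).
log_m1 = log_marginal_binomial(y, n, 1.0, 1.0)

BF10 = np.exp(log_m1 - log_m0)
print(f"BF_10 = {BF10:.2f}")
```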

Marginal likelihood: trade-off between accuracy and complexity

The log marginal likelihood admits a useful decomposition that highlights a trade-off between accuracy and complexity.

For a model $M$ with parameters $\theta$, prior $p(\theta \mid M)$, likelihood $p(y \mid \theta, M)$, and posterior $p(\theta \mid y, M)$, we have

$$
\log p(y \mid M) = \mathbb{E}_{\theta \mid y,M}\big[\log p(y \mid \theta,M)\big] - \operatorname{KL}\big( p(\theta \mid y,M) \,\big\|\, p(\theta \mid M) \big),
$$

where

  • $\mathbb{E}_{\theta \mid y,M}[\cdot]$ is the expectation with respect to the posterior $p(\theta \mid y,M)$,

  • $\operatorname{KL}(p \,\|\, q)$ is the Kullback–Leibler divergence,

    $\operatorname{KL}(p \,\|\, q) = \int p(\theta)\, \log \frac{p(\theta)}{q(\theta)}\, d\theta.$

Interpretation:

  • Accuracy term

    $\mathbb{E}_{\theta \mid y,M}\big[\log p(y \mid \theta,M)\big]$

    measures how well the model fits the data on average under the posterior.

  • Complexity term

    $\operatorname{KL}\big( p(\theta \mid y,M) \,\big\|\, p(\theta \mid M) \big)$

    measures the information gain from prior to posterior (always non-negative).

Models with very weak priors and many parameters tend to have:

  • high potential accuracy,

  • but also high complexity (large information gain),

  • which can reduce the marginal likelihood.

Hence, the marginal likelihood automatically implements an Occam’s razor: it balances fit against complexity.
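This identity can be verified numerically in a conjugate example. The sketch below (an added illustration with made-up numbers) uses a Beta-Binomial model, where the log evidence, the posterior expectation of the log-likelihood, and the prior-posterior KL divergence are all easy to evaluate:

```python
from math import comb
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad
from scipy.special import betaln

y, n, a, b = 7, 10, 2.0, 2.0          # data and Beta(a, b) prior
prior = beta(a, b)
post = beta(a + y, b + n - y)         # conjugate posterior

def loglik(theta):
    return np.log(comb(n, y)) + y * np.log(theta) + (n - y) * np.log(1 - theta)

# Accuracy term: E_post[log p(y | theta)]
accuracy, _ = quad(lambda t: post.pdf(t) * loglik(t), 0, 1)

# Complexity term: KL(posterior || prior)
kl, _ = quad(lambda t: post.pdf(t) * (post.logpdf(t) - prior.logpdf(t)), 0, 1)

# Exact log marginal likelihood of the Beta-Binomial model
log_evidence = np.log(comb(n, y)) + betaln(a + y, b + n - y) - betaln(a, b)

print(accuracy - kl, log_evidence)    # the two numbers agree
```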

Approach 2: Expected log-predictive density (ELPD)

An alternative view focuses on predictive performance: how well does a model predict new, unseen data?

Let $y_{\text{new}}$ be a future observation. The posterior predictive density is

$$
p(y_{\text{new}} \mid y, M) = \int p(y_{\text{new}} \mid \theta, M)\, p(\theta \mid y, M)\, d\theta.
$$

The expected log-predictive density (ELPD) of a model $M$ is

$$
\operatorname{ELPD}(M) = \mathbb{E}_{y_{\text{new}}}\big[ \log p(y_{\text{new}} \mid y, M) \big],
$$

where the expectation is taken over hypothetical new data $y_{\text{new}}$ from the (unknown) data-generating process.

  • A larger ELPD (closer to zero) indicates better predictive performance.

  • Very negative ELPD indicates poor predictions (assigning low probability to typical future data).

In practice, since the true distribution of $y_{\text{new}}$ is unknown, we approximate ELPD using cross-validation, most commonly Leave-One-Out (LOO):

  • For each observed data point $y_i$, we pretend it is “new” and compute its predictive density given all other data $y_{-i}$:

    $p(y_i \mid y_{-i}, M).$

  • The LOO ELPD is approximated as

    $\widehat{\operatorname{ELPD}}_{\text{LOO}}(M) = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}, M).$

Computing this naïvely would require running an MCMC fit $n$ times, once for each $y_{-i}$. Modern practice uses approximations like PSIS-LOO (Pareto-smoothed importance sampling) or WAIC, both available in PyMC / ArviZ.
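For models fitted with PyMC, these estimates are available through ArviZ. A sketch, reusing the `idata_pois` and `idata_nb` objects from the earlier Poisson vs negative binomial example and assuming the pointwise log-likelihood has been stored:

```python
import arviz as az

# Requires the pointwise log-likelihood in the InferenceData, e.g. via
# pm.compute_log_likelihood(idata) or idata_kwargs={"log_likelihood": True}.
loo_pois = az.loo(idata_pois)   # PSIS-LOO estimate of ELPD for the Poisson model
loo_nb = az.loo(idata_nb)       # ... and for the negative binomial model

# Side-by-side comparison, ranked by estimated ELPD.
print(az.compare({"poisson": idata_pois, "negbinomial": idata_nb}))
```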

Approach 3: Bayesian RMSE and MAE

Classical (frequentist) predictive metrics for a model with point predictions $\hat{y}_i$ are:

  • Root mean squared error (RMSE):

    $\operatorname{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$

  • Mean absolute error (MAE):

    $\operatorname{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$

In a Bayesian model, there is not a single point prediction per data point, but a predictive distribution. Using posterior predictive simulation, we can generate:

  • for each observed $y_i$ and MCMC draw $s = 1,\dots,S$, a predictive draw $y_i^{(s)}$.

We can generalize RMSE and MAE by averaging over both data points and posterior samples:

  • Bayesian RMSE:

    $\operatorname{RMSE}_{\text{Bayes}} = \sqrt{\dfrac{1}{S\,n} \sum_{s=1}^{S} \sum_{i=1}^{n} \big(y_i - y_i^{(s)}\big)^2}.$

  • Bayesian MAE:

    $\operatorname{MAE}_{\text{Bayes}} = \dfrac{1}{S\,n} \sum_{s=1}^{S} \sum_{i=1}^{n} \big|y_i - y_i^{(s)}\big|.$

These metrics evaluate how close, on average, the predictive distribution is to the observed data. They emphasize typical prediction error, which can sometimes lead to different conclusions than marginal likelihoods or ELPD.
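Given a matrix of posterior predictive draws, both metrics are short NumPy computations; a sketch assuming `y_rep` has shape `(S, n)`, one row per posterior draw, as in the earlier posterior predictive check example:

```python
import numpy as np

def bayes_rmse(y_obs, y_rep):
    """y_obs: shape (n,); y_rep: posterior predictive draws, shape (S, n)."""
    return np.sqrt(np.mean((y_rep - y_obs) ** 2))   # average over draws and points

def bayes_mae(y_obs, y_rep):
    return np.mean(np.abs(y_rep - y_obs))
```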

Summary of model selection approaches

The three approaches emphasize different aspects of model quality:

  1. Marginal likelihood / Bayes factors (Approach 1)

    • Principled fully Bayesian comparison of models.

    • Automatically penalizes complexity via the prior–posterior KL divergence.

    • Sensitive to prior choices and sometimes hard or unstable to compute.

  2. ELPD / LOO / WAIC (Approach 2)

    • Focuses on predictive accuracy for new data.

    • Based on the posterior predictive distribution and cross-validation ideas.

    • Widely used in practice (e.g. via PSIS-LOO).

  3. Bayesian RMSE/MAE (Approach 3)

    • Generalizes familiar predictive-error metrics to the Bayesian setting.

    • Emphasizes typical prediction error rather than full probabilistic fit or tail behavior.

Apparent disagreements between these approaches in small data sets are not contradictions; they simply reflect that different questions are being asked about model quality.

Multivariate Bayesian problems

Many Bayesian models involve multiple parameters and/or vector-valued data. For example:

  • A normal likelihood with unknown mean $\mu$ and standard deviation $\sigma$.

  • A regression model with several regression coefficients and a noise standard deviation.

  • Categorical or multinomial outcomes with multiple category probabilities.

In multivariate settings, we care about:

  • the joint posterior distribution over all parameters,

  • dependencies and correlations between parameters,

  • and appropriate multivariate priors and likelihoods.

Covariance

For two random variables $X$ and $Y$, the covariance is

$$
\operatorname{Cov}(X,Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].
$$

Intuition:

  • If $X$ tends to be above its mean when $Y$ is above its mean (and below when $Y$ is below), $\operatorname{Cov}(X,Y)$ is positive.

  • If $X$ tends to be above its mean when $Y$ is below its mean (and vice versa), the covariance is negative.

  • If $X$ and $Y$ are unrelated (no linear association), the covariance is close to zero.

Basic properties:

  • $\operatorname{Cov}(X,X) = \operatorname{Var}(X) \ge 0$.

  • $\operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X)$.

Empirical covariance and covariance matrix

Given data $(x_i, y_i)$ for $i = 1,\dots,n$, the empirical covariance is

$$
\widehat{\operatorname{Cov}}(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x_i$ and $y_i$.

For $p$ features arranged in an $n \times p$ data matrix $X$ (rows are observations, columns are features):

  1. Demean each column: subtract the column means to get a centered matrix $\tilde{X}$.

  2. The empirical covariance matrix is

    $\hat{\Sigma} = \dfrac{1}{n-1} \tilde{X}^{\top} \tilde{X},$

    which is a $p \times p$ symmetric, positive semi-definite matrix.

Diagonal entries are the sample variances of each feature; off-diagonal entries are sample covariances between features.
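A sketch of this computation in NumPy, checked against np.cov (the data are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # n = 200 observations, p = 3 features

X_tilde = X - X.mean(axis=0)           # demean each column
Sigma_hat = X_tilde.T @ X_tilde / (X.shape[0] - 1)

# np.cov expects variables in rows by default, hence rowvar=False.
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # True
```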

Correlation and correlation matrix

Covariance depends on the units of measurement. To obtain a dimensionless measure of linear association, we use the Pearson correlation coefficient.

For random variables $X$ and $Y$:

$$
\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}, \quad -1 \le \rho_{X,Y} \le 1.
$$

  • $\rho_{X,Y} \approx 1$: strong positive linear relationship.

  • $\rho_{X,Y} \approx -1$: strong negative linear relationship.

  • $\rho_{X,Y} \approx 0$: little or no linear relationship.

Empirically, we replace covariance and variances by their sample counterparts to get the sample correlation.

A correlation matrix is obtained by standardizing the covariance matrix: diagonal entries are 1, off-diagonal entries are pairwise correlations between variables.
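A short illustration: standardizing the empirical covariance matrix reproduces what np.corrcoef computes directly (again with random, illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

Sigma_hat = np.cov(X, rowvar=False)    # empirical covariance matrix
sd = np.sqrt(np.diag(Sigma_hat))       # sample standard deviations
R_hat = Sigma_hat / np.outer(sd, sd)   # R = D^{-1} Sigma D^{-1}

print(np.allclose(R_hat, np.corrcoef(X, rowvar=False)))  # True
```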

Multinomial likelihood: generalizing the binomial

The binomial distribution models the number of successes in $n$ trials with success probability $\pi$ and two possible outcomes (success/failure).

For more than two mutually exclusive categories, we use the multinomial distribution.

Let there be $K$ categories with probabilities $\boldsymbol{\pi} = (\pi_1,\dots,\pi_K)$, where

$$
\pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.
$$

Suppose we conduct $n$ independent trials, and let $\boldsymbol{k} = (k_1,\dots,k_K)$ be the counts in each category, with $\sum_{k=1}^{K} k_k = n$. The multinomial pmf is

$$
P(\boldsymbol{k} \mid \boldsymbol{\pi}) = \frac{n!}{k_1! \cdots k_K!} \prod_{k=1}^{K} \pi_k^{k_k}.
$$

This is the natural likelihood model for:

  • election polls with multiple candidates,

  • survey responses on multiple-choice scales,

  • multi-class classification counts,

  • counts of different defect types, etc.
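A minimal example of evaluating this pmf and simulating counts with scipy.stats.multinomial (the probabilities and counts are made up):

```python
from scipy.stats import multinomial

pi = [0.5, 0.3, 0.2]            # category probabilities, sum to 1
n = 10

dist = multinomial(n, pi)
print(dist.pmf([5, 3, 2]))      # probability of observing these exact counts
print(dist.rvs(size=5))         # five simulated count vectors
```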

Dirichlet prior: generalizing the beta distribution

For multinomial problems, the natural conjugate prior for the category probabilities $\boldsymbol{\pi}$ is the Dirichlet distribution.

With concentration parameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)$, where $\alpha_k > 0$, the Dirichlet density is

$$
p(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k,
$$

on the simplex

$$
\left\{ \boldsymbol{\pi} \in \mathbb{R}^K \,\Big|\, \pi_k \ge 0,\; \sum_{k=1}^{K} \pi_k = 1 \right\}.
$$

Moments:

  • Mean of each component: $\mathbb{E}[\pi_k] = \dfrac{\alpha_k}{\alpha_0}$.

Conjugacy with the multinomial likelihood:

  • Prior: $\boldsymbol{\pi} \sim \operatorname{Dirichlet}(\boldsymbol{\alpha})$.

  • Likelihood: $\boldsymbol{k} \mid \boldsymbol{\pi} \sim \operatorname{Multinomial}(n, \boldsymbol{\pi})$.

  • Posterior:

    $\boldsymbol{\pi} \mid \boldsymbol{k} \sim \operatorname{Dirichlet}(\boldsymbol{\alpha} + \boldsymbol{k}),$

    i.e. we simply add the counts to the prior parameters.

The Dirichlet distribution is the multivariate generalization of the beta distribution (which corresponds to $K = 2$).
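Because the posterior update only adds the observed counts to the prior parameters, it can be written down and sampled directly; a small sketch with illustrative numbers:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior (uniform on the simplex)
counts = np.array([42, 31, 27])       # observed category counts, n = 100

alpha_post = alpha + counts           # Dirichlet posterior parameters

# Posterior mean of each category probability: alpha_k / alpha_0
print(alpha_post / alpha_post.sum())

# Draws from the posterior, e.g. for credible intervals on pi_1
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha_post, size=4000)
print(np.percentile(samples[:, 0], [2.5, 97.5]))
```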

Multivariate normal distribution

The multivariate normal (Gaussian) distribution extends the univariate normal to $\mathbb{R}^d$.

A random vector $\mathbf{X} \in \mathbb{R}^d$ has a multivariate normal distribution with mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and covariance matrix $\Sigma$ (symmetric, positive definite), written $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, if its density is

$$
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right).
$$

Properties:

  • Marginals of a multivariate normal are normal.

  • Any linear combination $a^{\top} \mathbf{X}$ is normal.

  • The covariance matrix $\Sigma$ encodes variances (diagonal) and covariances (off-diagonal).

In Bayesian models:

  • Multivariate normal distributions appear as likelihoods for multivariate continuous data.

  • They also appear as priors for parameter vectors (e.g. regression coefficients).
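A short illustration with scipy.stats.multivariate_normal (the mean vector and covariance matrix are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])        # symmetric, positive definite

dist = multivariate_normal(mean=mu, cov=Sigma)
x = dist.rvs(size=1000, random_state=0)

print(dist.pdf([0.0, 1.0]))           # density at the mean
print(np.cov(x, rowvar=False))        # close to Sigma for large samples
```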

LKJ prior for correlation matrices

In multivariate normal models, we often want a prior on the correlation matrix $R$ rather than directly on the covariance matrix $\Sigma$.

We can write

$$
\Sigma = D\, R\, D,
$$

where $D$ is a diagonal matrix with standard deviations $\sigma_1,\dots,\sigma_d$ on the diagonal, and $R$ is a $d \times d$ correlation matrix (ones on the diagonal, off-diagonal entries between $-1$ and $1$).

The Lewandowski–Kurowicka–Joe (LKJ) distribution is a flexible prior for correlation matrices $R$:

  • Parameter: $\eta > 0$ (shape).

  • The density (for $d \ge 2$) is proportional to

    $p(R \mid \eta) \propto |R|^{\eta - 1},$

    up to a normalization constant that depends on $d$ and $\eta$.

Interpretation:

  • $\eta = 1$: roughly uniform over correlation matrices (no preference for any correlation structure).

  • $\eta > 1$: prior mass is concentrated near the identity matrix (weak correlations).

  • $\eta < 1$: prior mass favors strong correlations.

A common strategy:

  1. Put priors on the standard deviations $\sigma_i$ (e.g. half-normal or exponential).

  2. Put an LKJ prior on the correlation matrix $R$.

  3. Construct $\Sigma = D R D$ (sometimes via a Cholesky factorization for numerical stability).

This combination is sometimes referred to as an LKJ covariance prior.
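In PyMC this strategy is packaged as pm.LKJCholeskyCov. A sketch for bivariate data (the data, prior settings, and dimensions are illustrative):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
# Synthetic bivariate data standing in for real observations.
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=200)

with pm.Model() as mvn_model:
    # LKJ prior on the correlation structure, exponential priors on the sds,
    # returned as a Cholesky factor for numerical stability.
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    mu = pm.Normal("mu", 0.0, 1.0, shape=2)
    pm.MvNormal("y", mu=mu, chol=chol, observed=data)
    idata = pm.sample()
```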