
Model Checks, Model Selection, Multivariate Distributions

Posterior predictive checks

Bayesian models are generative: once we have a posterior $p(\theta \mid y)$, we can simulate new data sets from the model.

For observed data $y$ and future (or replicated) data $y_{\text{rep}}$, the posterior predictive distribution is

$$
p(y_{\text{rep}} \mid y) = \int p(y_{\text{rep}} \mid \theta)\, p(\theta \mid y)\, d\theta.
$$

A posterior predictive check compares:

  • the observed data $y$, and

  • many replicated datasets $y_{\text{rep}}^{(1)},\dots,y_{\text{rep}}^{(S)}$ drawn from $p(y_{\text{rep}} \mid y)$.

If the simulated data are, on average, very different from the observed data (for example, in mean, variance, tails, or shape), then the likelihood model $p(y \mid \theta)$ is likely misspecified.
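As a minimal sketch of these mechanics (all names and numbers here are illustrative; in practice the posterior draws would come from the fitted model), one can draw replicated data sets from the posterior predictive and compare a test statistic such as the variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: observed counts and posterior draws of a Poisson rate
# lambda under a conjugate Gamma(1, 1) prior (stand-in for real MCMC output).
y_obs = rng.poisson(10, size=100)
lambda_draws = rng.gamma(1 + y_obs.sum(), 1 / (1 + len(y_obs)), size=2000)

# One replicated data set per posterior draw: y_rep^(s) ~ p(y_rep | theta^(s)).
y_rep = rng.poisson(lambda_draws[:, None], size=(len(lambda_draws), len(y_obs)))

# Compare a test statistic (here the variance) between replicates and data.
T_rep = y_rep.var(axis=1)
T_obs = y_obs.var()
ppp = (T_rep >= T_obs).mean()   # posterior predictive p-value for the variance
print(f"observed var = {T_obs:.2f}, ppp = {ppp:.2f}")
```

A posterior predictive p-value near 0 or 1 signals that the observed statistic sits in the tails of what the model can reproduce.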

Important points from the slides:

  • Priors “fade away” as we collect more data; the likelihood stays.

  • If the likelihood is structurally wrong, it will stay wrong no matter how much data we collect.

  • The goal is not to find a perfect model (none exists), but a model whose predictions are compatible with the data.

Poisson likelihood and its limitations for count data

For count data $Y \in \{0,1,2,\dots\}$, a common starting point is the Poisson distribution:

$$
P(Y = k \mid \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0,1,2,\dots,\; \lambda > 0.
$$

Moments:

  • Mean: $\mathbb{E}[Y \mid \lambda] = \lambda$

  • Variance: $\operatorname{Var}(Y \mid \lambda) = \lambda$

The key limitation:

  • The Poisson forces mean and variance to be equal.

  • Many real count data sets show overdispersion:

    $\operatorname{Var}(Y) > \mathbb{E}[Y].$

Posterior predictive checks can reveal this: simulated data from the Poisson model may have too low variance compared to the observed data.

The negative binomial distribution

The negative binomial distribution is a flexible alternative to the Poisson for count data with overdispersion.

Classical interpretation:

  • Consider repeated Bernoulli trials with success probability $p$.

  • Let $K$ be the number of failures observed before the $r$-th success.

Then $K$ follows a negative binomial distribution with parameters $(r, p)$:

$$
P(K = k \mid r, p) = \binom{k + r - 1}{k}\, (1-p)^k\, p^r, \quad k = 0,1,2,\dots,\; r > 0,\; 0 < p < 1.
$$

Moments in this parameterization:

  • Mean: $\mathbb{E}[K \mid r,p] = r\, \dfrac{1-p}{p}$

  • Variance: $\operatorname{Var}(K \mid r,p) = r\, \dfrac{1-p}{p^2}$.

We can reparameterize in terms of a mean $\lambda$ and an overdispersion parameter $r$.

Set

$$
\lambda = r\, \frac{1-p}{p}, \quad \text{so that} \quad p = \frac{r}{r + \lambda}.
$$

Substituting into the pmf yields an equivalent negative binomial distribution with parameters $(r, \lambda)$,

$$
P(K = k \mid r, \lambda) = \binom{k + r - 1}{k} \left( \frac{\lambda}{\lambda + r} \right)^{k} \left( \frac{r}{\lambda + r} \right)^{r}.
$$

In this form,

  • Mean: $\mathbb{E}[K \mid r,\lambda] = \lambda$

  • Variance: $\operatorname{Var}(K \mid r,\lambda) = \lambda + \dfrac{\lambda^2}{r}$.

Thus, for finite $r$, we have overdispersion:

$$
\operatorname{Var}(K \mid r,\lambda) > \mathbb{E}[K \mid r,\lambda].
$$

In the limit $r \to \infty$, the variance approaches the mean and the negative binomial distribution converges to a Poisson distribution with rate $\lambda$.

This makes the negative binomial a natural generalization of the Poisson that allows the variance to be separately controlled via $r$.
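A quick numerical check of this reparameterization, using scipy.stats.nbinom (whose `n` and `p` arguments play the roles of $r$ and $p = r/(r+\lambda)$ above):

```python
from scipy.stats import nbinom

lam, r = 10.0, 2.5          # target mean and dispersion parameter
p = r / (r + lam)           # p = r / (r + lambda)

dist = nbinom(n=r, p=p)     # scipy's (n, p) matches the (r, p) form above
print(dist.mean())          # ~ lam              (= 10.0)
print(dist.var())           # ~ lam + lam**2 / r (= 50.0)
```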

Posterior predictive checks: Poisson vs negative binomial

In the maternity-ward example (beds occupied per night):

  • A Poisson model might underestimate the variance of bed counts.

  • A negative binomial model allows the variance to exceed the mean and can better match the data.

Posterior predictive checks reveal that:

  • The Poisson model’s replicated data tend to have too little dispersion.

  • The negative binomial model’s replicated data better match both the mean and the spread of observed counts.

However:

  • Adding an extra parameter (the dispersion $r$ or $\alpha$) increases epistemic uncertainty.

  • There is a trade-off: more flexible models fit data better but are harder to estimate with limited data.
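A sketch of how such a comparison could be set up in PyMC and ArviZ; the data and priors below are illustrative stand-ins, not the actual maternity-ward data:

```python
import numpy as np
import pymc as pm
import arviz as az

# Synthetic stand-in for the observed counts of beds occupied per night.
beds = np.random.default_rng(1).negative_binomial(5, 0.3, size=120)

with pm.Model() as poisson_model:
    lam = pm.Exponential("lam", 0.1)
    pm.Poisson("y", mu=lam, observed=beds)
    idata_pois = pm.sample()
    idata_pois.extend(pm.sample_posterior_predictive(idata_pois))

with pm.Model() as negbin_model:
    lam = pm.Exponential("lam", 0.1)
    alpha = pm.Exponential("alpha", 0.1)   # dispersion (the r above)
    pm.NegativeBinomial("y", mu=lam, alpha=alpha, observed=beds)
    idata_nb = pm.sample()
    idata_nb.extend(pm.sample_posterior_predictive(idata_nb))

# Graphical posterior predictive checks: the Poisson replicates are typically
# too narrow, while the negative binomial replicates match the spread better.
az.plot_ppc(idata_pois)
az.plot_ppc(idata_nb)
```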

Model selection: general ideas

Model selection asks: which model family explains the data best?

Typical desiderata:

  • Good fit to the observed data.

  • Good predictive performance on unseen data.

  • Avoiding unnecessary complexity (overfitting).

  • Reasonable interpretability.

In a Bayesian framework, model selection can be approached in several ways:

  1. Posterior model probabilities via marginal likelihoods and Bayes factors.

  2. Expected log-predictive density (ELPD), typically via cross-validation.

  3. Predictive-error metrics (RMSE, MAE) generalized to the Bayesian setting.

All three approaches are complementary and emphasize different aspects of model quality.

Approach 1: Bayesian model comparison via marginal likelihood

Treat each model $M_i$ as a hypothesis. For each model:

  • Parameters: $\theta_i$,

  • Prior: $p(\theta_i \mid M_i)$,

  • Likelihood: $p(y \mid \theta_i, M_i)$.

The marginal likelihood (or model evidence) is

$$
p(y \mid M_i) = \int p(y \mid \theta_i, M_i)\, p(\theta_i \mid M_i)\, d\theta_i.
$$

Given prior model probabilities $P(M_i)$, the posterior model probability is

$$
P(M_i \mid y) = \frac{p(y \mid M_i)\, P(M_i)}{\sum_j p(y \mid M_j)\, P(M_j)}.
$$

For two models $M_0$ and $M_1$:

  • Prior odds: $\dfrac{P(M_1)}{P(M_0)}$;

  • Posterior odds: $\dfrac{P(M_1 \mid y)}{P(M_0 \mid y)} = \operatorname{BF}_{10}\, \dfrac{P(M_1)}{P(M_0)}$;

  • Bayes factor: $\operatorname{BF}_{10} = \dfrac{p(y \mid M_1)}{p(y \mid M_0)}$.

Bayes factors quantify how much the data shift our odds between models.
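As a small worked illustration (not from the slides): for a binomial observation with a conjugate Beta prior, the marginal likelihood is available in closed form, so a Bayes factor against a fixed-probability model can be computed directly. All numbers below are made up.

```python
from math import comb
import numpy as np
from scipy.special import betaln

def log_marginal_binomial(y, n, a, b):
    """log p(y | M) for y ~ Binomial(n, theta) with theta ~ Beta(a, b)."""
    return np.log(comb(n, y)) + betaln(y + a, n - y + b) - betaln(a, b)

y, n = 7, 10
# M0: theta fixed at 0.5 (point hypothesis), so the marginal likelihood is
# just the binomial pmf at theta = 0.5.
log_m0 = np.log(comb(n, y)) + y * np.log(0.5) + (n - y) * np.log(0.5)
# M1: theta ~ Beta(1, 1) (uniform prior).
log_m1 = log_marginal_binomial(y, n, 1.0, 1.0)

BF10 = np.exp(log_m1 - log_m0)
print(f"BF_10 = {BF10:.2f}")
```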

Marginal likelihood: trade-off between accuracy and complexity

The log marginal likelihood admits a useful decomposition that highlights a trade-off between accuracy and complexity.

For a model $M$ with parameters $\theta$, prior $p(\theta \mid M)$, likelihood $p(y \mid \theta, M)$, and posterior $p(\theta \mid y, M)$, we have

$$
\log p(y \mid M) = \mathbb{E}_{\theta \mid y,M}\big[\log p(y \mid \theta,M)\big] - \operatorname{KL}\big( p(\theta \mid y,M) \,\big\|\, p(\theta \mid M) \big),
$$

where

  • $\mathbb{E}_{\theta \mid y,M}[\cdot]$ is the expectation with respect to the posterior $p(\theta \mid y,M)$,

  • $\operatorname{KL}(p \,\|\, q)$ is the Kullback–Leibler divergence,

    $\operatorname{KL}(p \,\|\, q) = \int p(\theta)\, \log \frac{p(\theta)}{q(\theta)}\, d\theta.$

Interpretation:

  • Accuracy term

    $\mathbb{E}_{\theta \mid y,M}\big[\log p(y \mid \theta,M)\big]$

    measures how well the model fits the data on average under the posterior.

  • Complexity term

    $\operatorname{KL}\big( p(\theta \mid y,M) \,\big\|\, p(\theta \mid M) \big)$

    measures the information gain from prior to posterior (always non-negative).

Models with very weak priors and many parameters tend to have:

  • high potential accuracy,

  • but also high complexity (large information gain),

  • which can reduce the marginal likelihood.

Hence, the marginal likelihood automatically implements an Occam’s razor: it balances fit against complexity.
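This identity can be verified numerically in a conjugate example. The sketch below (an added illustration with made-up numbers) uses a Beta-Binomial model, where the log evidence, the posterior expectation of the log-likelihood, and the prior-posterior KL divergence are all easy to evaluate:

```python
from math import comb
import numpy as np
from scipy.stats import beta
from scipy.integrate import quad
from scipy.special import betaln

y, n, a, b = 7, 10, 2.0, 2.0          # data and Beta(a, b) prior
prior = beta(a, b)
post = beta(a + y, b + n - y)         # conjugate posterior

def loglik(theta):
    return np.log(comb(n, y)) + y * np.log(theta) + (n - y) * np.log(1 - theta)

# Accuracy term: E_post[log p(y | theta)]
accuracy, _ = quad(lambda t: post.pdf(t) * loglik(t), 0, 1)

# Complexity term: KL(posterior || prior)
kl, _ = quad(lambda t: post.pdf(t) * (post.logpdf(t) - prior.logpdf(t)), 0, 1)

# Exact log marginal likelihood of the Beta-Binomial model
log_evidence = np.log(comb(n, y)) + betaln(a + y, b + n - y) - betaln(a, b)

print(accuracy - kl, log_evidence)    # the two numbers agree
```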

Approach 2: Expected log-predictive density (ELPD)

An alternative view focuses on predictive performance: how well does a model predict new, unseen data?

Let $y_{\text{new}}$ be a future observation. The posterior predictive density is

$$
p(y_{\text{new}} \mid y, M) = \int p(y_{\text{new}} \mid \theta, M)\, p(\theta \mid y, M)\, d\theta.
$$

The expected log-predictive density (ELPD) of a model $M$ is

$$
\operatorname{ELPD}(M) = \mathbb{E}_{y_{\text{new}}}\big[ \log p(y_{\text{new}} \mid y, M) \big],
$$

where the expectation is taken over hypothetical new data $y_{\text{new}}$ from the (unknown) data-generating process.

  • A larger ELPD (closer to zero) indicates better predictive performance.

  • Very negative ELPD indicates poor predictions (assigning low probability to typical future data).

In practice, since the true distribution of $y_{\text{new}}$ is unknown, we approximate ELPD using cross-validation, most commonly Leave-One-Out (LOO):

  • For each observed data point $y_i$, we pretend it is “new” and compute its predictive density given all other data $y_{-i}$:

    $p(y_i \mid y_{-i}, M).$

  • The LOO ELPD is approximated as

    $\widehat{\operatorname{ELPD}}_{\text{LOO}}(M) = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}, M).$

Computing this naïvely would require running an MCMC fit $n$ times, once for each $y_{-i}$. Modern practice uses approximations like PSIS-LOO (Pareto-smoothed importance sampling) or WAIC, both available in PyMC / ArviZ.
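For models fitted with PyMC, these estimates are available through ArviZ. A sketch, reusing the `idata_pois` and `idata_nb` objects from the earlier Poisson vs negative binomial example and assuming the pointwise log-likelihood has been stored:

```python
import arviz as az

# Requires the pointwise log-likelihood in the InferenceData, e.g. via
# pm.compute_log_likelihood(idata) or idata_kwargs={"log_likelihood": True}.
loo_pois = az.loo(idata_pois)   # PSIS-LOO estimate of ELPD for the Poisson model
loo_nb = az.loo(idata_nb)       # ... and for the negative binomial model

# Side-by-side comparison, ranked by estimated ELPD.
print(az.compare({"poisson": idata_pois, "negbinomial": idata_nb}))
```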

Approach 3: Bayesian RMSE and MAE

Classical (frequentist) predictive metrics for a model with point predictions $\hat{y}_i$ are:

  • Root mean squared error (RMSE):

    $\operatorname{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}.$

  • Mean absolute error (MAE):

    $\operatorname{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|.$

In a Bayesian model, there is not a single point prediction per data point, but a predictive distribution. Using posterior predictive simulation, we can generate:

  • for each observed $y_i$ and MCMC draw $s = 1,\dots,S$, a predictive draw $y_i^{(s)}$.

We can generalize RMSE and MAE by averaging over both data points and posterior samples:

  • Bayesian RMSE:

    $\operatorname{RMSE}_{\text{Bayes}} = \sqrt{\dfrac{1}{S\,n} \sum_{s=1}^{S} \sum_{i=1}^{n} \big(y_i - y_i^{(s)}\big)^2}.$

  • Bayesian MAE:

    $\operatorname{MAE}_{\text{Bayes}} = \dfrac{1}{S\,n} \sum_{s=1}^{S} \sum_{i=1}^{n} \big|y_i - y_i^{(s)}\big|.$

These metrics evaluate how close, on average, the predictive distribution is to the observed data. They emphasize typical prediction error, which can sometimes lead to different conclusions than marginal likelihoods or ELPD.
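Given a matrix of posterior predictive draws, both metrics are short NumPy computations; a sketch assuming `y_rep` has shape `(S, n)`, one row per posterior draw, as in the earlier posterior predictive check example:

```python
import numpy as np

def bayes_rmse(y_obs, y_rep):
    """y_obs: shape (n,); y_rep: posterior predictive draws, shape (S, n)."""
    return np.sqrt(np.mean((y_rep - y_obs) ** 2))   # average over draws and points

def bayes_mae(y_obs, y_rep):
    return np.mean(np.abs(y_rep - y_obs))
```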

Summary of model selection approaches

The three approaches emphasize different aspects of model quality:

  1. Marginal likelihood / Bayes factors (Approach 1)

    • Principled fully Bayesian comparison of models.

    • Automatically penalizes complexity via the prior–posterior KL divergence.

    • Sensitive to prior choices and sometimes hard or unstable to compute.

  2. ELPD / LOO / WAIC (Approach 2)

    • Focuses on predictive accuracy for new data.

    • Based on the posterior predictive distribution and cross-validation ideas.

    • Widely used in practice (e.g. via PSIS-LOO).

  3. Bayesian RMSE/MAE (Approach 3)

    • Generalizes familiar predictive-error metrics to the Bayesian setting.

    • Emphasizes typical prediction error rather than full probabilistic fit or tail behavior.

Apparent disagreements between these approaches in small data sets are not contradictions; they simply reflect that different questions are being asked about model quality.

Multivariate Bayesian problems

Many Bayesian models involve multiple parameters and/or vector-valued data. For example:

  • A normal likelihood with unknown mean $\mu$ and standard deviation $\sigma$.

  • A regression model with several regression coefficients and a noise standard deviation.

  • Categorical or multinomial outcomes with multiple category probabilities.

In multivariate settings, we care about:

  • the joint posterior distribution over all parameters,

  • dependencies and correlations between parameters,

  • and appropriate multivariate priors and likelihoods.

Covariance

For two random variables $X$ and $Y$, the covariance is

$$
\operatorname{Cov}(X,Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big] = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y].
$$

Intuition:

  • If $X$ tends to be above its mean when $Y$ is above its mean (and below when $Y$ is below), $\operatorname{Cov}(X,Y)$ is positive.

  • If $X$ tends to be above its mean when $Y$ is below its mean (and vice versa), the covariance is negative.

  • If $X$ and $Y$ are unrelated (no linear association), the covariance is close to zero.

Basic properties:

  • $\operatorname{Cov}(X,X) = \operatorname{Var}(X) \ge 0$.

  • $\operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X)$.

Empirical covariance and covariance matrix

Given data $(x_i, y_i)$ for $i = 1,\dots,n$, the empirical covariance is

$$
\widehat{\operatorname{Cov}}(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),
$$

where $\bar{x}$ and $\bar{y}$ are the sample means of $x_i$ and $y_i$.

For $p$ features arranged in an $n \times p$ data matrix $X$ (rows are observations, columns are features):

  1. Demean each column: subtract the column means to get a centered matrix $\tilde{X}$.

  2. The empirical covariance matrix is

    $\hat{\Sigma} = \dfrac{1}{n-1} \tilde{X}^{\top} \tilde{X},$

    which is a $p \times p$ symmetric, positive semi-definite matrix.

Diagonal entries are the sample variances of each feature; off-diagonal entries are sample covariances between features.
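A sketch of this computation in NumPy, checked against np.cov (the data are random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # n = 200 observations, p = 3 features

X_tilde = X - X.mean(axis=0)           # demean each column
Sigma_hat = X_tilde.T @ X_tilde / (X.shape[0] - 1)

# np.cov expects variables in rows by default, hence rowvar=False.
print(np.allclose(Sigma_hat, np.cov(X, rowvar=False)))   # True
```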

Correlation and correlation matrix

Covariance depends on the units of measurement. To obtain a dimensionless measure of linear association, we use the Pearson correlation coefficient.

For random variables $X$ and $Y$:

$$
\rho_{X,Y} = \frac{\operatorname{Cov}(X,Y)}{\sqrt{\operatorname{Var}(X)\,\operatorname{Var}(Y)}}, \quad -1 \le \rho_{X,Y} \le 1.
$$

  • $\rho_{X,Y} \approx 1$: strong positive linear relationship.

  • $\rho_{X,Y} \approx -1$: strong negative linear relationship.

  • $\rho_{X,Y} \approx 0$: little or no linear relationship.

Empirically, we replace covariance and variances by their sample counterparts to get the sample correlation.

A correlation matrix is obtained by standardizing the covariance matrix: diagonal entries are 1, off-diagonal entries are pairwise correlations between variables.
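A short illustration: standardizing the empirical covariance matrix reproduces what np.corrcoef computes directly (again with random, illustrative data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

Sigma_hat = np.cov(X, rowvar=False)    # empirical covariance matrix
sd = np.sqrt(np.diag(Sigma_hat))       # sample standard deviations
R_hat = Sigma_hat / np.outer(sd, sd)   # R = D^{-1} Sigma D^{-1}

print(np.allclose(R_hat, np.corrcoef(X, rowvar=False)))  # True
```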

Multinomial likelihood: generalizing the binomial

The binomial distribution models the number of successes in $n$ trials with success probability $\pi$ and two possible outcomes (success/failure).

For more than two mutually exclusive categories, we use the multinomial distribution.

Let there be $K$ categories with probabilities $\boldsymbol{\pi} = (\pi_1,\dots,\pi_K)$, where

$$
\pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1.
$$

Suppose we conduct $n$ independent trials, and let $\boldsymbol{k} = (k_1,\dots,k_K)$ be the counts in each category, with $\sum_{k=1}^{K} k_k = n$. The multinomial pmf is

$$
P(\boldsymbol{k} \mid \boldsymbol{\pi}) = \frac{n!}{k_1! \cdots k_K!} \prod_{k=1}^{K} \pi_k^{k_k}.
$$

This is the natural likelihood model for:

  • election polls with multiple candidates,

  • survey responses on multiple-choice scales,

  • multi-class classification counts,

  • counts of different defect types, etc.
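A minimal example of evaluating this pmf and simulating counts with scipy.stats.multinomial (the probabilities and counts are made up):

```python
from scipy.stats import multinomial

pi = [0.5, 0.3, 0.2]            # category probabilities, sum to 1
n = 10

dist = multinomial(n, pi)
print(dist.pmf([5, 3, 2]))      # probability of observing these exact counts
print(dist.rvs(size=5))         # five simulated count vectors
```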

Dirichlet prior: generalizing the beta distribution

For multinomial problems, the natural conjugate prior for the category probabilities $\boldsymbol{\pi}$ is the Dirichlet distribution.

With concentration parameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_K)$, where $\alpha_k > 0$, the Dirichlet density is

$$
p(\boldsymbol{\pi} \mid \boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}, \quad \alpha_0 = \sum_{k=1}^{K} \alpha_k,
$$

on the simplex

$$
\left\{ \boldsymbol{\pi} \in \mathbb{R}^K \,\Big|\, \pi_k \ge 0,\; \sum_{k=1}^{K} \pi_k = 1 \right\}.
$$

Moments:

  • Mean of each component: $\mathbb{E}[\pi_k] = \dfrac{\alpha_k}{\alpha_0}$.

Conjugacy with the multinomial likelihood:

  • Prior: $\boldsymbol{\pi} \sim \operatorname{Dirichlet}(\boldsymbol{\alpha})$.

  • Likelihood: $\boldsymbol{k} \mid \boldsymbol{\pi} \sim \operatorname{Multinomial}(n, \boldsymbol{\pi})$.

  • Posterior:

    $\boldsymbol{\pi} \mid \boldsymbol{k} \sim \operatorname{Dirichlet}(\boldsymbol{\alpha} + \boldsymbol{k}),$

    i.e. we simply add the counts to the prior parameters.

The Dirichlet distribution is the multivariate generalization of the beta distribution (which corresponds to $K = 2$).
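Because the posterior update only adds the observed counts to the prior parameters, it can be written down and sampled directly; a small sketch with illustrative numbers:

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])     # Dirichlet prior (uniform on the simplex)
counts = np.array([42, 31, 27])       # observed category counts, n = 100

alpha_post = alpha + counts           # Dirichlet posterior parameters

# Posterior mean of each category probability: alpha_k / alpha_0
print(alpha_post / alpha_post.sum())

# Draws from the posterior, e.g. for credible intervals on pi_1
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha_post, size=4000)
print(np.percentile(samples[:, 0], [2.5, 97.5]))
```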

Multivariate normal distribution

The multivariate normal (Gaussian) distribution extends the univariate normal to $\mathbb{R}^d$.

A random vector $\mathbf{X} \in \mathbb{R}^d$ has a multivariate normal distribution with mean vector $\boldsymbol{\mu} \in \mathbb{R}^d$ and covariance matrix $\Sigma$ (symmetric, positive definite), written $\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, if its density is

$$
p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right).
$$

Properties:

  • Marginals of a multivariate normal are normal.

  • Any linear combination $a^{\top} \mathbf{X}$ is normal.

  • The covariance matrix $\Sigma$ encodes variances (diagonal) and covariances (off-diagonal).

In Bayesian models:

  • Multivariate normal distributions appear as likelihoods for multivariate continuous data.

  • They also appear as priors for parameter vectors (e.g. regression coefficients).
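A short illustration with scipy.stats.multivariate_normal (the mean vector and covariance matrix are arbitrary):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.8],
                  [0.8, 2.0]])        # symmetric, positive definite

dist = multivariate_normal(mean=mu, cov=Sigma)
x = dist.rvs(size=1000, random_state=0)

print(dist.pdf([0.0, 1.0]))           # density at the mean
print(np.cov(x, rowvar=False))        # close to Sigma for large samples
```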

LKJ prior for correlation matrices

In multivariate normal models, we often want a prior on the correlation matrix $R$ rather than directly on the covariance matrix $\Sigma$.

We can write

$$
\Sigma = D\, R\, D,
$$

where $D$ is a diagonal matrix with standard deviations $\sigma_1,\dots,\sigma_d$ on the diagonal, and $R$ is a $d \times d$ correlation matrix (ones on the diagonal, off-diagonal entries between $-1$ and $1$).

The Lewandowski–Kurowicka–Joe (LKJ) distribution is a flexible prior for correlation matrices $R$:

  • Parameter: $\eta > 0$ (shape).

  • The density (for $d \ge 2$) is proportional to

    $p(R \mid \eta) \propto |R|^{\eta - 1},$

    up to a normalization constant that depends on $d$ and $\eta$.

Interpretation:

  • $\eta = 1$: roughly uniform over correlation matrices (no preference for any correlation structure).

  • $\eta > 1$: prior mass is concentrated near the identity matrix (weak correlations).

  • $\eta < 1$: prior mass favors strong correlations.

A common strategy:

  1. Put priors on the standard deviations $\sigma_i$ (e.g. half-normal or exponential).

  2. Put an LKJ prior on the correlation matrix $R$.

  3. Construct $\Sigma = D R D$ (sometimes via a Cholesky factorization for numerical stability).

This combination is sometimes referred to as an LKJ covariance prior.
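In PyMC this strategy is packaged as pm.LKJCholeskyCov. A sketch for bivariate data (the data, prior settings, and dimensions are illustrative):

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
# Synthetic bivariate data standing in for real observations.
data = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]], size=200)

with pm.Model() as mvn_model:
    # LKJ prior on the correlation structure, exponential priors on the sds,
    # returned as a Cholesky factor for numerical stability.
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=2, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    mu = pm.Normal("mu", 0.0, 1.0, shape=2)
    pm.MvNormal("y", mu=mu, chol=chol, observed=data)
    idata = pm.sample()
```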