Model Checks, Model Selection, Multivariate Distributions
Posterior predictive checks¶
Bayesian models are generative: once we have a posterior $p(\theta \mid y)$, we can simulate new data sets from the model.
For observed data $y$ and future (or replicated) data $\tilde{y}$, the posterior predictive distribution is
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.$$
A posterior predictive check compares:
the observed data $y$, and
many replicated datasets $y^{\mathrm{rep}}$ drawn from $p(\tilde{y} \mid y)$.
If the simulated data are, on average, very different from the observed data (for example, in mean, variance, tails, shape…), then the likelihood model is likely misspecified.
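A minimal sketch of such a check in PyMC / ArviZ is shown below; the data `y_obs`, the normal likelihood, and the priors are made-up illustrations, not part of the lecture example.

```python
# Minimal posterior predictive check sketch in PyMC / ArviZ.
# The data and priors are illustrative assumptions.
import numpy as np
import pymc as pm
import arviz as az

rng = np.random.default_rng(1)
y_obs = rng.normal(loc=10.0, scale=2.0, size=50)   # hypothetical observations

with pm.Model() as model:
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_obs)

    idata = pm.sample(1000, tune=1000, random_seed=1)
    # Draw replicated datasets y_rep ~ p(y_rep | y) and store them with the posterior.
    idata.extend(pm.sample_posterior_predictive(idata, random_seed=1))

# Overlay the distribution of the replicated datasets on the observed data.
az.plot_ppc(idata, num_pp_samples=100)
```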
Important points from the slides:
Priors “fade away” as we collect more data; the likelihood stays.
If the likelihood is structurally wrong, it will stay wrong no matter how much data we collect.
The goal is not to find a perfect model (none exists), but a model whose predictions are compatible with the data.
Poisson likelihood and its limitations for count data¶
For count data $y \in \{0, 1, 2, \dots\}$, a common starting point is the Poisson distribution:
$$p(y \mid \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad \lambda > 0.$$
Moments:
Mean: $\mathbb{E}[y] = \lambda$
Variance: $\mathrm{Var}[y] = \lambda$
The key limitation:
The Poisson forces mean and variance to be equal.
Many real count data sets show overdispersion: $\mathrm{Var}[y] > \mathbb{E}[y]$.
Posterior predictive checks can reveal this: simulated data from the Poisson model may have too low variance compared to the observed data.
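The following small numerical illustration (with made-up numbers) shows the constraint: plain Poisson counts have variance equal to the mean, while counts whose rate varies from observation to observation are overdispersed.

```python
# Illustrative check of the Poisson mean-variance constraint (made-up numbers).
import numpy as np

rng = np.random.default_rng(0)

# Plain Poisson counts: variance is forced to equal the mean.
poisson_counts = rng.poisson(lam=10.0, size=10_000)
print(poisson_counts.mean(), poisson_counts.var())        # both ~10

# Overdispersed counts: the rate itself varies from night to night
# (here drawn from a Gamma with mean 10), so the variance exceeds the mean.
rates = rng.gamma(shape=5.0, scale=2.0, size=10_000)
overdispersed_counts = rng.poisson(lam=rates)
print(overdispersed_counts.mean(), overdispersed_counts.var())  # mean ~10, variance ~30
```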
The negative binomial distribution¶
The negative binomial distribution is a flexible alternative to the Poisson for count data with overdispersion.
Classical interpretation:
Consider repeated Bernoulli trials with success probability $p$.
Let $y$ be the number of failures observed before the $r$-th success.
Then $y$ follows a negative binomial distribution with parameters $(r, p)$:
$$p(y \mid r, p) = \binom{y + r - 1}{y}\, p^{r} (1 - p)^{y}, \qquad y = 0, 1, 2, \dots$$
Moments in this parameterization:
Mean: $\mathbb{E}[y] = \dfrac{r(1 - p)}{p}$
Variance: $\mathrm{Var}[y] = \dfrac{r(1 - p)}{p^{2}}$
We can reparameterize in terms of a mean $\mu$ and overdispersion parameter $\alpha$.
Set
$$\mu = \frac{r(1 - p)}{p}, \qquad \alpha = r, \qquad \text{i.e.}\quad p = \frac{\alpha}{\alpha + \mu}.$$
Substituting into the pmf yields an equivalent negative binomial distribution with parameters $(\mu, \alpha)$,
$$p(y \mid \mu, \alpha) = \frac{\Gamma(y + \alpha)}{\Gamma(\alpha)\, y!} \left(\frac{\alpha}{\alpha + \mu}\right)^{\alpha} \left(\frac{\mu}{\alpha + \mu}\right)^{y}.$$
In this form,
Mean: $\mathbb{E}[y] = \mu$
Variance: $\mathrm{Var}[y] = \mu + \dfrac{\mu^{2}}{\alpha}$
Thus, for finite $\alpha$, we have overdispersion: $\mathrm{Var}[y] > \mathbb{E}[y]$.
In the limit $\alpha \to \infty$, the variance approaches the mean and the negative binomial distribution converges to a Poisson distribution with rate $\mu$.
This makes the negative binomial a natural generalization of the Poisson that allows the variance to be controlled separately via $\alpha$.
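The $(\mu, \alpha)$ parameterization can be checked numerically against SciPy's negative binomial, assuming the mapping $r = \alpha$, $p = \alpha / (\alpha + \mu)$ used above; the values of $\mu$ and $\alpha$ are arbitrary.

```python
# Numerical check of the (mu, alpha) parameterization, assuming r = alpha,
# p = alpha / (alpha + mu). The chosen values are illustrative.
from scipy import stats

mu, alpha = 10.0, 4.0
p = alpha / (alpha + mu)

dist = stats.nbinom(n=alpha, p=p)    # scipy parameterizes by (r, p) = (n, p)
print(dist.mean())                   # = mu = 10.0
print(dist.var())                    # = mu + mu**2 / alpha = 35.0
print(mu + mu**2 / alpha)            # 35.0
```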
Posterior predictive checks: Poisson vs negative binomial¶
In the maternity-ward example (beds occupied per night):
A Poisson model might underestimate the variance of bed counts.
A negative binomial model allows the variance to exceed the mean and can better match the data.
Posterior predictive checks reveal that:
The Poisson model’s replicated data tend to have too little dispersion.
The negative binomial model’s replicated data better match both the mean and the spread of observed counts.
However:
Adding an extra parameter (the overdispersion parameter $\alpha$) increases epistemic uncertainty.
There is a trade-off: more flexible models fit data better but are harder to estimate with limited data.
Model selection: general ideas¶
Model selection asks: which model family explains the data best?
Typical desiderata:
Good fit to the observed data.
Good predictive performance on unseen data.
Avoiding unnecessary complexity (overfitting).
Reasonable interpretability.
In a Bayesian framework, model selection can be approached in several ways:
Posterior model probabilities via marginal likelihoods and Bayes factors.
Expected log-predictive density (ELPD), typically via cross-validation.
Predictive-error metrics (RMSE, MAE) generalized to the Bayesian setting.
All three approaches are complementary and emphasize different aspects of model quality.
Approach 1: Bayesian model comparison via marginal likelihood¶
Treat each model $M_k$ as a hypothesis. For each model:
Parameters: $\theta_k$,
Prior: $p(\theta_k \mid M_k)$,
Likelihood: $p(y \mid \theta_k, M_k)$.
The marginal likelihood (or model evidence) is
$$p(y \mid M_k) = \int p(y \mid \theta_k, M_k)\, p(\theta_k \mid M_k)\, d\theta_k.$$
Given prior model probabilities $p(M_k)$, the posterior model probability is
$$p(M_k \mid y) = \frac{p(y \mid M_k)\, p(M_k)}{\sum_{j} p(y \mid M_j)\, p(M_j)}.$$
For two models $M_1$ and $M_2$:
Prior odds: $\dfrac{p(M_1)}{p(M_2)}$
Posterior odds: $\dfrac{p(M_1 \mid y)}{p(M_2 \mid y)} = \mathrm{BF}_{12} \cdot \dfrac{p(M_1)}{p(M_2)}$
Bayes factor: $\mathrm{BF}_{12} = \dfrac{p(y \mid M_1)}{p(y \mid M_2)}$
Bayes factors quantify how much the data shift our odds between models.
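As a toy illustration (not from the lecture), the Bayes factor can be computed in closed form for a coin-flip experiment with two hypothetical models: $M_1$ places a uniform Beta(1, 1) prior on the success probability, while $M_2$ fixes it at 0.5. The data (36 successes in 50 trials) are invented.

```python
# Toy Bayes factor for a coin-flip experiment with two illustrative models:
#   M1: theta ~ Beta(1, 1) (unknown bias), binomial likelihood
#   M2: theta = 0.5 fixed (fair coin)
import numpy as np
from scipy.special import betaln, comb

n, k = 50, 36    # hypothetical data: 36 successes in 50 trials

# Marginal likelihood under M1: integral of Binomial(k | n, theta) * Beta(theta | 1, 1)
log_ml_m1 = np.log(comb(n, k)) + betaln(1 + k, 1 + n - k) - betaln(1, 1)

# Marginal likelihood under M2: no free parameters, just the binomial pmf at theta = 0.5
log_ml_m2 = np.log(comb(n, k)) + n * np.log(0.5)

bf_12 = np.exp(log_ml_m1 - log_ml_m2)
print(bf_12)   # > 1 here: the data favour the unknown-bias model M1
```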
Marginal likelihood: trade-off between accuracy and complexity¶
The log marginal likelihood admits a useful decomposition that highlights a trade-off between accuracy and complexity.
For a model with parameters $\theta$, prior $p(\theta)$, likelihood $p(y \mid \theta)$, and posterior $p(\theta \mid y)$, we have
$$\log p(y) = \mathbb{E}_{\theta \sim p(\theta \mid y)}\!\big[\log p(y \mid \theta)\big] \;-\; \mathrm{KL}\big(p(\theta \mid y) \,\|\, p(\theta)\big),$$
where
$\mathbb{E}_{\theta \sim p(\theta \mid y)}[\,\cdot\,]$ is expectation w.r.t. the posterior $p(\theta \mid y)$,
$\mathrm{KL}(q \,\|\, p) = \int q(\theta) \log \dfrac{q(\theta)}{p(\theta)}\, d\theta$ is the Kullback–Leibler divergence.
Interpretation:
Accuracy term $\mathbb{E}_{\theta \sim p(\theta \mid y)}\big[\log p(y \mid \theta)\big]$: measures how well the model fits the data on average under the posterior.
Complexity term $\mathrm{KL}\big(p(\theta \mid y) \,\|\, p(\theta)\big)$: measures the information gain from prior to posterior (always non-negative).
Models with very weak priors and many parameters tend to have:
high potential accuracy,
but also high complexity (large information gain),
which can reduce the marginal likelihood.
Hence, the marginal likelihood automatically implements an Occam’s razor: it balances fit against complexity.
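The decomposition can be verified numerically on a conjugate Beta-Binomial model, where the log marginal likelihood is available in closed form. The prior Beta(2, 2) and the data (22 successes in 30 trials) below are illustrative choices, not the lecture's example.

```python
# Numerical sanity check of the accuracy-minus-complexity decomposition on a
# conjugate Beta-Binomial model (illustrative prior and data).
import numpy as np
from scipy import stats
from scipy.special import betaln, comb
from scipy.integrate import quad

a0, b0 = 2.0, 2.0             # Beta prior
n, k = 30, 22                 # hypothetical data: 22 successes in 30 trials
a1, b1 = a0 + k, b0 + n - k   # conjugate Beta posterior

# Exact log marginal likelihood of the Beta-Binomial model
log_ml = np.log(comb(n, k)) + betaln(a1, b1) - betaln(a0, b0)

# Accuracy term: posterior expectation of the log-likelihood (Monte Carlo)
theta = stats.beta.rvs(a1, b1, size=200_000, random_state=0)
accuracy = stats.binom.logpmf(k, n, theta).mean()

# Complexity term: KL(posterior || prior), by numerical integration
# (integrating just inside the endpoints avoids log(0) at the boundary)
def kl_integrand(t):
    return stats.beta.pdf(t, a1, b1) * (
        stats.beta.logpdf(t, a1, b1) - stats.beta.logpdf(t, a0, b0)
    )

complexity, _ = quad(kl_integrand, 1e-12, 1 - 1e-12)

print(log_ml, accuracy - complexity)   # the two numbers should nearly coincide
```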
Approach 2: Expected log-predictive density (ELPD)¶
An alternative view focuses on predictive performance: how well does a model predict new, unseen data?
Let $\tilde{y}$ be a future observation. The posterior predictive density is
$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta)\, p(\theta \mid y)\, d\theta.$$
The expected log-predictive density (ELPD) of a model is
$$\mathrm{ELPD} = \mathbb{E}_{\tilde{y}}\big[\log p(\tilde{y} \mid y)\big],$$
where the expectation is taken over hypothetical new data from the (unknown) data-generating process.
Large ELPD (close to 0) indicates good predictive performance.
Very negative ELPD indicates poor predictions (assigning low probability to typical future data).
In practice, since the true distribution of is unknown, we approximate ELPD using cross-validation, most commonly Leave-One-Out (LOO):
For each observed data point $y_i$, we pretend it is “new” and compute its predictive density given all other data $y_{-i}$:
$$p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, d\theta.$$
The LOO ELPD is approximated as
$$\mathrm{ELPD}_{\mathrm{LOO}} = \sum_{i=1}^{n} \log p(y_i \mid y_{-i}).$$
Computing this naïvely would require running an MCMC fit $n$ times, once for each $i$. Modern practice uses approximations like PSIS-LOO (Pareto-smoothed importance sampling) or WAIC, both available in PyMC / ArviZ.
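A short sketch of this workflow with ArviZ, assuming `idata_pois` and `idata_nb` are InferenceData objects from two fitted PyMC models whose pointwise log-likelihoods were stored:

```python
# PSIS-LOO model comparison with ArviZ. `idata_pois` and `idata_nb` are assumed
# to come from pm.sample(..., idata_kwargs={"log_likelihood": True}) so that the
# pointwise log-likelihood values needed by LOO are available.
import arviz as az

loo_pois = az.loo(idata_pois)   # ELPD estimate + Pareto-k diagnostics
loo_nb = az.loo(idata_nb)

# Rank the models by estimated ELPD (higher is better).
print(az.compare({"poisson": idata_pois, "negbinom": idata_nb}, ic="loo"))
```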
Approach 3: Bayesian RMSE and MAE¶
Classical (frequentist) predictive metrics for a model with point predictions $\hat{y}_i$ are:
Root mean squared error (RMSE): $\mathrm{RMSE} = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$
Mean absolute error (MAE): $\mathrm{MAE} = \dfrac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$
In a Bayesian model, there is not a single point prediction per data point, but a predictive distribution. Using posterior predictive simulation, we can generate:
for each observed $y_i$ and MCMC draw $s = 1, \dots, S$, a predictive draw $\tilde{y}_i^{(s)}$.
We can generalize RMSE and MAE by averaging over both data points and posterior samples:
Bayesian RMSE: $\mathrm{RMSE}_{\mathrm{Bayes}} = \sqrt{\dfrac{1}{nS} \sum_{i=1}^{n} \sum_{s=1}^{S} \big( y_i - \tilde{y}_i^{(s)} \big)^2}$
Bayesian MAE: $\mathrm{MAE}_{\mathrm{Bayes}} = \dfrac{1}{nS} \sum_{i=1}^{n} \sum_{s=1}^{S} \big| y_i - \tilde{y}_i^{(s)} \big|$
These metrics evaluate how close, on average, the predictive distribution is to the observed data. They emphasize typical prediction error, which can sometimes lead to different conclusions than marginal likelihoods or ELPD.
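A minimal NumPy sketch of these two averages, assuming the posterior predictive draws are stored in an array `y_rep` of shape (S, n); the helper name and the fake data are illustrative.

```python
# Bayesian RMSE / MAE from posterior predictive draws, assuming `y_rep` has
# shape (S, n): S predictive draws for each of the n observations.
import numpy as np

def bayes_rmse_mae(y_obs, y_rep):
    """Average prediction error over data points and posterior samples."""
    err = y_rep - y_obs[None, :]      # broadcast observed data over draws
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    return rmse, mae

# Illustrative usage with fake draws:
rng = np.random.default_rng(0)
y_obs = rng.poisson(10, size=40).astype(float)
y_rep = rng.poisson(10, size=(2000, 40)).astype(float)
print(bayes_rmse_mae(y_obs, y_rep))
```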
Summary of model selection approaches¶
The three approaches emphasize different aspects of model quality:
Marginal likelihood / Bayes factors (Approach 1)
Principled fully Bayesian comparison of models.
Automatically penalizes complexity via the prior–posterior KL divergence.
Sensitive to prior choices and sometimes hard or unstable to compute.
ELPD / LOO / WAIC (Approach 2)
Focuses on predictive accuracy for new data.
Based on the posterior predictive distribution and cross-validation ideas.
Widely used in practice (e.g. via PSIS-LOO).
Bayesian RMSE/MAE (Approach 3)
Generalizes familiar predictive-error metrics to the Bayesian setting.
Emphasizes typical prediction error rather than full probabilistic fit or tail behavior.
Apparent disagreements between these approaches in small data sets are not contradictions; they simply reflect that different questions are being asked about model quality.
Multivariate Bayesian problems¶
Many Bayesian models involve multiple parameters and/or vector-valued data. For example:
A normal likelihood with unknown mean $\mu$ and standard deviation $\sigma$.
A regression model with several regression coefficients and a noise standard deviation.
Categorical or multinomial outcomes with multiple category probabilities.
In multivariate settings, we care about:
the joint posterior distribution over all parameters,
dependencies and correlations between parameters,
and appropriate multivariate priors and likelihoods.
Covariance¶
For two random variables $X$ and $Y$, the covariance is
$$\mathrm{Cov}(X, Y) = \mathbb{E}\big[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\big].$$
Intuition:
If $X$ tends to be above its mean when $Y$ is above its mean (and below when $Y$ is below), $\mathrm{Cov}(X, Y)$ is positive.
If $X$ tends to be above its mean when $Y$ is below its mean (and vice versa), the covariance is negative.
If $X$ and $Y$ are unrelated (no linear association), the covariance is close to zero.
Basic properties:
$\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.
$\mathrm{Cov}(X, Y) = \mathrm{Cov}(Y, X)$.
Empirical covariance and covariance matrix¶
Given data $(x_i, y_i)$ for $i = 1, \dots, n$, the empirical covariance is
$$\widehat{\mathrm{Cov}}(x, y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}),$$
where $\bar{x}$ and $\bar{y}$ are the sample means of $x$ and $y$.
For $p$ features arranged in an $n \times p$ data matrix $X$ (rows are observations, columns are features):
Demean each column: subtract the column means to get a centered matrix $\tilde{X}$.
The empirical covariance matrix is
$$\widehat{\Sigma} = \frac{1}{n - 1}\, \tilde{X}^{\top} \tilde{X},$$
which is a symmetric, positive semi-definite $p \times p$ matrix.
Diagonal entries are the sample variances of each feature; off-diagonal entries are sample covariances between features.
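This computation is two lines of NumPy; the randomly generated data below are only for illustration, and the result is checked against `np.cov`.

```python
# Empirical covariance matrix from a demeaned data matrix, checked against np.cov.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))            # n = 200 observations, p = 3 features

X_centered = X - X.mean(axis=0)          # demean each column
cov_manual = X_centered.T @ X_centered / (X.shape[0] - 1)

print(np.allclose(cov_manual, np.cov(X, rowvar=False)))   # True
```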
Correlation and correlation matrix¶
Covariance depends on the units of measurement. To obtain a dimensionless measure of linear association, we use the Pearson correlation coefficient.
For random variables $X$ and $Y$:
$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}} \in [-1, 1].$$
$\rho \approx +1$: strong positive linear relationship.
$\rho \approx -1$: strong negative linear relationship.
$\rho \approx 0$: little or no linear relationship.
Empirically, we replace covariance and variances by their sample counterparts to get the sample correlation.
A correlation matrix is obtained by standardizing the covariance matrix: diagonal entries are 1, off-diagonal entries are pairwise correlations between variables.
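Concretely, if $D$ is the diagonal matrix of standard deviations, then $R = D^{-1} \widehat{\Sigma} D^{-1}$; a short check against `np.corrcoef` with illustrative data:

```python
# Standardizing a covariance matrix into a correlation matrix: R = D^{-1} Sigma D^{-1}.
# Data and covariance below are illustrative; result checked against np.corrcoef.
import numpy as np

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0, 0, 0],
                            cov=[[1.0, 0.8, 0.2],
                                 [0.8, 1.0, 0.0],
                                 [0.2, 0.0, 2.0]],
                            size=1000)

Sigma = np.cov(X, rowvar=False)
D_inv = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = D_inv @ Sigma @ D_inv                 # ones on the diagonal, correlations off it

print(np.allclose(R, np.corrcoef(X, rowvar=False)))   # True
```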
Multinomial likelihood: generalizing the binomial¶
The binomial distribution models the number of successes in $N$ trials with success probability $\theta$ and two possible outcomes (success/failure).
For more than two mutually exclusive categories, we use the multinomial distribution.
Let there be $K$ categories with probabilities $\theta_1, \dots, \theta_K$, where
$$\theta_k \ge 0, \qquad \sum_{k=1}^{K} \theta_k = 1.$$
Suppose we conduct $N$ independent trials, and let $y_1, \dots, y_K$ be the counts in each category with $\sum_{k} y_k = N$. The multinomial pmf is
$$p(y_1, \dots, y_K \mid \boldsymbol{\theta}, N) = \frac{N!}{y_1! \cdots y_K!} \prod_{k=1}^{K} \theta_k^{y_k}.$$
This is the natural likelihood model for:
election polls with multiple candidates,
survey responses on multiple-choice scales,
multi-class classification counts,
counts of different defect types, etc.
Dirichlet prior: generalizing the beta distribution¶
For multinomial problems, the natural conjugate prior for the category probabilities is the Dirichlet distribution.
With concentration parameters $\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)$, where $\alpha_k > 0$, the Dirichlet density is
$$p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\big(\sum_{k} \alpha_k\big)}{\prod_{k} \Gamma(\alpha_k)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1}$$
on the simplex
$$\Big\{ \boldsymbol{\theta} \in \mathbb{R}^{K} : \theta_k \ge 0,\ \textstyle\sum_{k=1}^{K} \theta_k = 1 \Big\}.$$
Moments:
Mean of each component: $\mathbb{E}[\theta_k] = \dfrac{\alpha_k}{\sum_{j} \alpha_j}$
Conjugacy with the multinomial likelihood:
Prior: $\boldsymbol{\theta} \sim \mathrm{Dirichlet}(\alpha_1, \dots, \alpha_K)$.
Likelihood: $(y_1, \dots, y_K) \sim \mathrm{Multinomial}(N, \boldsymbol{\theta})$.
Posterior:
$$\boldsymbol{\theta} \mid y \sim \mathrm{Dirichlet}(\alpha_1 + y_1, \dots, \alpha_K + y_K),$$
i.e. we simply add the counts to the prior parameters.
The Dirichlet distribution is the multivariate generalization of the beta distribution (which corresponds to $K = 2$).
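Because the update is just "add the counts to the prior parameters", it can be done directly in NumPy; the prior and the poll counts below are illustrative assumptions.

```python
# Conjugate Dirichlet-multinomial update with illustrative poll counts.
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])   # symmetric Dirichlet(1, 1, 1) prior
counts = np.array([44, 37, 19])           # hypothetical counts for 3 categories

alpha_post = alpha_prior + counts         # posterior is Dirichlet(alpha + counts)
post_mean = alpha_post / alpha_post.sum() # E[theta_k] = alpha_k / sum_j alpha_j
print(post_mean)

# Posterior draws of the category probabilities, e.g. for credible intervals.
rng = np.random.default_rng(0)
theta_draws = rng.dirichlet(alpha_post, size=5000)
print(np.percentile(theta_draws[:, 0], [2.5, 97.5]))   # 95% interval for theta_1
```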
Multivariate normal distribution¶
The multivariate normal (Gaussian) distribution extends the univariate normal to $\mathbb{R}^{d}$.
A random vector $\mathbf{x} \in \mathbb{R}^{d}$ has a multivariate normal distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$ (symmetric, positive definite), written $\mathbf{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, if its density is
$$p(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2} \, |\Sigma|^{1/2}} \exp\!\Big( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big).$$
Properties:
Marginals of a multivariate normal are normal.
Any linear combination $\mathbf{a}^{\top} \mathbf{x}$ is normal.
The covariance matrix encodes variances (diagonal) and covariances (off-diagonal).
In Bayesian models:
Multivariate normal distributions appear as likelihoods for multivariate continuous data.
They also appear as priors for parameter vectors (e.g. regression coefficients).
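The marginal and linear-combination properties can be illustrated numerically; the mean vector, covariance matrix, and weights below are made up.

```python
# Numerical illustration of two multivariate normal properties
# (marginals and linear combinations), using made-up parameters.
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

rng = np.random.default_rng(0)
x = rng.multivariate_normal(mu, Sigma, size=100_000)

# Marginal of the first component: N(mu_1, Sigma_11)
print(x[:, 0].mean(), x[:, 0].var())   # ~1.0 and ~2.0

# Linear combination a^T x: normal with mean a^T mu and variance a^T Sigma a
a = np.array([0.5, -1.5])
z = x @ a
print(z.mean(), a @ mu)                # both ~3.5
print(z.var(), a @ Sigma @ a)          # both ~1.85
```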
LKJ prior for correlation matrices¶
In multivariate normal models, we often want a prior on the correlation matrix $R$ rather than directly on the covariance matrix $\Sigma$.
We can write
$$\Sigma = D\, R\, D,$$
where $D$ is a diagonal matrix with the standard deviations $\sigma_1, \dots, \sigma_d$ on the diagonal, and $R$ is a correlation matrix (ones on the diagonal, off-diagonal entries between $-1$ and $1$).
The Lewandowski–Kurowicka–Joe (LKJ) distribution is a flexible prior for correlation matrices $R$:
Parameter: $\eta > 0$ (shape).
Density (for a correlation matrix $R$) is proportional to
$$p(R \mid \eta) \propto \det(R)^{\eta - 1},$$
up to a normalization constant that depends on $\eta$ and the dimension $d$.
Interpretation:
$\eta = 1$: roughly uniform over correlation matrices (no preference for any correlation structure).
$\eta > 1$: prior mass is concentrated near the identity matrix (weak correlations).
$\eta < 1$: prior mass favors strong correlations.
A common strategy:
Put priors on the standard deviations $\sigma_1, \dots, \sigma_d$ (e.g. half-normal or exponential).
Put an LKJ prior on the correlation matrix $R$.
Construct $\Sigma = D R D$ (sometimes via a Cholesky factorization for numerical stability).
This combination is sometimes referred to as an LKJ covariance prior.
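A minimal PyMC sketch of this construction, assuming 3-dimensional observations `Y`; the placeholder data, the exponential scale priors, and the choice $\eta = 2$ are illustrative assumptions.

```python
# Sketch of an LKJ covariance prior in PyMC for a 3-dimensional MvNormal likelihood.
# The data `Y`, the priors, and eta = 2 are illustrative assumptions.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
Y = rng.multivariate_normal(np.zeros(3), np.eye(3), size=100)   # placeholder data

with pm.Model() as model:
    # Exponential priors on the standard deviations, LKJ(eta=2) prior on the
    # correlations; LKJCholeskyCov returns the Cholesky factor of Sigma = D R D.
    chol, corr, stds = pm.LKJCholeskyCov(
        "chol", n=3, eta=2.0, sd_dist=pm.Exponential.dist(1.0), compute_corr=True
    )
    mu = pm.Normal("mu", 0.0, 5.0, shape=3)
    pm.MvNormal("obs", mu=mu, chol=chol, observed=Y)

    idata = pm.sample(1000, tune=1000, random_seed=0)
```

Working with the Cholesky factor (rather than building $\Sigma$ explicitly and inverting it) is the numerically stable route mentioned above.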