Estimation, Prediction, Hypothesis Tests
Estimation: summarizing posterior distributions¶
After performing Bayesian inference, we obtain a posterior distribution $p(\theta \mid y)$ for parameters $\theta$ given data $y$. To communicate results, we usually summarize this distribution with a few key numbers.
For a scalar parameter $\theta$:
Posterior mean: $\mathbb{E}[\theta \mid y] = \int \theta \, p(\theta \mid y) \, d\theta$
Posterior variance: $\operatorname{Var}[\theta \mid y] = \mathbb{E}\big[(\theta - \mathbb{E}[\theta \mid y])^2 \mid y\big]$
Posterior standard deviation: $\operatorname{sd}[\theta \mid y] = \sqrt{\operatorname{Var}[\theta \mid y]}$
Posterior mode / MAP estimate
The maximum a posteriori (MAP) estimate is
$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} p(\theta \mid y),$
i.e. the value of $\theta$ where the posterior density is largest.
MAP estimate for the beta–binomial model¶
In the beta–binomial model from earlier weeks,
Prior: $\theta \sim \mathrm{Beta}(\alpha, \beta)$,
Likelihood: $y \mid \theta \sim \mathrm{Binomial}(n, \theta)$,
the posterior is again beta:
$\theta \mid y \sim \mathrm{Beta}(\alpha + y, \, \beta + n - y).$
For a beta distribution with shape parameters $a > 1$ and $b > 1$, the mode is
$\mathrm{mode} = \frac{a - 1}{a + b - 2}.$
Thus, in the beta–binomial case, the MAP estimate of $\theta$ is
$\hat{\theta}_{\mathrm{MAP}} = \frac{\alpha + y - 1}{\alpha + \beta + n - 2}.$
If one of the shape parameters is $\le 1$, the density is peaked at the corresponding boundary, and the mode is at 0 or 1.
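As a quick check of the closed-form mode, here is a minimal Python sketch; the prior parameters and data below are illustrative, not taken from the lecture:

```python
import numpy as np
from scipy.stats import beta

# Illustrative numbers: Beta(2, 2) prior, y = 7 successes in n = 10 trials
alpha0, beta0 = 2, 2
y, n = 7, 10
a, b = alpha0 + y, beta0 + n - y      # posterior is Beta(a, b) = Beta(9, 5)

theta_map = (a - 1) / (a + b - 2)     # closed-form mode, valid for a, b > 1
print(f"MAP estimate: {theta_map:.3f}")

# Numerical check: the posterior density peaks at (approximately) the same point
grid = np.linspace(0.001, 0.999, 9999)
print(f"Grid maximiser: {grid[np.argmax(beta.pdf(grid, a, b))]:.3f}")
```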
Estimation from posterior samples¶
In realistic Bayesian models, the posterior is usually not available in closed form. Instead, we obtain Monte Carlo samples $\theta^{(1)}, \ldots, \theta^{(S)}$ from $p(\theta \mid y)$ via MCMC.
Posterior summaries are then computed as sample statistics:
Approximate posterior mean: $\bar{\theta} = \frac{1}{S} \sum_{s=1}^{S} \theta^{(s)}$
Approximate posterior variance: $\widehat{\operatorname{Var}}[\theta \mid y] = \frac{1}{S - 1} \sum_{s=1}^{S} \big(\theta^{(s)} - \bar{\theta}\big)^2$
Approximate posterior standard deviation: the square root of the approximate variance.
Computing the mode from samples is less direct: one must estimate the density (e.g. via kernel density estimation) and find its maximum numerically.
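A minimal sketch of these sample-based summaries, including a KDE-based mode; the Beta draws below stand in for real MCMC output:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
samples = rng.beta(9, 5, size=10_000)   # stand-in for MCMC draws from p(theta | y)

post_mean = samples.mean()              # approximate posterior mean
post_sd = samples.std(ddof=1)           # approximate posterior sd

# Mode via kernel density estimation: fit a KDE and maximise it on a grid
kde = gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 2000)
post_mode = grid[np.argmax(kde(grid))]

print(f"mean {post_mean:.3f}, sd {post_sd:.3f}, KDE mode {post_mode:.3f}")
```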
Credible intervals¶
A credible interval for a parameter $\theta$ is an interval $[L, U]$ such that the posterior probability that $\theta$ lies in the interval equals some chosen level $1 - \alpha$ (e.g. 0.8, 0.89, 0.95):
$P(L \le \theta \le U \mid y) = 1 - \alpha.$
For example, an 80% credible interval satisfies $P(L \le \theta \le U \mid y) = 0.8$.
Two common types of credible intervals:
Central (middle) credible interval of level $1 - \alpha$:
$[\, q_{\alpha/2}, \; q_{1 - \alpha/2} \,],$
where $q_p$ is the $p$-quantile of the posterior.
Highest Density Interval (HDI) of level $1 - \alpha$: an interval with
posterior probability mass $1 - \alpha$,
and maximum posterior density inside the interval compared to outside.
Credible intervals are also sometimes called compatibility intervals because they describe the range of parameter values most compatible with the observed data and the model.
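From posterior samples, a central credible interval is just a pair of quantiles. A minimal NumPy sketch, again with stand-in draws:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.beta(9, 5, size=10_000)    # stand-in posterior draws

level = 0.80                             # credible level 1 - alpha
alpha = 1 - level
lower, upper = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
print(f"{level:.0%} central credible interval: [{lower:.3f}, {upper:.3f}]")
```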
Highest Density Interval (HDI)¶
The Highest Density Interval (HDI) of level $1 - \alpha$ is the smallest interval $[L, U]$ such that
it contains probability mass $1 - \alpha$: $P(L \le \theta \le U \mid y) = 1 - \alpha$;
every point inside the interval has higher posterior density than any point outside: $p(\theta \mid y) \ge p(\theta' \mid y)$ for all $\theta \in [L, U]$ and $\theta' \notin [L, U]$.
Intuition:
If the posterior is symmetric and unimodal (e.g. nearly normal), the HDI and central interval are very similar.
If the posterior is skewed or multimodal, the HDI focuses on the region(s) of highest posterior density.
A conceptual algorithm for computing an HDI from a univariate posterior:
Choose a probability level $1 - \alpha$ (e.g. 0.8 or 0.94).
Consider all intervals $[L, U]$ that contain probability mass $1 - \alpha$.
Among these, select the interval with the smallest width $U - L$.
In practice, for posterior samples, libraries such as ArviZ (az.hdi) approximate the HDI directly.
Note: PyMC / ArviZ often use an HDI probability of 94% by default. This is a reminder that, unlike the conventional 95% frequentist confidence interval, the exact choice of probability (e.g. 89%, 94%, 95%) is somewhat arbitrary and should be chosen to communicate uncertainty effectively.
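The conceptual algorithm above translates directly into a few lines of NumPy: sort the samples, slide a window containing a fraction $1 - \alpha$ of them, and keep the narrowest window. This hand-rolled version is only a sketch; in practice az.hdi does this for you.

```python
import numpy as np

def hdi_from_samples(samples, hdi_prob=0.94):
    """Smallest-width interval containing a fraction hdi_prob of the samples."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    m = int(np.ceil(hdi_prob * n))        # number of points the interval must cover
    widths = x[m - 1:] - x[: n - m + 1]   # width of every candidate interval
    i = np.argmin(widths)                 # index of the narrowest candidate
    return x[i], x[i + m - 1]

rng = np.random.default_rng(0)
samples = rng.beta(9, 5, size=10_000)
print(hdi_from_samples(samples, hdi_prob=0.94))
# Compare with ArviZ: az.hdi(samples, hdi_prob=0.94)
```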
Credible intervals vs confidence intervals¶
Bayesian credible interval (CI) for $\theta$ (level $1 - \alpha$):
Definition: $P(\theta \in [L, U] \mid y) = 1 - \alpha$.
Interpretation:
Given the data and the model, the probability that $\theta$ lies in $[L, U]$ is $1 - \alpha$.
Probability statements are about the parameter $\theta$ (which is uncertain).
Frequentist confidence interval (CI) for $\theta$ (level $1 - \alpha$):
Definition (informally): If we repeatedly collect data sets under identical conditions and construct an interval from each sample using a specified procedure, then in the long run a fraction $1 - \alpha$ of these intervals will contain the true parameter $\theta$.
Interpretation:
The procedure has coverage probability $1 - \alpha$; in repeated sampling, a $(1 - \alpha)$-fraction of the intervals contain $\theta$.
Incorrect but common statement (to avoid):
“Our particular confidence interval contains the true value with probability $1 - \alpha$.”
In the frequentist framework, $\theta$ is not random, so this is not strictly correct.
Key differences:
Bayesian CIs express a degree of belief about the parameter in light of data and priors.
Frequentist CIs are statements about the procedure and long-run frequency properties, not about a single realized interval.
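The long-run reading of a confidence interval can be demonstrated by simulation. A minimal sketch using the approximate (Wald) interval for a proportion; the true proportion, sample size, and number of replications below are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta_true, n, reps = 0.3, 200, 10_000    # illustrative numbers
z = norm.ppf(0.975)                       # 95% two-sided normal quantile

y = rng.binomial(n, theta_true, size=reps)
theta_hat = y / n
half = z * np.sqrt(theta_hat * (1 - theta_hat) / n)
covered = (theta_hat - half <= theta_true) & (theta_true <= theta_hat + half)
print(f"Empirical coverage: {covered.mean():.3f}")   # close to the nominal 0.95
```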
Posterior predictive distribution¶
Given data $y$ and parameters $\theta$, the likelihood $p(\tilde{y} \mid \theta)$ describes the distribution of future or hypothetical observations $\tilde{y}$ given a fixed value of $\theta$.
In Bayesian inference, we do not know $\theta$ exactly, but we have a posterior $p(\theta \mid y)$. The predictive distribution for future data $\tilde{y}$ given observed data $y$ is obtained by marginalising out $\theta$:
$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta.$
This distribution is called the posterior predictive distribution.
Interpretation:
$p(\tilde{y} \mid y)$ is a model-based forecast that averages the likelihood over all plausible values of $\theta$, weighted by their posterior probability.
It naturally accounts for parameter uncertainty as well as sampling variability.
Posterior predictive simulation from samples¶
In practice, we usually have posterior samples $\theta^{(1)}, \ldots, \theta^{(S)}$ from $p(\theta \mid y)$.
We can approximate $p(\tilde{y} \mid y)$ by simulation:
Draw $\theta^{(s)} \sim p(\theta \mid y)$ (these are the MCMC samples).
For each $\theta^{(s)}$, draw
$\tilde{y}^{(s)} \sim p(\tilde{y} \mid \theta^{(s)})$
from the likelihood (for example, binomial, normal, Poisson, etc.).
The collection $\tilde{y}^{(1)}, \ldots, \tilde{y}^{(S)}$ is a sample from the posterior predictive distribution.
Posterior predictive summaries:
Predictive mean: $\frac{1}{S} \sum_{s=1}^{S} \tilde{y}^{(s)}$
Predictive variance and standard deviation: sample variance and SD of the $\tilde{y}^{(s)}$.
Predictive credible intervals: quantiles or HDIs of the $\tilde{y}^{(s)}$.
This method avoids explicit evaluation of the integral
$\int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta$
and works for essentially arbitrary models.
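A minimal simulation sketch for the beta–binomial case, continuing the illustrative Beta(9, 5) posterior from the MAP example above; the size of the future experiment is also illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 9, 5                # posterior Beta(9, 5), as in the MAP example above
S = 10_000                 # number of posterior predictive draws
n_new = 10                 # size of the hypothetical future experiment

theta_s = rng.beta(a, b, size=S)          # step 1: posterior draws of theta
y_tilde = rng.binomial(n_new, theta_s)    # step 2: one future y per theta draw

print(f"predictive mean: {y_tilde.mean():.2f}")
print(f"predictive sd:   {y_tilde.std(ddof=1):.2f}")
print(f"80% predictive interval: {np.quantile(y_tilde, [0.10, 0.90])}")
```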
Decomposing predictive uncertainty: aleatoric vs epistemic¶
The predictive distribution mixes two kinds of uncertainty:
Aleatoric uncertainty (sampling variability, inherent noise): variability in $\tilde{y}$ even if $\theta$ were known exactly.
Epistemic uncertainty (model/parameter uncertainty): additional variability due to our imperfect knowledge of $\theta$, represented by the posterior $p(\theta \mid y)$.
The decomposition is captured by the law of total variance. Let $\tilde{y}$ denote a future observation and $\theta$ the parameter. Then
$\operatorname{Var}(\tilde{y} \mid y) = \mathbb{E}_{\theta \mid y}\big[\operatorname{Var}(\tilde{y} \mid \theta)\big] + \operatorname{Var}_{\theta \mid y}\big(\mathbb{E}[\tilde{y} \mid \theta]\big).$
Interpretation:
The first term, $\mathbb{E}_{\theta \mid y}\big[\operatorname{Var}(\tilde{y} \mid \theta)\big]$,
is the aleatoric variance: average inherent noise in $\tilde{y}$ for a given $\theta$, averaged over the posterior of $\theta$.
The second term, $\operatorname{Var}_{\theta \mid y}\big(\mathbb{E}[\tilde{y} \mid \theta]\big)$,
is the epistemic variance: variability in the predictive mean due to uncertainty in $\theta$.
As we collect more data:
the epistemic component typically decreases (posterior concentrates),
the aleatoric component remains (it is intrinsic to the data-generating process).
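For the binomial likelihood, both terms are available in closed form given $\theta$ ($\operatorname{Var}(\tilde{y} \mid \theta) = n\theta(1-\theta)$ and $\mathbb{E}[\tilde{y} \mid \theta] = n\theta$), so the decomposition can be estimated directly from posterior draws. A sketch, continuing the illustrative Beta(9, 5) example:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 9, 5                      # posterior Beta(9, 5) as before
S, n_new = 100_000, 10

theta_s = rng.beta(a, b, size=S)

aleatoric = np.mean(n_new * theta_s * (1 - theta_s))  # E[ Var(y~ | theta) ]
epistemic = np.var(n_new * theta_s)                   # Var( E[y~ | theta] )

print(f"aleatoric: {aleatoric:.3f}")
print(f"epistemic: {epistemic:.3f}")
print(f"total predictive variance: {aleatoric + epistemic:.3f}")
```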
Frequentist hypothesis testing for a proportion¶
Consider testing a statement about a population proportion $\theta$ based on binomial data $y \sim \mathrm{Binomial}(n, \theta)$.
Example one-sided test:
Null hypothesis: $H_0: \theta = \theta_0$
Alternative hypothesis: $H_1: \theta > \theta_0$
A typical frequentist workflow:
Estimator
Use the sample proportion $\hat{\theta} = y / n$.
Test statistic (approximate test)
Under $H_0$ (with large $n$), the standardized test statistic
$z = \frac{\hat{\theta} - \theta_0}{\sqrt{\theta_0 (1 - \theta_0) / n}}$
is approximately standard normal: $z \overset{\text{approx.}}{\sim} N(0, 1)$.
p-value
For the one-sided test, the p-value is
$p = P(Z \ge z_{\mathrm{obs}}),$
where $z_{\mathrm{obs}}$ is the observed test statistic.
Decision
Choose a significance level $\alpha$ (e.g. $\alpha = 0.05$). If the p-value is less than $\alpha$, reject $H_0$; otherwise, do not reject $H_0$.
Note that in the frequentist framework we never assign a probability to $H_0$ or $H_1$ themselves; we only evaluate the probability of data or statistics under $H_0$.
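A minimal sketch of this workflow with illustrative data ($y = 60$ successes in $n = 100$ trials, testing $\theta_0 = 0.5$):

```python
import numpy as np
from scipy.stats import norm

y, n, theta0 = 60, 100, 0.5                # illustrative numbers

theta_hat = y / n
z_obs = (theta_hat - theta0) / np.sqrt(theta0 * (1 - theta0) / n)
p_value = norm.sf(z_obs)                   # one-sided: P(Z >= z_obs) under H0

print(f"z = {z_obs:.3f}, one-sided p-value = {p_value:.4f}")
# At alpha = 0.05 we would reject H0 here, since p < 0.05.
```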
Bayesian hypothesis testing, posterior odds, and Bayes factors¶
In the Bayesian framework, hypotheses are treated like any other propositions to which we can assign prior and posterior probabilities.
Consider two competing hypotheses $H_0$ and $H_1$ (e.g., statements about a parameter range).
Prior probabilities: $P(H_0)$ and $P(H_1)$
Prior odds: $\dfrac{P(H_1)}{P(H_0)}$
Given data $y$, we can compute posterior probabilities $P(H_0 \mid y)$ and $P(H_1 \mid y)$ and the posterior odds:
$\frac{P(H_1 \mid y)}{P(H_0 \mid y)}$
The Bayes factor in favour of $H_1$ against $H_0$ is defined as
$BF_{10} = \frac{p(y \mid H_1)}{p(y \mid H_0)},$
where $p(y \mid H_i)$ is the marginal likelihood of $y$ under hypothesis $H_i$.
Bayes factor and odds are related via
$\frac{P(H_1 \mid y)}{P(H_0 \mid y)} = BF_{10} \times \frac{P(H_1)}{P(H_0)}.$
Interpretation:
$BF_{10} > 1$: data provide evidence in favour of $H_1$ over $H_0$.
$BF_{10} < 1$: data favour $H_0$.
The Bayes factor measures how much the data change our odds between hypotheses.
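A minimal sketch of a Bayes factor computation, assuming a point null $H_0: \theta = 0.5$ against an alternative with a uniform Beta(1, 1) prior on $\theta$; the data are illustrative, and under $H_1$ the marginal likelihood is the beta–binomial pmf:

```python
from scipy.stats import binom, betabinom

y, n = 60, 100                       # illustrative data

m0 = binom.pmf(y, n, 0.5)            # marginal likelihood under H0 (point null)
m1 = betabinom.pmf(y, n, 1, 1)       # marginal likelihood under H1 (uniform prior)

bf_10 = m1 / m0
print(f"BF_10 = {bf_10:.3f}")        # > 1 favours H1, < 1 favours H0
```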
Interpreting Bayes factors¶
A rough guideline for interpreting the magnitude of $BF_{10}$:
$BF_{10}$ 1–3: not worth more than a bare mention
$BF_{10}$ 3–10: substantial evidence for $H_1$
$BF_{10}$ 10–100: strong evidence for $H_1$
$BF_{10} > 100$: decisive evidence for $H_1$
Similarly, very small values (e.g. $BF_{10} < 1/10$) provide strong evidence in favour of $H_0$.
These rules of thumb should not be used as rigid thresholds (like the conventional $\alpha = 0.05$ for p-values), but rather as a way to interpret the strength of evidence provided by the data.
Two-sided Bayesian tests and ROPEs¶
For continuous parameters, testing an exact point hypothesis such as $H_0: \theta = \theta_0$ is problematic because, under a continuous posterior, $P(\theta = \theta_0 \mid y) = 0$.
A practical workaround is the concept of a Region of Practical Equivalence (ROPE).
Example:
Suppose we are interested in whether a proportion $\theta$ is effectively equal to some value $\theta_0$ (e.g. $\theta_0 = 0.93$).
Choose a small tolerance $\varepsilon$ representing a region of values $[\theta_0 - \varepsilon, \theta_0 + \varepsilon]$ that are practically indistinguishable from $\theta_0$.
Define the hypotheses:
Null hypothesis (practical equivalence): $H_0: |\theta - \theta_0| \le \varepsilon$
Alternative hypothesis: $H_1: |\theta - \theta_0| > \varepsilon$
We can then compute:
$P(H_0 \mid y) = P(\theta_0 - \varepsilon \le \theta \le \theta_0 + \varepsilon \mid y)$,
$P(H_1 \mid y) = 1 - P(H_0 \mid y)$,
and posterior odds
$\frac{P(H_0 \mid y)}{P(H_1 \mid y)}.$
If the posterior puts most of its mass inside the ROPE, the data support practical equivalence. If most mass lies outside, the data support a meaningful difference.
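From posterior samples, these ROPE probabilities are simple proportions. A minimal sketch with stand-in draws; $\theta_0$ and $\varepsilon$ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.beta(9, 5, size=10_000)     # stand-in posterior draws of theta

theta0, eps = 0.5, 0.05                   # ROPE = [0.45, 0.55], illustrative
p_h0 = np.mean(np.abs(samples - theta0) <= eps)   # P(H0 | y)
p_h1 = 1 - p_h0                                   # P(H1 | y)

print(f"P(H0 | y) = {p_h0:.3f}, P(H1 | y) = {p_h1:.3f}")
print(f"posterior odds H0:H1 = {p_h0 / p_h1:.3f}")
```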
Frequentist vs Bayesian hypothesis tests¶
Key conceptual differences:
Frequentist tests:
Specify a null hypothesis $H_0$ and often an alternative $H_1$.
Compute a test statistic and a p-value $p$.
May reject $H_0$ if the p-value is below a chosen threshold $\alpha$.
Do not provide probabilities for hypotheses themselves.
Bayesian tests:
Assign prior probabilities to hypotheses (or models).
Compute posterior probabilities and posterior odds.
Use Bayes factors to quantify how strongly data support one hypothesis over another.
Allow direct statements such as “Given the data and prior, we believe with probability $P(H_0 \mid y)$ that $H_0$ is true”.
The two approaches can yield different practical conclusions, especially when:
There is strong prior belief in a particular hypothesis, and
The observed data under that hypothesis are unlikely but not impossible.
Bayesian methods make it explicit that one surprising data set should not necessarily overturn a well-established theory if the prior evidence for that theory is overwhelming.