
Estimation, Prediction, Hypothesis Tests

Estimation: summarizing posterior distributions

After performing Bayesian inference, we obtain a posterior distribution $p(\theta \mid d)$ for parameters $\theta$ given data $d$. To communicate results, we usually summarize this distribution with a few key numbers.

For a scalar parameter $\theta$:

  • Posterior mean

    $$\mathbb{E}[\theta \mid d] = \int \theta \, p(\theta \mid d)\, d\theta.$$

  • Posterior variance

    $$\operatorname{Var}(\theta \mid d) = \mathbb{E}[\theta^2 \mid d] - \big(\mathbb{E}[\theta \mid d]\big)^2 = \int (\theta - \mu)^2 \, p(\theta \mid d)\, d\theta, \quad \mu = \mathbb{E}[\theta \mid d].$$

  • Posterior standard deviation

    $$\operatorname{SD}(\theta \mid d) = \sqrt{\operatorname{Var}(\theta \mid d)}.$$

  • Posterior mode / MAP estimate

    The maximum a posteriori (MAP) estimate is

    $$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid d) = \arg\max_{\theta} \, p(d \mid \theta)\, p(\theta),$$

    i.e. the value of $\theta$ where the posterior density is largest.

MAP estimate for the beta–binomial model

In the beta–binomial model from earlier weeks,

  • Prior: $\pi \sim \operatorname{Beta}(\alpha, \beta)$,

  • Likelihood: $Y \mid \pi \sim \operatorname{Bin}(n, \pi)$,

the posterior is again beta:

$$\pi \mid Y = k \sim \operatorname{Beta}(\alpha + k,\; \beta + n - k).$$

For a beta distribution with shape parameters $\alpha' > 1$ and $\beta' > 1$, the mode is

$$\operatorname{mode}\big(\operatorname{Beta}(\alpha', \beta')\big) = \frac{\alpha' - 1}{\alpha' + \beta' - 2}.$$

Thus, in the beta–binomial case, the MAP estimate of $\pi$ is

$$\hat{\pi}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}, \quad \text{for } \alpha + k > 1,\; \beta + n - k > 1.$$

If a posterior shape parameter is $\le 1$, the density is maximal at a boundary rather than in the interior, and the mode lies at 0 or 1 (for $\alpha' = \beta' = 1$ the density is flat and the mode is not unique).
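To make this concrete, here is a minimal sketch in Python, assuming illustrative values ($\alpha = \beta = 2$, $n = 20$, $k = 14$, none of which come from the text), that compares the closed-form MAP with a direct numerical maximization of the Beta posterior via SciPy:

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

# Illustrative prior and data (assumed values, not from the text)
a, b = 2, 2        # prior Beta(alpha, beta)
n, k = 20, 14      # observed k successes in n trials

# Posterior is Beta(alpha + k, beta + n - k)
a_post, b_post = a + k, b + n - k

# Closed-form MAP (valid because both posterior shape parameters exceed 1)
map_closed_form = (a_post - 1) / (a_post + b_post - 2)

# Numerical MAP: maximize the posterior density on (0, 1)
res = minimize_scalar(lambda p: -beta.pdf(p, a_post, b_post),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")

print(f"closed-form MAP: {map_closed_form:.4f}")   # 15/22 = 0.6818...
print(f"numerical MAP:   {res.x:.4f}")
print(f"posterior mean:  {beta.mean(a_post, b_post):.4f}")
print(f"posterior SD:    {beta.std(a_post, b_post):.4f}")
```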

Estimation from posterior samples

In realistic Bayesian models, the posterior $p(\theta \mid d)$ is usually not available in closed form. Instead, we obtain Monte Carlo samples $\{\theta_i\}_{i=1}^N$ from $p(\theta \mid d)$ via MCMC.

Posterior summaries are then computed as sample statistics:

  • Approximate posterior mean:

    $$\widehat{\mathbb{E}}[\theta \mid d] = \frac{1}{N} \sum_{i=1}^{N} \theta_i.$$

  • Approximate posterior variance:

    $$\widehat{\operatorname{Var}}(\theta \mid d) = \frac{1}{N-1} \sum_{i=1}^{N} (\theta_i - \bar{\theta})^2, \quad \bar{\theta} = \frac{1}{N}\sum_{i=1}^{N} \theta_i.$$

  • Approximate posterior standard deviation:

    $$\widehat{\operatorname{SD}}(\theta \mid d) = \sqrt{\widehat{\operatorname{Var}}(\theta \mid d)}.$$

Computing the mode from samples is less direct: one must estimate the density (e.g. via kernel density estimation) and find its maximum numerically.
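A minimal sketch of these sample-based summaries, using synthetic Beta draws as a stand-in for real MCMC output, with the mode approximated by kernel density estimation as described above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for MCMC output: draws from some posterior (a Beta, for illustration)
samples = rng.beta(16, 8, size=5000)

post_mean = samples.mean()
post_var = samples.var(ddof=1)   # ddof=1 matches the 1/(N-1) estimator above
post_sd = np.sqrt(post_var)

# Approximate mode: fit a KDE and maximize it on a grid
kde = gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 1000)
post_mode = grid[np.argmax(kde(grid))]

print(f"mean {post_mean:.3f}, sd {post_sd:.3f}, mode ~ {post_mode:.3f}")
```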

Credible intervals

A credible interval for a parameter $\theta$ is an interval $[a,b]$ such that the posterior probability that $\theta$ lies in the interval equals some chosen level $p$ (e.g. 0.8, 0.89, 0.95):

$$P(\theta \in [a,b] \mid d) = \int_{a}^{b} p(\theta \mid d)\, d\theta = p.$$

For example, an 80% credible interval $[a,b]$ satisfies

$$P(\theta \in [a,b] \mid d) = 0.8.$$

Two common types of credible intervals:

  1. Central (middle) credible interval of level $p$:

    $$\theta \in \big[ q_{(1-p)/2}, \; q_{(1+p)/2} \big],$$

    where $q_r$ is the $r$-quantile of the posterior.

  2. Highest Density Interval (HDI) of level $p$: an interval $[a,b]$ with

    • posterior probability mass $p$,

    • and maximum posterior density inside the interval compared to outside.

Credible intervals are also sometimes called compatibility intervals because they describe the range of parameter values most compatible with the observed data and the model.
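For posterior samples, the central interval of level $p$ is just two empirical quantiles. A minimal sketch (again with synthetic draws standing in for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.beta(16, 8, size=5000)  # stand-in for posterior draws

p = 0.80  # credibility level
lo, hi = np.quantile(samples, [(1 - p) / 2, (1 + p) / 2])
print(f"central {p:.0%} credible interval: [{lo:.3f}, {hi:.3f}]")
```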

Highest Density Interval (HDI)

The Highest Density Interval (HDI) of level $p$ is the smallest interval $[a,b]$ such that

  • it contains probability mass $p$:

    $$\int_{a}^{b} p(\theta \mid d)\, d\theta = p,$$

  • every point inside the interval has higher posterior density than any point outside:

    $$p(\theta \mid d) \ge p(\theta' \mid d) \quad \text{for all } \theta \in [a,b],\; \theta' \notin [a,b].$$

Intuition:

  • If the posterior is symmetric and unimodal (e.g. nearly normal), the HDI and central interval are very similar.

  • If the posterior is skewed or multimodal, the HDI focuses on the region(s) of highest posterior density.

A conceptual algorithm for computing an HDI from a univariate posterior:

  1. Choose a probability level $p$ (e.g. 0.8 or 0.94).

  2. Consider all intervals that contain probability mass $p$.

  3. Among these, select the interval with the smallest width $b-a$.

In practice, for posterior samples, libraries such as ArviZ (az.hdi) approximate the HDI directly.
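The smallest-width idea translates directly to sorted samples: slide a window containing a fraction $p$ of the draws and keep the narrowest one. A minimal sketch, compared against az.hdi (assuming ArviZ is installed; the samples are synthetic stand-ins):

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
samples = rng.beta(16, 8, size=5000)  # stand-in for posterior draws

def hdi_from_samples(draws, p=0.94):
    """Narrowest interval containing a fraction p of the sorted draws."""
    sorted_draws = np.sort(draws)
    n_in = int(np.floor(p * len(sorted_draws)))   # points inside the window
    widths = sorted_draws[n_in:] - sorted_draws[:-n_in]
    i = np.argmin(widths)                          # narrowest window wins
    return sorted_draws[i], sorted_draws[i + n_in]

print("manual HDI:", hdi_from_samples(samples, p=0.94))
print("ArviZ HDI: ", az.hdi(samples, hdi_prob=0.94))
```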

Note: PyMC / ArviZ often use an HDI probability of 94% by default. This is a reminder that, unlike the conventional 95% frequentist confidence interval, the exact choice of probability (e.g. 89%, 94%, 95%) is somewhat arbitrary and should be chosen to communicate uncertainty effectively.

Credible intervals vs confidence intervals

Bayesian credible interval (CI) for $\theta$ (level $p$):

  • Definition:

    $$P(\theta \in [a,b] \mid d) = p.$$

  • Interpretation:

    Given the data and the model, the probability that $\theta$ lies in $[a,b]$ is $p$.

  • Probability statements are about the parameter $\theta$ (which is uncertain).

Frequentist confidence interval (CI) for $\theta$ (level $p$):

  • Definition (informally): If we repeatedly collect data sets under identical conditions and construct an interval $[a^{(j)}, b^{(j)}]$ from each sample using a specified procedure, then in the long run a fraction $p$ of these intervals will contain the true parameter $\theta_{\text{true}}$.

  • Interpretation:

    The procedure has coverage probability $p$; in repeated sampling, a fraction $p$ of the intervals contain $\theta_{\text{true}}$.

  • Incorrect but common statement (to avoid):

    “Our particular confidence interval contains the true value with probability $p$.”
    In the frequentist framework, $\theta_{\text{true}}$ is not random, so this is not strictly correct.

Key differences:

  • Bayesian CIs express a degree of belief about the parameter in light of data and priors.

  • Frequentist CIs are statements about the procedure and long-run frequency properties, not about a single realized interval.
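The frequentist coverage statement can be checked by simulation. A minimal sketch, assuming an illustrative true proportion $\pi_{\text{true}} = 0.3$ and the Wald-type interval $\hat{\pi} \pm z \sqrt{\hat{\pi}(1-\hat{\pi})/n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
pi_true, n, level = 0.3, 100, 0.95  # illustrative, assumed values
z = 1.96  # approximate 97.5% standard-normal quantile for a 95% interval

covered = 0
n_reps = 10_000
for _ in range(n_reps):
    y = rng.binomial(n, pi_true)          # a fresh data set
    pi_hat = y / n
    se = np.sqrt(pi_hat * (1 - pi_hat) / n)
    if pi_hat - z * se <= pi_true <= pi_hat + z * se:
        covered += 1

print(f"empirical coverage: {covered / n_reps:.3f}  (nominal {level})")
```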

Posterior predictive distribution

Given data $y$ and parameters $\theta$, the likelihood $p(y_{\text{new}} \mid \theta)$ describes the distribution of future or hypothetical observations $y_{\text{new}}$ given a fixed value of $\theta$.

In Bayesian inference, we do not know $\theta$ exactly, but we have a posterior $p(\theta \mid y)$. The predictive distribution for future data $y_{\text{new}}$ given observed data $y$ is obtained by marginalising out $\theta$:

$$p(y_{\text{new}} \mid y) = \int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y)\, d\theta.$$

This distribution is called the posterior predictive distribution.

Interpretation:

  • $p(y_{\text{new}} \mid y)$ is a model-based forecast that averages the likelihood over all plausible values of $\theta$, weighted by their posterior probability.

  • It naturally accounts for parameter uncertainty as well as sampling variability.

Posterior predictive simulation from samples

In practice, we usually have posterior samples $\{\theta_i\}_{i=1}^{N}$ from $p(\theta \mid y)$.

We can approximate $p(y_{\text{new}} \mid y)$ by simulation:

  1. Draw $\theta_i \sim p(\theta \mid y)$ (these are the MCMC samples).

  2. For each $\theta_i$, draw

    $$y_{\text{new}, i} \sim p(y_{\text{new}} \mid \theta_i)$$

    from the likelihood (for example, binomial, normal, Poisson, etc.).

  3. The collection $\{y_{\text{new}, i}\}_{i=1}^{N}$ is a sample from the posterior predictive distribution.
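A minimal sketch of this three-step recipe for the beta–binomial model (same illustrative prior and data as in the MAP example above; the posterior draws are exact here, but in general they would come from MCMC):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, k = 2, 2, 20, 14      # illustrative prior and data, as assumed above
n_new, N = 20, 10_000          # future trial count and number of draws

# Step 1: posterior draws (exact here; in general these come from MCMC)
pi_draws = rng.beta(a + k, b + n - k, size=N)

# Step 2: one simulated future data set per posterior draw
y_new = rng.binomial(n_new, pi_draws)

# Step 3: y_new is a sample from the posterior predictive distribution
print(f"predictive mean: {y_new.mean():.2f}")
print(f"predictive sd:   {y_new.std(ddof=1):.2f}")
print("80% predictive interval:", np.quantile(y_new, [0.1, 0.9]))
```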

Posterior predictive summaries:

  • Predictive mean:

    $$\widehat{\mathbb{E}}[y_{\text{new}} \mid y] = \frac{1}{N} \sum_{i=1}^N y_{\text{new}, i}.$$

  • Predictive variance and standard deviation: sample variance and SD of the $y_{\text{new},i}$.

  • Predictive credible intervals: quantiles or HDIs of the $y_{\text{new},i}$.

This method avoids explicit evaluation of the integral

$$\int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y)\, d\theta$$

and works for essentially arbitrary models.

Decomposing predictive uncertainty: aleatoric vs epistemic

The predictive distribution $p(y_{\text{new}} \mid y)$ mixes two kinds of uncertainty:

  • Aleatoric uncertainty (sampling variability, inherent noise): variability in $y_{\text{new}}$ even if $\theta$ were known exactly.

  • Epistemic uncertainty (model/parameter uncertainty): additional variability due to our imperfect knowledge of $\theta$, represented by the posterior $p(\theta \mid y)$.

The decomposition is captured by the law of total variance. Let $Y_{\text{new}}$ denote a future observation and $\Theta$ the parameter. Then

$$\operatorname{Var}(Y_{\text{new}} \mid y) = \mathbb{E}_{\Theta \mid y} \big[\, \operatorname{Var}(Y_{\text{new}} \mid \Theta, y) \,\big] \;+\; \operatorname{Var}_{\Theta \mid y} \big[\, \mathbb{E}(Y_{\text{new}} \mid \Theta, y) \,\big].$$

Interpretation:

  • The first term

    $$\mathbb{E}_{\Theta \mid y} \big[\, \operatorname{Var}(Y_{\text{new}} \mid \Theta, y) \,\big]$$

    is the aleatoric variance: average inherent noise in $Y_{\text{new}}$ for a given $\Theta$, averaged over the posterior of $\Theta$.

  • The second term

    $$\operatorname{Var}_{\Theta \mid y} \big[\, \mathbb{E}(Y_{\text{new}} \mid \Theta, y) \,\big]$$

    is the epistemic variance: variability in the predictive mean due to uncertainty in $\Theta$.

As we collect more data:

  • the epistemic component typically decreases (posterior concentrates),

  • the aleatoric component remains (it is intrinsic to the data-generating process).
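For the beta–binomial predictive simulation, both terms have closed forms given $\pi$: $\operatorname{Var}(Y_{\text{new}} \mid \pi) = n_{\text{new}}\,\pi(1-\pi)$ and $\mathbb{E}(Y_{\text{new}} \mid \pi) = n_{\text{new}}\,\pi$. A minimal sketch (continuing the illustrative setup) that checks the two components against the total predictive variance:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, k, n_new, N = 2, 2, 20, 14, 20, 100_000  # illustrative values

pi_draws = rng.beta(a + k, b + n - k, size=N)
y_new = rng.binomial(n_new, pi_draws)

# Aleatoric: posterior average of Var(Y_new | pi) = n_new * pi * (1 - pi)
aleatoric = np.mean(n_new * pi_draws * (1 - pi_draws))
# Epistemic: posterior variance of E(Y_new | pi) = n_new * pi
epistemic = np.var(n_new * pi_draws, ddof=1)

print(f"aleatoric + epistemic: {aleatoric + epistemic:.3f}")
print(f"total predictive var:  {np.var(y_new, ddof=1):.3f}")  # should roughly match
```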

Frequentist hypothesis testing for a proportion

Consider testing a statement about a population proportion $\pi$ based on binomial data $Y \sim \operatorname{Bin}(n, \pi)$.

Example one-sided test:

  • Null hypothesis:

    $$H_0: \pi \le \pi_0$$

  • Alternative hypothesis:

    $$H_1: \pi > \pi_0.$$

A typical frequentist workflow:

  1. Estimator
    Use the sample proportion

    $$\hat{\pi} = \frac{Y}{n}.$$

  2. Test statistic (approximate $z$ test)
    Under $H_0$ (with large $n$), the standardized test statistic

    $$Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$$

    is approximately standard normal: $Z \approx \mathcal{N}(0,1)$.

  3. p-value
    For the one-sided test, the p-value is

    $$p\text{-value} = P(Z \ge z_{\text{obs}} \mid H_0),$$

    where $z_{\text{obs}}$ is the observed test statistic.

  4. Decision
    Choose a significance level $\alpha$ (e.g. $\alpha = 0.05$). If the p-value is $\le \alpha$, reject $H_0$; otherwise, do not reject $H_0$.
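A minimal sketch of this workflow, assuming illustrative values $\pi_0 = 0.5$, $n = 100$, $y = 62$:

```python
import numpy as np
from scipy.stats import norm

pi0, n, y = 0.5, 100, 62   # illustrative null value and data (assumed)
alpha = 0.05

pi_hat = y / n                                         # 1. estimator
z_obs = (pi_hat - pi0) / np.sqrt(pi0 * (1 - pi0) / n)  # 2. test statistic
p_value = norm.sf(z_obs)                               # 3. one-sided P(Z >= z_obs)

print(f"z = {z_obs:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")  # 4. decision
```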

Note that in the frequentist framework we never assign a probability to $H_0$ or $H_1$ themselves; we only evaluate the probability of data or statistics under $H_0$.

Bayesian hypothesis testing, posterior odds, and Bayes factors

In the Bayesian framework, hypotheses are treated like any other propositions to which we can assign prior and posterior probabilities.

Consider two competing hypotheses $H_0$ and $H_1$ (e.g., statements about a parameter range).

  • Prior probabilities: $P(H_0)$ and $P(H_1)$

  • Prior odds:

    $$\frac{P(H_1)}{P(H_0)}.$$

Given data $d$, we can compute posterior probabilities $P(H_0 \mid d)$ and $P(H_1 \mid d)$ and the posterior odds:

$$\frac{P(H_1 \mid d)}{P(H_0 \mid d)}.$$

The Bayes factor in favour of $H_1$ against $H_0$ is defined as

$$\operatorname{BF}_{10} = \frac{p(d \mid H_1)}{p(d \mid H_0)},$$

where $p(d \mid H)$ is the marginal likelihood of $d$ under hypothesis $H$.

Posterior odds, prior odds, and the Bayes factor are related via

$$\frac{P(H_1 \mid d)}{P(H_0 \mid d)} = \operatorname{BF}_{10} \cdot \frac{P(H_1)}{P(H_0)}.$$

Interpretation:

  • $\operatorname{BF}_{10} > 1$: data provide evidence in favour of $H_1$ over $H_0$.

  • $\operatorname{BF}_{10} < 1$: data favour $H_0$.

  • The Bayes factor measures how much the data change our odds between hypotheses.
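As a concrete (assumed, not from the text) illustration: take a point null $H_0: \pi = \pi_0$ against $H_1: \pi \sim \operatorname{Beta}(\alpha, \beta)$ for binomial data. The marginal likelihood under $H_0$ is the binomial pmf at $\pi_0$, under $H_1$ it is the beta–binomial pmf, and the Bayes factor is their ratio:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, gammaln

n, k = 20, 14          # illustrative data (assumed)
pi0 = 0.5              # point null H0
a, b = 2, 2            # prior under H1: pi ~ Beta(a, b)

# Marginal likelihood under H0: binomial pmf at pi0
m0 = binom.pmf(k, n, pi0)

# Marginal likelihood under H1: beta-binomial pmf
# p(k | H1) = C(n, k) * B(a + k, b + n - k) / B(a, b)
log_m1 = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
          + betaln(a + k, b + n - k) - betaln(a, b))
m1 = np.exp(log_m1)

bf10 = m1 / m0
print(f"BF10 = {bf10:.3f}")  # > 1 favours H1, < 1 favours H0
```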

Interpreting Bayes factors

A rough guideline for interpreting the magnitude of $\operatorname{BF}_{10}$:

  • 1–3: not worth more than a bare mention

  • 3–10: substantial evidence for $H_1$

  • 10–100: strong evidence for $H_1$

  • $> 100$: decisive evidence for $H_1$

Similarly, very small values (e.g. $\operatorname{BF}_{10} < 1/10$) provide strong evidence in favour of $H_0$.

These rules of thumb should not be used as rigid thresholds (like $\alpha = 0.05$ for p-values), but rather as a way to interpret the strength of evidence provided by the data.

Two-sided Bayesian tests and ROPEs

For continuous parameters, testing an exact point hypothesis such as $H_0: \theta = \theta_0$ is problematic because, under a continuous posterior, $P(\theta = \theta_0 \mid d) = 0$.

A practical workaround is the concept of a Region of Practical Equivalence (ROPE).

Example:

  • Suppose we are interested in whether a proportion $\pi$ is effectively equal to some value $\pi_0$ (e.g. 0.93).

  • Choose a small tolerance $\delta > 0$ representing a region of values that are practically indistinguishable from $\pi_0$.

Define the hypotheses:

  • Null hypothesis (practical equivalence):

    $$H_0: \pi \in [\pi_0 - \delta,\; \pi_0 + \delta]$$

  • Alternative hypothesis:

    $$H_1: \pi \notin [\pi_0 - \delta,\; \pi_0 + \delta].$$

We can then compute:

  • $P(H_0 \mid d) = P(\pi \in \text{ROPE} \mid d)$,

  • $P(H_1 \mid d) = P(\pi \notin \text{ROPE} \mid d)$,

  • and posterior odds

    $$\frac{P(H_1 \mid d)}{P(H_0 \mid d)}.$$

If the posterior puts most of its mass inside the ROPE, the data support practical equivalence. If most mass lies outside, the data support a meaningful difference.
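With a conjugate Beta posterior these probabilities are two CDF evaluations (for MCMC output they would be sample fractions instead). A minimal sketch with assumed illustrative values $\pi_0 = 0.7$ and $\delta = 0.05$:

```python
from scipy.stats import beta

a, b, n, k = 2, 2, 20, 14       # illustrative prior and data (assumed)
pi0, delta = 0.7, 0.05          # ROPE centre and half-width (assumed)

a_post, b_post = a + k, b + n - k

# P(H0 | d): posterior mass inside the ROPE [pi0 - delta, pi0 + delta]
p_h0 = beta.cdf(pi0 + delta, a_post, b_post) - beta.cdf(pi0 - delta, a_post, b_post)
p_h1 = 1 - p_h0

print(f"P(H0 | d) = {p_h0:.3f}, P(H1 | d) = {p_h1:.3f}")
print(f"posterior odds H1:H0 = {p_h1 / p_h0:.2f}")
```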

Frequentist vs Bayesian hypothesis tests

Key conceptual differences:

  • Frequentist tests:

    • Specify a null hypothesis $H_0$ and often an alternative $H_1$.

    • Compute a test statistic and a p-value $P(\text{data or more extreme} \mid H_0)$.

    • May reject $H_0$ if the p-value is below a chosen threshold $\alpha$.

    • Do not provide probabilities for hypotheses themselves.

  • Bayesian tests:

    • Assign prior probabilities to hypotheses (or models).

    • Compute posterior probabilities $P(H \mid d)$ and posterior odds.

    • Use Bayes factors to quantify how strongly data support one hypothesis over another.

    • Allow direct statements such as “Given the data and prior, we believe with $p\%$ that $H_1$ is true”.

The two approaches can yield different practical conclusions, especially when:

  • There is strong prior belief in a particular hypothesis, and

  • The observed data under that hypothesis are unlikely but not impossible.

Bayesian methods make it explicit that one surprising data set should not necessarily overturn a well-established theory if the prior evidence for that theory is overwhelming.
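As a purely illustrative calculation: if prior odds $P(H_1)/P(H_0) = 1/1000$ encode strong established support for $H_0$, and a surprising experiment yields $\operatorname{BF}_{10} = 50$, the posterior odds are $50 \cdot \tfrac{1}{1000} = 0.05$, so $H_0$ is still favoured 20-to-1: the data shift our beliefs substantially without overturning them.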