
Estimation, Prediction, Hypothesis Tests

Estimation: summarizing posterior distributions

After performing Bayesian inference, we obtain a posterior distribution $p(\theta \mid d)$ for parameters $\theta$ given data $d$. To communicate results, we usually summarize this distribution with a few key numbers.

For a scalar parameter $\theta$:

  • Posterior mean

    $$\mathbb{E}[\theta \mid d] = \int \theta \, p(\theta \mid d)\, d\theta.$$

  • Posterior variance

    $$\operatorname{Var}(\theta \mid d) = \mathbb{E}[\theta^2 \mid d] - \big(\mathbb{E}[\theta \mid d]\big)^2 = \int (\theta - \mu)^2 \, p(\theta \mid d)\, d\theta, \quad \mu = \mathbb{E}[\theta \mid d].$$

  • Posterior standard deviation

    $$\operatorname{SD}(\theta \mid d) = \sqrt{\operatorname{Var}(\theta \mid d)}.$$

  • Posterior mode / MAP estimate

    The maximum a posteriori (MAP) estimate is

    $$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \, p(\theta \mid d) = \arg\max_{\theta} \, p(d \mid \theta)\, p(\theta),$$

    i.e. the value of $\theta$ where the posterior density is largest.

MAP estimate for the beta–binomial model

In the beta–binomial model from earlier weeks,

  • Prior: $\pi \sim \operatorname{Beta}(\alpha, \beta)$,

  • Likelihood: $Y \mid \pi \sim \operatorname{Bin}(n, \pi)$,

the posterior is again beta:

$$\pi \mid Y = k \sim \operatorname{Beta}(\alpha + k,\; \beta + n - k).$$

For a beta distribution with shape parameters $\alpha' > 1$ and $\beta' > 1$, the mode is

$$\operatorname{mode}\big(\operatorname{Beta}(\alpha', \beta')\big) = \frac{\alpha' - 1}{\alpha' + \beta' - 2}.$$

Thus, in the beta–binomial case, the MAP estimate of $\pi$ is

$$\hat{\pi}_{\text{MAP}} = \frac{\alpha + k - 1}{\alpha + \beta + n - 2}, \quad \text{for } \alpha + k > 1,\; \beta + n - k > 1.$$

If a posterior shape parameter is $\le 1$, the density is maximal at a boundary rather than in the interior, and the mode lies at 0 or 1 (for $\alpha' = \beta' = 1$ the density is flat and the mode is not unique).
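To make this concrete, here is a minimal sketch in Python, assuming illustrative values ($\alpha = \beta = 2$, $n = 20$, $k = 14$, none of which come from the text), that compares the closed-form MAP with a direct numerical maximization of the Beta posterior via SciPy:

```python
import numpy as np
from scipy.stats import beta
from scipy.optimize import minimize_scalar

# Illustrative prior and data (assumed values, not from the text)
a, b = 2, 2        # prior Beta(alpha, beta)
n, k = 20, 14      # observed k successes in n trials

# Posterior is Beta(alpha + k, beta + n - k)
a_post, b_post = a + k, b + n - k

# Closed-form MAP (valid because both posterior shape parameters exceed 1)
map_closed_form = (a_post - 1) / (a_post + b_post - 2)

# Numerical MAP: maximize the posterior density on (0, 1)
res = minimize_scalar(lambda p: -beta.pdf(p, a_post, b_post),
                      bounds=(1e-6, 1 - 1e-6), method="bounded")

print(f"closed-form MAP: {map_closed_form:.4f}")   # 15/22 = 0.6818...
print(f"numerical MAP:   {res.x:.4f}")
print(f"posterior mean:  {beta.mean(a_post, b_post):.4f}")
print(f"posterior SD:    {beta.std(a_post, b_post):.4f}")
```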

Estimation from posterior samples

In realistic Bayesian models, the posterior $p(\theta \mid d)$ is usually not available in closed form. Instead, we obtain Monte Carlo samples $\{\theta_i\}_{i=1}^N$ from $p(\theta \mid d)$ via MCMC.

Posterior summaries are then computed as sample statistics:

  • Approximate posterior mean:

    $$\widehat{\mathbb{E}}[\theta \mid d] = \frac{1}{N} \sum_{i=1}^{N} \theta_i.$$

  • Approximate posterior variance:

    $$\widehat{\operatorname{Var}}(\theta \mid d) = \frac{1}{N-1} \sum_{i=1}^{N} (\theta_i - \bar{\theta})^2, \quad \bar{\theta} = \frac{1}{N}\sum_{i=1}^{N} \theta_i.$$

  • Approximate posterior standard deviation:

    $$\widehat{\operatorname{SD}}(\theta \mid d) = \sqrt{\widehat{\operatorname{Var}}(\theta \mid d)}.$$

Computing the mode from samples is less direct: one must estimate the density (e.g. via kernel density estimation) and find its maximum numerically.
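A minimal sketch of these sample-based summaries, using synthetic Beta draws as a stand-in for real MCMC output, with the mode approximated by kernel density estimation as described above:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for MCMC output: draws from some posterior (a Beta, for illustration)
samples = rng.beta(16, 8, size=5000)

post_mean = samples.mean()
post_var = samples.var(ddof=1)   # ddof=1 matches the 1/(N-1) estimator above
post_sd = np.sqrt(post_var)

# Approximate mode: fit a KDE and maximize it on a grid
kde = gaussian_kde(samples)
grid = np.linspace(samples.min(), samples.max(), 1000)
post_mode = grid[np.argmax(kde(grid))]

print(f"mean {post_mean:.3f}, sd {post_sd:.3f}, mode ~ {post_mode:.3f}")
```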

Credible intervals

A credible interval for a parameter $\theta$ is an interval $[a,b]$ such that the posterior probability that $\theta$ lies in the interval equals some chosen level $p$ (e.g. 0.8, 0.89, 0.95):

$$P(\theta \in [a,b] \mid d) = \int_{a}^{b} p(\theta \mid d)\, d\theta = p.$$

For example, an 80% credible interval $[a,b]$ satisfies

$$P(\theta \in [a,b] \mid d) = 0.8.$$

Two common types of credible intervals:

  1. Central (middle) credible interval of level $p$:

    $$\theta \in \big[ q_{(1-p)/2}, \; q_{(1+p)/2} \big],$$

    where $q_r$ is the $r$-quantile of the posterior.

  2. Highest Density Interval (HDI) of level $p$: an interval $[a,b]$ with

    • posterior probability mass $p$,

    • and maximum posterior density inside the interval compared to outside.

Credible intervals are also sometimes called compatibility intervals because they describe the range of parameter values most compatible with the observed data and the model.
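For posterior samples, the central interval of level $p$ is just two empirical quantiles. A minimal sketch (again with synthetic draws standing in for MCMC output):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.beta(16, 8, size=5000)  # stand-in for posterior draws

p = 0.80  # credibility level
lo, hi = np.quantile(samples, [(1 - p) / 2, (1 + p) / 2])
print(f"central {p:.0%} credible interval: [{lo:.3f}, {hi:.3f}]")
```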

Highest Density Interval (HDI)

The Highest Density Interval (HDI) of level $p$ is the smallest interval $[a,b]$ such that

  • it contains probability mass $p$:

    $$\int_{a}^{b} p(\theta \mid d)\, d\theta = p,$$

  • every point inside the interval has higher posterior density than any point outside:

    $$p(\theta \mid d) \ge p(\theta' \mid d) \quad \text{for all } \theta \in [a,b],\; \theta' \notin [a,b].$$

Intuition:

  • If the posterior is symmetric and unimodal (e.g. nearly normal), the HDI and central interval are very similar.

  • If the posterior is skewed or multimodal, the HDI focuses on the region(s) of highest posterior density.

A conceptual algorithm for computing an HDI from a univariate posterior:

  1. Choose a probability level $p$ (e.g. 0.8 or 0.94).

  2. Consider all intervals that contain probability mass $p$.

  3. Among these, select the interval with the smallest width $b-a$.

In practice, for posterior samples, libraries such as ArviZ (az.hdi) approximate the HDI directly.
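The smallest-width idea translates directly to sorted samples: slide a window containing a fraction $p$ of the draws and keep the narrowest one. A minimal sketch, compared against az.hdi (assuming ArviZ is installed; the samples are synthetic stand-ins):

```python
import numpy as np
import arviz as az

rng = np.random.default_rng(0)
samples = rng.beta(16, 8, size=5000)  # stand-in for posterior draws

def hdi_from_samples(draws, p=0.94):
    """Narrowest interval containing a fraction p of the sorted draws."""
    sorted_draws = np.sort(draws)
    n_in = int(np.floor(p * len(sorted_draws)))   # points inside the window
    widths = sorted_draws[n_in:] - sorted_draws[:-n_in]
    i = np.argmin(widths)                          # narrowest window wins
    return sorted_draws[i], sorted_draws[i + n_in]

print("manual HDI:", hdi_from_samples(samples, p=0.94))
print("ArviZ HDI: ", az.hdi(samples, hdi_prob=0.94))
```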

Note: PyMC / ArviZ often use an HDI probability of 94% by default. This is a reminder that, unlike the conventional 95% frequentist confidence interval, the exact choice of probability (e.g. 89%, 94%, 95%) is somewhat arbitrary and should be chosen to communicate uncertainty effectively.

Credible intervals vs confidence intervals

Bayesian credible interval (CI) for $\theta$ (level $p$):

  • Definition:

    $$P(\theta \in [a,b] \mid d) = p.$$

  • Interpretation:

    Given the data and the model, the probability that $\theta$ lies in $[a,b]$ is $p$.

  • Probability statements are about the parameter $\theta$ (which is uncertain).

Frequentist confidence interval (CI) for $\theta$ (level $p$):

  • Definition (informally): If we repeatedly collect data sets under identical conditions and construct an interval $[a^{(j)}, b^{(j)}]$ from each sample using a specified procedure, then in the long run a fraction $p$ of these intervals will contain the true parameter $\theta_{\text{true}}$.

  • Interpretation:

    The procedure has coverage probability $p$; in repeated sampling, a fraction $p$ of the intervals contain $\theta_{\text{true}}$.

  • Incorrect but common statement (to avoid):

    “Our particular confidence interval contains the true value with probability $p$.”
    In the frequentist framework, $\theta_{\text{true}}$ is not random, so this is not strictly correct.

Key differences:

  • Bayesian CIs express a degree of belief about the parameter in light of data and priors.

  • Frequentist CIs are statements about the procedure and long-run frequency properties, not about a single realized interval.
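The frequentist coverage statement can be checked by simulation. A minimal sketch, assuming an illustrative true proportion $\pi_{\text{true}} = 0.3$ and the Wald-type interval $\hat{\pi} \pm z \sqrt{\hat{\pi}(1-\hat{\pi})/n}$:

```python
import numpy as np

rng = np.random.default_rng(1)
pi_true, n, level = 0.3, 100, 0.95  # illustrative, assumed values
z = 1.96  # approximate 97.5% standard-normal quantile for a 95% interval

covered = 0
n_reps = 10_000
for _ in range(n_reps):
    y = rng.binomial(n, pi_true)          # a fresh data set
    pi_hat = y / n
    se = np.sqrt(pi_hat * (1 - pi_hat) / n)
    if pi_hat - z * se <= pi_true <= pi_hat + z * se:
        covered += 1

print(f"empirical coverage: {covered / n_reps:.3f}  (nominal {level})")
```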

Posterior predictive distribution

Given data $y$ and parameters $\theta$, the likelihood $p(y_{\text{new}} \mid \theta)$ describes the distribution of future or hypothetical observations $y_{\text{new}}$ given a fixed value of $\theta$.

In Bayesian inference, we do not know $\theta$ exactly, but we have a posterior $p(\theta \mid y)$. The predictive distribution for future data $y_{\text{new}}$ given observed data $y$ is obtained by marginalising out $\theta$:

$$p(y_{\text{new}} \mid y) = \int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y)\, d\theta.$$

This distribution is called the posterior predictive distribution.

Interpretation:

  • $p(y_{\text{new}} \mid y)$ is a model-based forecast that averages the likelihood over all plausible values of $\theta$, weighted by their posterior probability.

  • It naturally accounts for parameter uncertainty as well as sampling variability.

Posterior predictive simulation from samples

In practice, we usually have posterior samples $\{\theta_i\}_{i=1}^{N}$ from $p(\theta \mid y)$.

We can approximate $p(y_{\text{new}} \mid y)$ by simulation:

  1. Draw $\theta_i \sim p(\theta \mid y)$ (these are the MCMC samples).

  2. For each $\theta_i$, draw

    $$y_{\text{new}, i} \sim p(y_{\text{new}} \mid \theta_i)$$

    from the likelihood (for example, binomial, normal, Poisson, etc.).

  3. The collection $\{y_{\text{new}, i}\}_{i=1}^{N}$ is a sample from the posterior predictive distribution.
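A minimal sketch of this three-step recipe for the beta–binomial model (same illustrative prior and data as in the MAP example above; the posterior draws are exact here, but in general they would come from MCMC):

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, k = 2, 2, 20, 14      # illustrative prior and data, as assumed above
n_new, N = 20, 10_000          # future trial count and number of draws

# Step 1: posterior draws (exact here; in general these come from MCMC)
pi_draws = rng.beta(a + k, b + n - k, size=N)

# Step 2: one simulated future data set per posterior draw
y_new = rng.binomial(n_new, pi_draws)

# Step 3: y_new is a sample from the posterior predictive distribution
print(f"predictive mean: {y_new.mean():.2f}")
print(f"predictive sd:   {y_new.std(ddof=1):.2f}")
print("80% predictive interval:", np.quantile(y_new, [0.1, 0.9]))
```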

Posterior predictive summaries:

  • Predictive mean:

    $$\widehat{\mathbb{E}}[y_{\text{new}} \mid y] = \frac{1}{N} \sum_{i=1}^N y_{\text{new}, i}.$$

  • Predictive variance and standard deviation: sample variance and SD of the $y_{\text{new},i}$.

  • Predictive credible intervals: quantiles or HDIs of the $y_{\text{new},i}$.

This method avoids explicit evaluation of the integral

$$\int p(y_{\text{new}} \mid \theta)\, p(\theta \mid y)\, d\theta$$

and works for essentially arbitrary models.

Decomposing predictive uncertainty: aleatoric vs epistemic

The predictive distribution $p(y_{\text{new}} \mid y)$ mixes two kinds of uncertainty:

  • Aleatoric uncertainty (sampling variability, inherent noise): variability in $y_{\text{new}}$ even if $\theta$ were known exactly.

  • Epistemic uncertainty (model/parameter uncertainty): additional variability due to our imperfect knowledge of $\theta$, represented by the posterior $p(\theta \mid y)$.

The decomposition is captured by the law of total variance. Let $Y_{\text{new}}$ denote a future observation and $\Theta$ the parameter. Then

$$\operatorname{Var}(Y_{\text{new}} \mid y) = \mathbb{E}_{\Theta \mid y} \big[\, \operatorname{Var}(Y_{\text{new}} \mid \Theta, y) \,\big] \;+\; \operatorname{Var}_{\Theta \mid y} \big[\, \mathbb{E}(Y_{\text{new}} \mid \Theta, y) \,\big].$$

Interpretation:

  • The first term

    $$\mathbb{E}_{\Theta \mid y} \big[\, \operatorname{Var}(Y_{\text{new}} \mid \Theta, y) \,\big]$$

    is the aleatoric variance: average inherent noise in $Y_{\text{new}}$ for a given $\Theta$, averaged over the posterior of $\Theta$.

  • The second term

    $$\operatorname{Var}_{\Theta \mid y} \big[\, \mathbb{E}(Y_{\text{new}} \mid \Theta, y) \,\big]$$

    is the epistemic variance: variability in the predictive mean due to uncertainty in $\Theta$.

As we collect more data:

  • the epistemic component typically decreases (posterior concentrates),

  • the aleatoric component remains (it is intrinsic to the data-generating process).
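For the beta–binomial predictive simulation, both terms have closed forms given $\pi$: $\operatorname{Var}(Y_{\text{new}} \mid \pi) = n_{\text{new}}\,\pi(1-\pi)$ and $\mathbb{E}(Y_{\text{new}} \mid \pi) = n_{\text{new}}\,\pi$. A minimal sketch (continuing the illustrative setup) that checks the two components against the total predictive variance:

```python
import numpy as np

rng = np.random.default_rng(2)
a, b, n, k, n_new, N = 2, 2, 20, 14, 20, 100_000  # illustrative values

pi_draws = rng.beta(a + k, b + n - k, size=N)
y_new = rng.binomial(n_new, pi_draws)

# Aleatoric: posterior average of Var(Y_new | pi) = n_new * pi * (1 - pi)
aleatoric = np.mean(n_new * pi_draws * (1 - pi_draws))
# Epistemic: posterior variance of E(Y_new | pi) = n_new * pi
epistemic = np.var(n_new * pi_draws, ddof=1)

print(f"aleatoric + epistemic: {aleatoric + epistemic:.3f}")
print(f"total predictive var:  {np.var(y_new, ddof=1):.3f}")  # should roughly match
```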

Frequentist hypothesis testing for a proportion

Consider testing a statement about a population proportion $\pi$ based on binomial data $Y \sim \operatorname{Bin}(n, \pi)$.

Example one-sided test:

  • Null hypothesis:

    $$H_0: \pi \le \pi_0$$

  • Alternative hypothesis:

    $$H_1: \pi > \pi_0.$$

A typical frequentist workflow:

  1. Estimator
    Use the sample proportion

    $$\hat{\pi} = \frac{Y}{n}.$$

  2. Test statistic (approximate $z$ test)
    Under $H_0$ (with large $n$), the standardized test statistic

    $$Z = \frac{\hat{\pi} - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$$

    is approximately standard normal: $Z \approx \mathcal{N}(0,1)$.

  3. p-value
    For the one-sided test, the p-value is

    $$p\text{-value} = P(Z \ge z_{\text{obs}} \mid H_0),$$

    where $z_{\text{obs}}$ is the observed test statistic.

  4. Decision
    Choose a significance level $\alpha$ (e.g. $\alpha = 0.05$). If the p-value is $\le \alpha$, reject $H_0$; otherwise, do not reject $H_0$.
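A minimal sketch of this workflow, assuming illustrative values $\pi_0 = 0.5$, $n = 100$, $y = 62$:

```python
import numpy as np
from scipy.stats import norm

pi0, n, y = 0.5, 100, 62   # illustrative null value and data (assumed)
alpha = 0.05

pi_hat = y / n                                         # 1. estimator
z_obs = (pi_hat - pi0) / np.sqrt(pi0 * (1 - pi0) / n)  # 2. test statistic
p_value = norm.sf(z_obs)                               # 3. one-sided P(Z >= z_obs)

print(f"z = {z_obs:.3f}, p-value = {p_value:.4f}")
print("reject H0" if p_value <= alpha else "do not reject H0")  # 4. decision
```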

Note that in the frequentist framework we never assign a probability to $H_0$ or $H_1$ themselves; we only evaluate the probability of data or statistics under $H_0$.

Bayesian hypothesis testing, posterior odds, and Bayes factors

In the Bayesian framework, hypotheses are treated like any other propositions to which we can assign prior and posterior probabilities.

Consider two competing hypotheses $H_0$ and $H_1$ (e.g., statements about a parameter range).

  • Prior probabilities: $P(H_0)$ and $P(H_1)$

  • Prior odds:

    $$\frac{P(H_1)}{P(H_0)}.$$

Given data $d$, we can compute posterior probabilities $P(H_0 \mid d)$ and $P(H_1 \mid d)$ and the posterior odds:

$$\frac{P(H_1 \mid d)}{P(H_0 \mid d)}.$$

The Bayes factor in favour of $H_1$ against $H_0$ is defined as

$$\operatorname{BF}_{10} = \frac{p(d \mid H_1)}{p(d \mid H_0)},$$

where $p(d \mid H)$ is the marginal likelihood of $d$ under hypothesis $H$.

Posterior odds, prior odds, and the Bayes factor are related via

$$\frac{P(H_1 \mid d)}{P(H_0 \mid d)} = \operatorname{BF}_{10} \cdot \frac{P(H_1)}{P(H_0)}.$$

Interpretation:

  • $\operatorname{BF}_{10} > 1$: data provide evidence in favour of $H_1$ over $H_0$.

  • $\operatorname{BF}_{10} < 1$: data favour $H_0$.

  • The Bayes factor measures how much the data change our odds between hypotheses.
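As a concrete (assumed, not from the text) illustration: take a point null $H_0: \pi = \pi_0$ against $H_1: \pi \sim \operatorname{Beta}(\alpha, \beta)$ for binomial data. The marginal likelihood under $H_0$ is the binomial pmf at $\pi_0$, under $H_1$ it is the beta–binomial pmf, and the Bayes factor is their ratio:

```python
import numpy as np
from scipy.stats import binom
from scipy.special import betaln, gammaln

n, k = 20, 14          # illustrative data (assumed)
pi0 = 0.5              # point null H0
a, b = 2, 2            # prior under H1: pi ~ Beta(a, b)

# Marginal likelihood under H0: binomial pmf at pi0
m0 = binom.pmf(k, n, pi0)

# Marginal likelihood under H1: beta-binomial pmf
# p(k | H1) = C(n, k) * B(a + k, b + n - k) / B(a, b)
log_m1 = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)
          + betaln(a + k, b + n - k) - betaln(a, b))
m1 = np.exp(log_m1)

bf10 = m1 / m0
print(f"BF10 = {bf10:.3f}")  # > 1 favours H1, < 1 favours H0
```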

Interpreting Bayes factors

A rough guideline for interpreting the magnitude of $\operatorname{BF}_{10}$:

  • 1–3: not worth more than a bare mention

  • 3–10: substantial evidence for $H_1$

  • 10–100: strong evidence for $H_1$

  • $> 100$: decisive evidence for $H_1$

Similarly, very small values (e.g. $\operatorname{BF}_{10} < 1/10$) provide strong evidence in favour of $H_0$.

These rules of thumb should not be used as rigid thresholds (like $\alpha = 0.05$ for p-values), but rather as a way to interpret the strength of evidence provided by the data.

Two-sided Bayesian tests and ROPEs

For continuous parameters, testing an exact point hypothesis such as $H_0: \theta = \theta_0$ is problematic because, under a continuous posterior, $P(\theta = \theta_0 \mid d) = 0$.

A practical workaround is the concept of a Region of Practical Equivalence (ROPE).

Example:

  • Suppose we are interested in whether a proportion $\pi$ is effectively equal to some value $\pi_0$ (e.g. 0.93).

  • Choose a small tolerance $\delta > 0$ representing a region of values that are practically indistinguishable from $\pi_0$.

Define the hypotheses:

  • Null hypothesis (practical equivalence):

    $$H_0: \pi \in [\pi_0 - \delta,\; \pi_0 + \delta]$$

  • Alternative hypothesis:

    $$H_1: \pi \notin [\pi_0 - \delta,\; \pi_0 + \delta].$$

We can then compute:

  • $P(H_0 \mid d) = P(\pi \in \text{ROPE} \mid d)$,

  • $P(H_1 \mid d) = P(\pi \notin \text{ROPE} \mid d)$,

  • and posterior odds

    $$\frac{P(H_1 \mid d)}{P(H_0 \mid d)}.$$

If the posterior puts most of its mass inside the ROPE, the data support practical equivalence. If most mass lies outside, the data support a meaningful difference.
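With a conjugate Beta posterior these probabilities are two CDF evaluations (for MCMC output they would be sample fractions instead). A minimal sketch with assumed illustrative values $\pi_0 = 0.7$ and $\delta = 0.05$:

```python
from scipy.stats import beta

a, b, n, k = 2, 2, 20, 14       # illustrative prior and data (assumed)
pi0, delta = 0.7, 0.05          # ROPE centre and half-width (assumed)

a_post, b_post = a + k, b + n - k

# P(H0 | d): posterior mass inside the ROPE [pi0 - delta, pi0 + delta]
p_h0 = beta.cdf(pi0 + delta, a_post, b_post) - beta.cdf(pi0 - delta, a_post, b_post)
p_h1 = 1 - p_h0

print(f"P(H0 | d) = {p_h0:.3f}, P(H1 | d) = {p_h1:.3f}")
print(f"posterior odds H1:H0 = {p_h1 / p_h0:.2f}")
```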

Frequentist vs Bayesian hypothesis tests

Key conceptual differences:

  • Frequentist tests:

    • Specify a null hypothesis $H_0$ and often an alternative $H_1$.

    • Compute a test statistic and a p-value $P(\text{data or more extreme} \mid H_0)$.

    • May reject $H_0$ if the p-value is below a chosen threshold $\alpha$.

    • Do not provide probabilities for hypotheses themselves.

  • Bayesian tests:

    • Assign prior probabilities to hypotheses (or models).

    • Compute posterior probabilities $P(H \mid d)$ and posterior odds.

    • Use Bayes factors to quantify how strongly data support one hypothesis over another.

    • Allow direct statements such as “Given the data and prior, we believe with $p\%$ that $H_1$ is true”.

The two approaches can yield different practical conclusions, especially when:

  • There is strong prior belief in a particular hypothesis, and

  • The observed data under that hypothesis are unlikely but not impossible.

Bayesian methods make it explicit that one surprising data set should not necessarily overturn a well-established theory if the prior evidence for that theory is overwhelming.
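As a purely illustrative calculation: if prior odds $P(H_1)/P(H_0) = 1/1000$ encode strong established support for $H_0$, and a surprising experiment yields $\operatorname{BF}_{10} = 50$, the posterior odds are $50 \cdot \tfrac{1}{1000} = 0.05$, so $H_0$ is still favoured 20-to-1: the data shift our beliefs substantially without overturning them.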