Continuous Problems and Conjugate Families

From discrete to continuous Bayesian problems

In most real applications, parameters are naturally treated as continuous. For example, consider the proportion $\pi$ of people in a population who can roll their tongue. In principle, $\pi$ can take on any value in the interval $[0,1]$.

Conceptually, nothing changes in the Bayesian workflow:

  1. We specify a likelihood model $p(y \mid \theta)$ for the data $y$ given parameters $\theta$.

  2. We specify a prior distribution $p(\theta)$ over $\theta$.

  3. We use Bayes’ theorem to obtain the posterior $p(\theta \mid y)$.

The main difference is that, in the continuous case, we work with probability densities and integrals rather than probability masses and sums.

Binomial distribution: sampling model vs. likelihood

Suppose we observe a count $Y$ of “successes” in $n$ independent Bernoulli trials, each with success probability $\pi$.

The binomial distribution gives the probability of observing exactly $k$ successes:

$$P(Y = k \mid \pi) = \binom{n}{k} \, \pi^k (1-\pi)^{n-k}, \quad k = 0,1,\dots,n.$$

This formula has two complementary interpretations:

  • Sampling model (data generating model):
    If the “true” underlying proportion $\pi$ is known, the binomial distribution tells us how likely different data outcomes $k$ are.

  • Likelihood function:
    For fixed observed data $(n,k)$, we can view

    $$L(\pi \mid k, n) = P(Y = k \mid \pi)$$

    as a function of $\pi$. This tells us which parameter values $\pi$ make the observed data most likely.

The distinction between sampling model (probability of data given $\pi$) and likelihood (function of $\pi$ for fixed data) is fundamental in both frequentist and Bayesian inference.
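
To make the two readings concrete, here is a minimal sketch in Python (NumPy and SciPy are assumed, and the data values are purely illustrative): the same `binom.pmf` formula is first evaluated over outcomes $k$ for a fixed $\pi$, and then over a grid of $\pi$ values for fixed data.

```python
import numpy as np
from scipy.stats import binom

n, k_obs = 20, 14          # hypothetical data: 14 successes in 20 trials

# Sampling model: P(Y = k | pi) for a fixed pi, over all possible outcomes k
pi_true = 0.6
sampling_probs = binom.pmf(np.arange(n + 1), n, pi_true)
print(sampling_probs.sum())            # sums to 1 over k

# Likelihood: the same formula viewed as a function of pi for fixed (n, k_obs)
pi_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(k_obs, n, pi_grid)
print(pi_grid[np.argmax(likelihood)])  # maximised near k_obs / n = 0.7
```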

Probability density functions (PDFs) and cumulative distribution functions (CDFs)

For continuous random variables, we work with probability density functions (PDFs) instead of mass functions.

Let $\Theta$ be a continuous random variable (e.g. a parameter such as $\pi$). Its PDF $p(\theta)$ must satisfy:

  1. Non-negativity

    $$p(\theta) \ge 0 \quad \text{for all } \theta$$
  2. Normalization

    $$\int_{-\infty}^{\infty} p(\theta)\, d\theta = 1$$

The associated cumulative distribution function (CDF) is

$$F(\theta) = P(\Theta \le \theta) = \int_{-\infty}^{\theta} p(t)\, dt.$$

The expectation (mean) of Θ\Theta is

$$\mathbb{E}[\Theta] = \int_{-\infty}^{\infty} \theta \, p(\theta)\, d\theta,$$

and the variance is

$$\operatorname{Var}(\Theta) = \mathbb{E}\big[(\Theta - \mu)^2\big] = \int_{-\infty}^{\infty} (\theta - \mu)^2 \, p(\theta)\, d\theta, \quad \text{where } \mu = \mathbb{E}[\Theta].$$

For parameters restricted to a smaller range (e.g. $\pi \in [0,1]$), the integration limits are adapted accordingly.
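
As a quick numerical illustration of these definitions (SciPy assumed; the Beta(2, 5) density is just an example choice of $p(\theta)$ on $[0,1]$), normalization, mean, variance, and CDF can all be checked by numerical integration:

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

a, b = 2.0, 5.0
pdf = lambda t: beta.pdf(t, a, b)   # example density p(theta) on [0, 1]

total, _ = integrate.quad(pdf, 0, 1)                               # normalisation
mean, _ = integrate.quad(lambda t: t * pdf(t), 0, 1)               # E[Theta]
var, _ = integrate.quad(lambda t: (t - mean) ** 2 * pdf(t), 0, 1)  # Var(Theta)
cdf_at_half, _ = integrate.quad(pdf, 0, 0.5)                       # F(0.5)

print(total, mean, var)                  # ~1.0, ~0.2857, ~0.0255
print(cdf_at_half, beta.cdf(0.5, a, b))  # numerical CDF vs closed form
```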

Bayes’ theorem for continuous parameters and marginalisation

Let $\theta$ be a continuous parameter (e.g. a proportion $\pi$) with prior density $p(\theta)$, and let $y$ denote observed data with likelihood $p(y \mid \theta)$.

Bayes’ theorem (continuous form) says that the posterior density is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$

where the evidence (or marginal likelihood) is

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$

The denominator is a marginalisation over all possible parameter values $\theta$: it averages the sampling probability of the data over the prior distribution of $\theta$.

More generally, if we have a partition (or family) of hypotheses $H_i$ with prior probabilities $P(H_i)$, the discrete version of the law of total probability is

$$P(y) = \sum_i P(y \mid H_i)\, P(H_i).$$

In the continuous case, sums become integrals:

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$

This marginalisation step is what makes many continuous problems analytically hard — the integral is often not available in closed form.
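
When the integral can be evaluated numerically, it is easy to see what marginalisation does. The sketch below (SciPy assumed; the binomial likelihood and Beta(2, 2) prior are illustrative choices) computes the evidence $p(y)$ and verifies that dividing the numerator by it yields a posterior density that integrates to 1:

```python
import numpy as np
from scipy import integrate
from scipy.stats import binom, beta

n, k = 20, 14      # observed data
a, b = 2.0, 2.0    # Beta(2, 2) prior on pi

# Evidence p(y) = integral of likelihood * prior over pi in [0, 1]
integrand = lambda pi: binom.pmf(k, n, pi) * beta.pdf(pi, a, b)
evidence, _ = integrate.quad(integrand, 0, 1)

# Normalised posterior density integrates to 1
posterior = lambda pi: integrand(pi) / evidence
check, _ = integrate.quad(posterior, 0, 1)
print(evidence, check)   # evidence, and ~1.0
```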

Beta distribution as a prior for proportions

For a probability parameter $\pi \in [0,1]$ (e.g. a proportion or success probability), a very common continuous prior family is the beta distribution.

A random variable $\Pi$ has a beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$, written $\Pi \sim \operatorname{Beta}(\alpha,\beta)$, if its PDF is

$$p(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \pi^{\alpha-1} (1-\pi)^{\beta-1}, \quad 0 \le \pi \le 1.$$

Here $\Gamma(\cdot)$ is the Gamma function, a continuous generalization of the factorial: $\Gamma(n) = (n-1)!$ for positive integers $n$.

Important summaries:

  • Mean

    $$\mathbb{E}[\Pi] = \frac{\alpha}{\alpha + \beta}$$
  • Variance

    $$\operatorname{Var}(\Pi) = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

Interpretation:

  • $\alpha$ and $\beta$ control the shape and concentration of the prior.

  • Roughly speaking, $\alpha + \beta$ acts like a prior sample size:
    large $\alpha + \beta$ means a more concentrated (strongly informed) prior; smaller values mean a more diffuse (weakly informed) prior (see the short numerical check below).
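
A short numerical check of the mean and variance formulas, and of how $\alpha + \beta$ controls concentration (a Python sketch assuming SciPy; the parameter values are illustrative):

```python
from scipy.stats import beta

# All three settings share the prior mean 0.2, but alpha + beta grows.
for a, b in [(2, 8), (20, 80), (200, 800)]:
    mean_formula = a / (a + b)
    var_formula = a * b / ((a + b) ** 2 * (a + b + 1))
    print(a, b, mean_formula, beta.mean(a, b), var_formula, beta.var(a, b))
# The formulas match SciPy, and the variance shrinks as alpha + beta grows:
# a larger "prior sample size" gives a more concentrated prior.
```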

Beta–binomial conjugate family

Consider the binomial sampling model

$$P(Y = k \mid \pi) = \binom{n}{k} \, \pi^k (1-\pi)^{n-k},$$

and a beta prior for $\pi$,

$$\pi \sim \operatorname{Beta}(\alpha,\beta).$$

The posterior distribution for $\pi$ given data $(n,k)$ is again a beta distribution:

$$\pi \mid (k,n) \sim \operatorname{Beta}(\alpha + k,\; \beta + n - k).$$

This is the hallmark of a conjugate family: prior and posterior belong to the same parametric family.

The corresponding posterior mean is

$$\mathbb{E}[\pi \mid k, n] = \frac{\alpha + k}{\alpha + \beta + n}.$$

This can be rewritten as a weighted average of

  • the prior mean $\displaystyle \mu_{\text{prior}} = \frac{\alpha}{\alpha+\beta}$ and

  • the sample proportion $\displaystyle \hat\pi_{\text{data}} = \frac{k}{n}$:

$$\mathbb{E}[\pi \mid k, n] = \frac{\alpha + \beta}{\alpha + \beta + n} \, \mu_{\text{prior}} \;+\; \frac{n}{\alpha + \beta + n} \, \hat\pi_{\text{data}}.$$

So the posterior expectation is a compromise between prior belief and empirical data, where $\alpha+\beta$ and $n$ play the role of weights.
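
A tiny sketch (plain Python; the prior and data values are illustrative) confirming that the conjugate update rule and the weighted-average form of the posterior mean agree:

```python
a, b = 2.0, 2.0        # Beta(alpha, beta) prior
n, k = 20, 14          # observed data

# Conjugate update: Beta(alpha + k, beta + n - k)
a_post, b_post = a + k, b + (n - k)
post_mean = a_post / (a_post + b_post)

# Weighted average of prior mean and sample proportion
prior_mean = a / (a + b)
data_prop = k / n
weighted = (a + b) / (a + b + n) * prior_mean + n / (a + b + n) * data_prop

print(post_mean, weighted)   # identical: 16/24 = 0.6667
```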

Principles of Bayesian inference illustrated by the beta–binomial case

The beta–binomial family highlights several general principles of Bayesian inference:

  1. Prior strength vs. data strength

    • When the prior is weak (small $\alpha+\beta$), the posterior is dominated by the likelihood/data.

    • When the prior is strong (large $\alpha+\beta$), the posterior remains closer to the prior, and more data are needed to “overcome” it.

  2. Posterior as compromise
    The posterior distribution is always a compromise between prior and likelihood. This is reflected both in the posterior mean and in the posterior shape.

  3. Effect of additional data
    As $n$ grows larger,

    • the (scaled) likelihood becomes more concentrated (narrower), and

    • the posterior is pulled more towards the data.

    In the limit $n \to \infty$, the posterior is dominated by the data (under mild regularity conditions).

  4. Data order invariance
    For independent observations, it does not matter in which order they are processed: sequentially updating the posterior with subsets of data or updating once with all data yields the same posterior (see the sketch below).

These properties are not unique to the beta–binomial case, but they are especially easy to see there due to the simple analytic update rule.
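
The sketch below (plain Python; the Bernoulli outcomes are illustrative) demonstrates data order invariance for the beta–binomial update: a single batch update and a shuffled sequence of one-at-a-time updates land on the same posterior parameters.

```python
import random

a0, b0 = 2.0, 2.0                        # Beta(alpha, beta) prior
data = [1, 0, 1, 1, 0, 1, 1, 1]          # individual Bernoulli outcomes

# Batch update with k successes out of n trials
k, n = sum(data), len(data)
batch = (a0 + k, b0 + n - k)

# Sequential update, one observation at a time, in shuffled order
a, b = a0, b0
shuffled = data[:]
random.shuffle(shuffled)
for y in shuffled:
    a, b = a + y, b + (1 - y)

print(batch, (a, b))   # same posterior parameters either way
```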

Conjugate prior–likelihood families

Let $\theta$ be a parameter and $y$ data. A conjugate prior family for a likelihood $p(y \mid \theta)$ is a family of prior distributions $p(\theta \mid \eta)$, parameterized by some hyperparameters $\eta$, such that the posterior belongs to the same family:

$$p(\theta \mid y, \eta) \;\propto\; p(y \mid \theta)\, p(\theta \mid \eta) \quad \text{is in the same family as } p(\theta \mid \eta).$$

Informally,

multiplying likelihood and prior produces another distribution of the same functional form as the prior.

Examples of conjugate pairs:

  • Binomial likelihood + Beta prior → Beta posterior (beta–binomial family)

  • Poisson likelihood + Gamma prior → Gamma posterior (gamma–Poisson family)

  • Normal likelihood (for a mean) + Normal prior → Normal posterior (normal–normal family)

  • Uniform likelihood on $[0,\theta]$ + Pareto prior → Pareto posterior (Pareto–uniform family)

Advantages:

  • Closed-form update rules for the posterior.

  • Easy computation of posterior summaries (mean, variance, etc.).

Disadvantages:

  • Available only for relatively simple models.

  • Restrict priors to low-dimensional parametric families, which may not always capture realistic prior knowledge.

  • For many practical models, no convenient conjugate prior exists at all.

In those more complex cases, we use numerical methods (e.g. Markov Chain Monte Carlo) to approximate the posterior.

Posterior simulation via joint sampling

Even when an analytic formula is available (as in the beta–binomial case), it’s useful to think in terms of simulation.

One generic idea is:

  1. Sample parameters $\theta_i$ from the prior $p(\theta)$.

  2. For each $\theta_i$, sample data $y_i$ from the likelihood $p(y \mid \theta_i)$.

  3. Keep only those $\theta_i$ for which $y_i$ equals (or is close to) the observed data $y_{\text{obs}}$.

  4. The retained $\theta_i$ form a sample from the posterior $p(\theta \mid y_{\text{obs}})$.

In the simple beta–binomial example, this corresponds to:

  • sampling $\pi_i \sim \operatorname{Beta}(\alpha,\beta)$,

  • sampling $k_i \sim \operatorname{Binomial}(n,\pi_i)$,

  • retaining only $\pi_i$ where $k_i = k_{\text{obs}}$.

This approach is called posterior simulation via rejection or (in hierarchical settings) ancestral sampling. In practice, more efficient algorithms (such as MCMC) are used for complex models.
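
Here is a minimal rejection-sampling sketch for the beta–binomial case (NumPy and SciPy assumed; the prior and data values are illustrative), compared against the analytic $\operatorname{Beta}(\alpha + k, \beta + n - k)$ posterior:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a, b = 2.0, 2.0            # Beta(alpha, beta) prior
n, k_obs = 20, 14          # observed data
n_sims = 200_000

pi_prior = rng.beta(a, b, size=n_sims)     # 1. sample pi_i from the prior
k_sim = rng.binomial(n, pi_prior)          # 2. simulate data from each pi_i
pi_post = pi_prior[k_sim == k_obs]         # 3. keep pi_i that reproduce the data

# Compare to the analytic Beta(alpha + k, beta + n - k) posterior
print(pi_post.mean(), beta.mean(a + k_obs, b + n - k_obs))
print(pi_post.var(), beta.var(a + k_obs, b + n - k_obs))
```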

Poisson distribution for counting processes

Many counting processes (number of arrivals, number of events in a time interval, etc.) are modeled with the Poisson distribution.

A non-negative integer-valued random variable $Y$ has a Poisson distribution with rate parameter $\lambda > 0$, denoted $Y \sim \operatorname{Poisson}(\lambda)$, if

$$P(Y = y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0,1,2,\dots$$

Important properties:

  • Mean

    $$\mathbb{E}[Y] = \lambda$$
  • Variance

    $$\operatorname{Var}(Y) = \lambda.$$

For independent observations $y_1,\dots,y_n$ assumed Poisson with the same rate $\lambda$, the joint likelihood is

$$P(y_1,\dots,y_n \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \left(\prod_{i=1}^{n} \frac{1}{y_i!}\right) \lambda^{\sum_{i=1}^{n} y_i} e^{-n \lambda}.$$

Up to a constant factor not depending on $\lambda$, the kernel is

$$\lambda^{\sum_i y_i} e^{-n\lambda}.$$

Gamma prior and the gamma–Poisson conjugate family

A common prior for a Poisson rate $\lambda > 0$ is the gamma distribution.

A random variable $\Lambda$ has a gamma distribution with shape $s > 0$ and rate $r > 0$, written $\Lambda \sim \operatorname{Gamma}(s,r)$, if its PDF is

$$p(\lambda \mid s,r) = \frac{r^{s}}{\Gamma(s)} \, \lambda^{s-1} e^{-r\lambda}, \quad \lambda > 0.$$

Summaries:

  • Mean

    $$\mathbb{E}[\Lambda] = \frac{s}{r}$$
  • Variance

    $$\operatorname{Var}(\Lambda) = \frac{s}{r^2}.$$

Combining a Poisson likelihood with a gamma prior yields the gamma–Poisson conjugate family.

Given independent observations $y_1,\dots,y_n \sim \operatorname{Poisson}(\lambda)$ and prior $\lambda \sim \operatorname{Gamma}(s,r)$, the posterior is

$$\lambda \mid y_1,\dots,y_n \sim \operatorname{Gamma}\!\Big(s + \sum_{i=1}^{n} y_i,\; r + n\Big).$$

Thus, the update rule is:

  • shape: $s_{\text{posterior}} = s_{\text{prior}} + \sum_i y_i$

  • rate: $r_{\text{posterior}} = r_{\text{prior}} + n$

Again, prior and posterior are from the same family, illustrating conjugacy.
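
A small sketch of the gamma–Poisson update (NumPy and SciPy assumed; the prior and the simulated counts are illustrative). Note that SciPy parameterizes the gamma distribution by shape and scale, where scale $= 1/r$:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
s, r = 3.0, 1.0                      # Gamma(shape s, rate r) prior
y = rng.poisson(lam=4.0, size=10)    # hypothetical Poisson counts

s_post = s + y.sum()                 # shape: s + sum(y_i)
r_post = r + len(y)                  # rate:  r + n

# SciPy uses scale = 1 / rate
post_mean = gamma.mean(s_post, scale=1 / r_post)
print(s_post, r_post, post_mean)     # posterior mean = (s + sum y) / (r + n)
```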

Normal–normal conjugate family

The normal distribution plays a central role in statistics, partly due to the central limit theorem and partly because it is a maximum entropy distribution under certain constraints.

The PDF of a normal distribution with mean $\mu$ and variance $\sigma^2$ is

$$p(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).$$

Consider the following simple model:

  • Data: $y_1,\dots,y_n$ are i.i.d. $\operatorname{Normal}(\mu, \sigma^2)$, with known variance $\sigma^2$.

  • Prior for the mean: $\mu \sim \operatorname{Normal}(\mu_0, \tau_0^2)$.

The likelihood of the data given $\mu$ is

$$p(y_1,\dots,y_n \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right).$$

Multiplying likelihood and prior yields a normal posterior for $\mu$:

$$\mu \mid y_1,\dots,y_n \sim \operatorname{Normal}(\mu_n, \tau_n^2),$$

with

  • posterior variance

    $$\tau_n^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$
  • posterior mean

    $$\mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \, \bar y}{\sigma^2} \right), \quad \bar y = \frac{1}{n} \sum_{i=1}^n y_i.$$

As in the beta–binomial and gamma–Poisson cases, the posterior mean is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar y$; the posterior variance $\tau_n^2$ is smaller than both the prior variance $\tau_0^2$ and the sampling variance $\sigma^2/n$ of the sample mean, reflecting the increased information.
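
A minimal sketch of the normal–normal update (NumPy assumed; all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                                 # known data standard deviation
mu0, tau0 = 0.0, 5.0                        # prior Normal(mu0, tau0^2)
y = rng.normal(loc=3.0, scale=sigma, size=25)

n, ybar = len(y), y.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)          # posterior variance
mu_n = tau_n2 * (mu0 / tau0**2 + n * ybar / sigma**2)  # posterior mean

print(mu_n, tau_n2)   # mu_n lies between mu0 and ybar; tau_n2 < tau0^2 and < sigma^2 / n
```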

Perspective on conjugate families

Although conjugate families (beta–binomial, gamma–Poisson, normal–normal, etc.) provide elegant closed-form solutions, their role in modern Bayesian data analysis is more conceptual than practical:

  1. They provide clean, analytic examples that make it easy to understand how priors, likelihoods, and data interact.

  2. They introduce important probability distributions (beta, gamma, normal, Poisson, …) that are also crucial building blocks in more complex models.

  3. They show how the posterior often becomes a compromise between prior and data, with explicit formulas for how prior strength and sample size interact.

For many realistic models, there is no convenient conjugate prior, and closed-form posteriors are unavailable. In those cases, we turn to numerical methods, especially Markov Chain Monte Carlo (MCMC), which can approximate the posterior without relying on analytic conjugacy.

Even then, we typically still use parametric distributions (such as beta, gamma, and normal) for priors and likelihoods, so the intuition developed from conjugate families remains extremely useful.