Continuous Problems and Conjugate Families

From discrete to continuous Bayesian problems

In most real applications, parameters are naturally treated as continuous. For example, consider the proportion $\pi$ of people in a population who can roll their tongue. In principle, $\pi$ can take on any value in the interval $[0,1]$.

Conceptually, nothing changes in the Bayesian workflow:

  1. We specify a likelihood model $p(y \mid \theta)$ for the data $y$ given parameters $\theta$.

  2. We specify a prior distribution $p(\theta)$ over $\theta$.

  3. We use Bayes’ theorem to obtain the posterior $p(\theta \mid y)$.

The main difference is that, in the continuous case, we work with probability densities and integrals rather than probability masses and sums.

Binomial distribution: sampling model vs. likelihood

Suppose we observe a count $Y$ of “successes” in $n$ independent Bernoulli trials, each with success probability $\pi$.

The binomial distribution gives the probability of observing exactly $k$ successes:

$$P(Y = k \mid \pi) = \binom{n}{k} \, \pi^k (1-\pi)^{n-k}, \quad k = 0,1,\dots,n.$$

This formula has two complementary interpretations:

  • Sampling model (data generating model):
    If the “true” underlying proportion $\pi$ is known, the binomial distribution tells us how likely different data outcomes $k$ are.

  • Likelihood function:
    For fixed observed data $(n,k)$, we can view

    $$L(\pi \mid k, n) = P(Y = k \mid \pi)$$

    as a function of $\pi$. This tells us which parameter values $\pi$ make the observed data most likely.

The distinction between sampling model (probability of data given $\pi$) and likelihood (function of $\pi$ for fixed data) is fundamental in both frequentist and Bayesian inference.
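
To make the two readings concrete, here is a minimal sketch in Python (NumPy and SciPy are assumed, and the data values are purely illustrative): the same `binom.pmf` formula is first evaluated over outcomes $k$ for a fixed $\pi$, and then over a grid of $\pi$ values for fixed data.

```python
import numpy as np
from scipy.stats import binom

n, k_obs = 20, 14          # hypothetical data: 14 successes in 20 trials

# Sampling model: P(Y = k | pi) for a fixed pi, over all possible outcomes k
pi_true = 0.6
sampling_probs = binom.pmf(np.arange(n + 1), n, pi_true)
print(sampling_probs.sum())            # sums to 1 over k

# Likelihood: the same formula viewed as a function of pi for fixed (n, k_obs)
pi_grid = np.linspace(0.01, 0.99, 99)
likelihood = binom.pmf(k_obs, n, pi_grid)
print(pi_grid[np.argmax(likelihood)])  # maximised near k_obs / n = 0.7
```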

Probability density functions (PDFs) and cumulative distribution functions (CDFs)

For continuous random variables, we work with probability density functions (PDFs) instead of mass functions.

Let $\Theta$ be a continuous random variable (e.g. a parameter such as $\pi$). Its PDF $p(\theta)$ must satisfy:

  1. Non-negativity

    $$p(\theta) \ge 0 \quad \text{for all } \theta$$
  2. Normalization

    $$\int_{-\infty}^{\infty} p(\theta)\, d\theta = 1$$

The associated cumulative distribution function (CDF) is

$$F(\theta) = P(\Theta \le \theta) = \int_{-\infty}^{\theta} p(t)\, dt.$$

The expectation (mean) of Θ\Theta is

$$\mathbb{E}[\Theta] = \int_{-\infty}^{\infty} \theta \, p(\theta)\, d\theta,$$

and the variance is

$$\operatorname{Var}(\Theta) = \mathbb{E}\big[(\Theta - \mu)^2\big] = \int_{-\infty}^{\infty} (\theta - \mu)^2 \, p(\theta)\, d\theta, \quad \text{where } \mu = \mathbb{E}[\Theta].$$

For parameters restricted to a smaller range (e.g. $\pi \in [0,1]$), the integration limits are adapted accordingly.
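
As a quick numerical illustration of these definitions (SciPy assumed; the Beta(2, 5) density is just an example choice of $p(\theta)$ on $[0,1]$), normalization, mean, variance, and CDF can all be checked by numerical integration:

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

a, b = 2.0, 5.0
pdf = lambda t: beta.pdf(t, a, b)   # example density p(theta) on [0, 1]

total, _ = integrate.quad(pdf, 0, 1)                               # normalisation
mean, _ = integrate.quad(lambda t: t * pdf(t), 0, 1)               # E[Theta]
var, _ = integrate.quad(lambda t: (t - mean) ** 2 * pdf(t), 0, 1)  # Var(Theta)
cdf_at_half, _ = integrate.quad(pdf, 0, 0.5)                       # F(0.5)

print(total, mean, var)                  # ~1.0, ~0.2857, ~0.0255
print(cdf_at_half, beta.cdf(0.5, a, b))  # numerical CDF vs closed form
```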

Bayes’ theorem for continuous parameters and marginalisation

Let $\theta$ be a continuous parameter (e.g. a proportion $\pi$) with prior density $p(\theta)$, and let $y$ denote observed data with likelihood $p(y \mid \theta)$.

Bayes’ theorem (continuous form) says that the posterior density is

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$

where the evidence (or marginal likelihood) is

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$

The denominator is a marginalisation over all possible parameter values $\theta$: it averages the sampling probability of the data over the prior distribution of $\theta$.

More generally, if we have a partition (or family) of hypotheses $H_i$ with prior probabilities $P(H_i)$, the discrete version of the law of total probability is

$$P(y) = \sum_i P(y \mid H_i)\, P(H_i).$$

In the continuous case, sums become integrals:

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$

This marginalisation step is what makes many continuous problems analytically hard — the integral is often not available in closed form.
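
When the integral can be evaluated numerically, it is easy to see what marginalisation does. The sketch below (SciPy assumed; the binomial likelihood and Beta(2, 2) prior are illustrative choices) computes the evidence $p(y)$ and verifies that dividing the numerator by it yields a posterior density that integrates to 1:

```python
import numpy as np
from scipy import integrate
from scipy.stats import binom, beta

n, k = 20, 14      # observed data
a, b = 2.0, 2.0    # Beta(2, 2) prior on pi

# Evidence p(y) = integral of likelihood * prior over pi in [0, 1]
integrand = lambda pi: binom.pmf(k, n, pi) * beta.pdf(pi, a, b)
evidence, _ = integrate.quad(integrand, 0, 1)

# Normalised posterior density integrates to 1
posterior = lambda pi: integrand(pi) / evidence
check, _ = integrate.quad(posterior, 0, 1)
print(evidence, check)   # evidence, and ~1.0
```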

Beta distribution as a prior for proportions

For a probability parameter $\pi \in [0,1]$ (e.g. a proportion or success probability), a very common continuous prior family is the beta distribution.

A random variable $\Pi$ has a beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$, written $\Pi \sim \operatorname{Beta}(\alpha,\beta)$, if its PDF is

$$p(\pi \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)} \, \pi^{\alpha-1} (1-\pi)^{\beta-1}, \quad 0 \le \pi \le 1.$$

Here $\Gamma(\cdot)$ is the Gamma function, a continuous generalization of the factorial: $\Gamma(n) = (n-1)!$ for positive integers $n$.

Important summaries:

  • Mean

    $$\mathbb{E}[\Pi] = \frac{\alpha}{\alpha + \beta}$$
  • Variance

    $$\operatorname{Var}(\Pi) = \frac{\alpha \beta}{(\alpha+\beta)^2(\alpha+\beta+1)}.$$

Interpretation:

  • $\alpha$ and $\beta$ control the shape and concentration of the prior.

  • Roughly speaking, $\alpha + \beta$ acts like a prior sample size:
    large $\alpha + \beta$ means a more concentrated (strongly informed) prior; smaller values mean a more diffuse (weakly informed) prior (see the short numerical check below).
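
A short numerical check of the mean and variance formulas, and of how $\alpha + \beta$ controls concentration (a Python sketch assuming SciPy; the parameter values are illustrative):

```python
from scipy.stats import beta

# All three settings share the prior mean 0.2, but alpha + beta grows.
for a, b in [(2, 8), (20, 80), (200, 800)]:
    mean_formula = a / (a + b)
    var_formula = a * b / ((a + b) ** 2 * (a + b + 1))
    print(a, b, mean_formula, beta.mean(a, b), var_formula, beta.var(a, b))
# The formulas match SciPy, and the variance shrinks as alpha + beta grows:
# a larger "prior sample size" gives a more concentrated prior.
```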

Beta–binomial conjugate family

Consider the binomial sampling model

$$P(Y = k \mid \pi) = \binom{n}{k} \, \pi^k (1-\pi)^{n-k},$$

and a beta prior for $\pi$,

$$\pi \sim \operatorname{Beta}(\alpha,\beta).$$

The posterior distribution for $\pi$ given data $(n,k)$ is again a beta distribution:

$$\pi \mid (k,n) \sim \operatorname{Beta}(\alpha + k,\; \beta + n - k).$$

This is the hallmark of a conjugate family: prior and posterior belong to the same parametric family.

The corresponding posterior mean is

$$\mathbb{E}[\pi \mid k, n] = \frac{\alpha + k}{\alpha + \beta + n}.$$

This can be rewritten as a weighted average of

  • the prior mean $\displaystyle \mu_{\text{prior}} = \frac{\alpha}{\alpha+\beta}$ and

  • the sample proportion $\displaystyle \hat\pi_{\text{data}} = \frac{k}{n}$:

$$\mathbb{E}[\pi \mid k, n] = \frac{\alpha + \beta}{\alpha + \beta + n} \, \mu_{\text{prior}} \;+\; \frac{n}{\alpha + \beta + n} \, \hat\pi_{\text{data}}.$$

So the posterior expectation is a compromise between prior belief and empirical data, where $\alpha+\beta$ and $n$ play the role of weights.
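
A tiny sketch (plain Python; the prior and data values are illustrative) confirming that the conjugate update rule and the weighted-average form of the posterior mean agree:

```python
a, b = 2.0, 2.0        # Beta(alpha, beta) prior
n, k = 20, 14          # observed data

# Conjugate update: Beta(alpha + k, beta + n - k)
a_post, b_post = a + k, b + (n - k)
post_mean = a_post / (a_post + b_post)

# Weighted average of prior mean and sample proportion
prior_mean = a / (a + b)
data_prop = k / n
weighted = (a + b) / (a + b + n) * prior_mean + n / (a + b + n) * data_prop

print(post_mean, weighted)   # identical: 16/24 = 0.6667
```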

Principles of Bayesian inference illustrated by the beta–binomial case

The beta–binomial family highlights several general principles of Bayesian inference:

  1. Prior strength vs. data strength

    • When the prior is weak (small $\alpha+\beta$), the posterior is dominated by the likelihood/data.

    • When the prior is strong (large $\alpha+\beta$), the posterior remains closer to the prior, and more data are needed to “overcome” it.

  2. Posterior as compromise
    The posterior distribution is always a compromise between prior and likelihood. This is reflected both in the posterior mean and in the posterior shape.

  3. Effect of additional data
    As $n$ grows larger,

    • the (scaled) likelihood becomes more concentrated (narrower), and

    • the posterior is pulled more towards the data.

    In the limit $n \to \infty$, the posterior is dominated by the data (under mild regularity conditions).

  4. Data order invariance
    For independent observations, it does not matter in which order they are processed: sequentially updating the posterior with subsets of data or updating once with all data yields the same posterior (see the sketch below).

These properties are not unique to the beta–binomial case, but they are especially easy to see there due to the simple analytic update rule.
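
The sketch below (plain Python; the Bernoulli outcomes are illustrative) demonstrates data order invariance for the beta–binomial update: a single batch update and a shuffled sequence of one-at-a-time updates land on the same posterior parameters.

```python
import random

a0, b0 = 2.0, 2.0                        # Beta(alpha, beta) prior
data = [1, 0, 1, 1, 0, 1, 1, 1]          # individual Bernoulli outcomes

# Batch update with k successes out of n trials
k, n = sum(data), len(data)
batch = (a0 + k, b0 + n - k)

# Sequential update, one observation at a time, in shuffled order
a, b = a0, b0
shuffled = data[:]
random.shuffle(shuffled)
for y in shuffled:
    a, b = a + y, b + (1 - y)

print(batch, (a, b))   # same posterior parameters either way
```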

Conjugate prior–likelihood families

Let $\theta$ be a parameter and $y$ data. A conjugate prior family for a likelihood $p(y \mid \theta)$ is a family of prior distributions $p(\theta \mid \eta)$, parameterized by some hyperparameters $\eta$, such that the posterior belongs to the same family:

$$p(\theta \mid y, \eta) \;\propto\; p(y \mid \theta)\, p(\theta \mid \eta) \quad \text{is in the same family as } p(\theta \mid \eta).$$

Informally,

multiplying likelihood and prior produces another distribution of the same functional form as the prior.

Examples of conjugate pairs:

  • Binomial likelihood + Beta prior → Beta posterior (beta–binomial family)

  • Poisson likelihood + Gamma prior → Gamma posterior (gamma–Poisson family)

  • Normal likelihood (for a mean) + Normal prior → Normal posterior (normal–normal family)

  • Uniform likelihood on $[0,\theta]$ + Pareto prior → Pareto posterior (Pareto–uniform family)

Advantages:

  • Closed-form update rules for the posterior.

  • Easy computation of posterior summaries (mean, variance, etc.).

Disadvantages:

  • Available only for relatively simple models.

  • Restrict priors to low-dimensional parametric families, which may not always capture realistic prior knowledge.

  • For many practical models, no convenient conjugate prior exists at all.

In those more complex cases, we use numerical methods (e.g. Markov Chain Monte Carlo) to approximate the posterior.

Posterior simulation via joint sampling

Even when an analytic formula is available (as in the beta–binomial case), it’s useful to think in terms of simulation.

One generic idea is:

  1. Sample parameters $\theta_i$ from the prior $p(\theta)$.

  2. For each $\theta_i$, sample data $y_i$ from the likelihood $p(y \mid \theta_i)$.

  3. Keep only those $\theta_i$ for which $y_i$ equals (or is close to) the observed data $y_{\text{obs}}$.

  4. The retained $\theta_i$ form a sample from the posterior $p(\theta \mid y_{\text{obs}})$.

In the simple beta–binomial example, this corresponds to:

  • sampling $\pi_i \sim \operatorname{Beta}(\alpha,\beta)$,

  • sampling $k_i \sim \operatorname{Binomial}(n,\pi_i)$,

  • retaining only $\pi_i$ where $k_i = k_{\text{obs}}$.

This approach is called posterior simulation via rejection or (in hierarchical settings) ancestral sampling. In practice, more efficient algorithms (such as MCMC) are used for complex models.
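
Here is a minimal rejection-sampling sketch for the beta–binomial case (NumPy and SciPy assumed; the prior and data values are illustrative), compared against the analytic $\operatorname{Beta}(\alpha + k, \beta + n - k)$ posterior:

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
a, b = 2.0, 2.0            # Beta(alpha, beta) prior
n, k_obs = 20, 14          # observed data
n_sims = 200_000

pi_prior = rng.beta(a, b, size=n_sims)     # 1. sample pi_i from the prior
k_sim = rng.binomial(n, pi_prior)          # 2. simulate data from each pi_i
pi_post = pi_prior[k_sim == k_obs]         # 3. keep pi_i that reproduce the data

# Compare to the analytic Beta(alpha + k, beta + n - k) posterior
print(pi_post.mean(), beta.mean(a + k_obs, b + n - k_obs))
print(pi_post.var(), beta.var(a + k_obs, b + n - k_obs))
```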

Poisson distribution for counting processes

Many counting processes (number of arrivals, number of events in a time interval, etc.) are modeled with the Poisson distribution.

A non-negative integer-valued random variable $Y$ has a Poisson distribution with rate parameter $\lambda > 0$, denoted $Y \sim \operatorname{Poisson}(\lambda)$, if

$$P(Y = y \mid \lambda) = \frac{\lambda^y e^{-\lambda}}{y!}, \quad y = 0,1,2,\dots$$

Important properties:

  • Mean

    $$\mathbb{E}[Y] = \lambda$$
  • Variance

    $$\operatorname{Var}(Y) = \lambda.$$

For independent observations $y_1,\dots,y_n$ assumed Poisson with the same rate $\lambda$, the joint likelihood is

$$P(y_1,\dots,y_n \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!} = \left(\prod_{i=1}^{n} \frac{1}{y_i!}\right) \lambda^{\sum_{i=1}^{n} y_i} e^{-n \lambda}.$$

Up to a constant factor not depending on $\lambda$, the kernel is

$$\lambda^{\sum_i y_i} e^{-n\lambda}.$$

Gamma prior and the gamma–Poisson conjugate family

A common prior for a Poisson rate $\lambda > 0$ is the gamma distribution.

A random variable $\Lambda$ has a gamma distribution with shape $s > 0$ and rate $r > 0$, written $\Lambda \sim \operatorname{Gamma}(s,r)$, if its PDF is

$$p(\lambda \mid s,r) = \frac{r^{s}}{\Gamma(s)} \, \lambda^{s-1} e^{-r\lambda}, \quad \lambda > 0.$$

Summaries:

  • Mean

    $$\mathbb{E}[\Lambda] = \frac{s}{r}$$
  • Variance

    $$\operatorname{Var}(\Lambda) = \frac{s}{r^2}.$$

Combining a Poisson likelihood with a gamma prior yields the gamma–Poisson conjugate family.

Given independent observations $y_1,\dots,y_n \sim \operatorname{Poisson}(\lambda)$ and prior $\lambda \sim \operatorname{Gamma}(s,r)$, the posterior is

$$\lambda \mid y_1,\dots,y_n \sim \operatorname{Gamma}\!\Big(s + \sum_{i=1}^{n} y_i,\; r + n\Big).$$

Thus, the update rule is:

  • shape: $s_{\text{posterior}} = s_{\text{prior}} + \sum_i y_i$

  • rate: $r_{\text{posterior}} = r_{\text{prior}} + n$

Again, prior and posterior are from the same family, illustrating conjugacy.
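
A small sketch of the gamma–Poisson update (NumPy and SciPy assumed; the prior and the simulated counts are illustrative). Note that SciPy parameterizes the gamma distribution by shape and scale, where scale $= 1/r$:

```python
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(1)
s, r = 3.0, 1.0                      # Gamma(shape s, rate r) prior
y = rng.poisson(lam=4.0, size=10)    # hypothetical Poisson counts

s_post = s + y.sum()                 # shape: s + sum(y_i)
r_post = r + len(y)                  # rate:  r + n

# SciPy uses scale = 1 / rate
post_mean = gamma.mean(s_post, scale=1 / r_post)
print(s_post, r_post, post_mean)     # posterior mean = (s + sum y) / (r + n)
```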

Normal–normal conjugate family

The normal distribution plays a central role in statistics, partly due to the central limit theorem and partly because it is a maximum entropy distribution under certain constraints.

The PDF of a normal distribution with mean $\mu$ and variance $\sigma^2$ is

$$p(x \mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(x-\mu)^2}{2\sigma^2} \right).$$

Consider the following simple model:

  • Data: $y_1,\dots,y_n$ are i.i.d. $\operatorname{Normal}(\mu, \sigma^2)$, with known variance $\sigma^2$.

  • Prior for the mean: $\mu \sim \operatorname{Normal}(\mu_0, \tau_0^2)$.

The likelihood of the data given $\mu$ is

$$p(y_1,\dots,y_n \mid \mu) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( -\frac{(y_i - \mu)^2}{2\sigma^2} \right).$$

Multiplying likelihood and prior yields a normal posterior for $\mu$:

$$\mu \mid y_1,\dots,y_n \sim \operatorname{Normal}(\mu_n, \tau_n^2),$$

with

  • posterior variance

    $$\tau_n^2 = \left( \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \right)^{-1}$$
  • posterior mean

    $$\mu_n = \tau_n^2 \left( \frac{\mu_0}{\tau_0^2} + \frac{n \, \bar y}{\sigma^2} \right), \quad \bar y = \frac{1}{n} \sum_{i=1}^n y_i.$$

As in the beta–binomial and gamma–Poisson cases, the posterior mean is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar y$; the posterior variance $\tau_n^2$ is smaller than both the prior variance $\tau_0^2$ and the sampling variance $\sigma^2/n$ of the sample mean, reflecting the increased information.
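
A minimal sketch of the normal–normal update (NumPy assumed; all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                                 # known data standard deviation
mu0, tau0 = 0.0, 5.0                        # prior Normal(mu0, tau0^2)
y = rng.normal(loc=3.0, scale=sigma, size=25)

n, ybar = len(y), y.mean()
tau_n2 = 1.0 / (1.0 / tau0**2 + n / sigma**2)          # posterior variance
mu_n = tau_n2 * (mu0 / tau0**2 + n * ybar / sigma**2)  # posterior mean

print(mu_n, tau_n2)   # mu_n lies between mu0 and ybar; tau_n2 < tau0^2 and < sigma^2 / n
```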

Perspective on conjugate families

Although conjugate families (beta–binomial, gamma–Poisson, normal–normal, etc.) provide elegant closed-form solutions, their role in modern Bayesian data analysis is more conceptual than practical:

  1. They provide clean, analytic examples that make it easy to understand how priors, likelihoods, and data interact.

  2. They introduce important probability distributions (beta, gamma, normal, Poisson, …) that are also crucial building blocks in more complex models.

  3. They show how the posterior often becomes a compromise between prior and data, with explicit formulas for how prior strength and sample size interact.

For many realistic models, there is no convenient conjugate prior, and closed-form posteriors are unavailable. In those cases, we turn to numerical methods, especially Markov Chain Monte Carlo (MCMC), which can approximate the posterior without relying on analytic conjugacy.

Even then, we typically still use parametric distributions (such as beta, gamma, and normal) for priors and likelihoods, so the intuition developed from conjugate families remains extremely useful.