Continuous Problems and Conjugate Families
From discrete to continuous Bayesian problems
In most real applications, parameters are naturally treated as continuous. For example, consider the proportion $\theta$ of people in a population who can roll their tongue. In principle, $\theta$ can take on any value in the interval $[0, 1]$.
Conceptually, nothing changes in the Bayesian workflow:
We specify a likelihood model $p(y \mid \theta)$ for the data $y$ given parameters $\theta$.
We specify a prior distribution $p(\theta)$ over $\theta$.
We use Bayes’ theorem to obtain the posterior $p(\theta \mid y)$.
The main difference is that, in the continuous case, we work with probability densities and integrals rather than probability masses and sums.
Binomial distribution: sampling model vs. likelihood
Suppose we observe a count $y$ of “successes” in $n$ independent Bernoulli trials, each with success probability $\theta$.
The binomial distribution gives the probability of observing exactly $y$ successes:
$$p(y \mid \theta, n) = \binom{n}{y}\, \theta^{y} (1 - \theta)^{n - y}, \qquad y = 0, 1, \dots, n.$$
This formula has two complementary interpretations:
Sampling model (data generating model):
If the “true” underlying proportion $\theta$ is known, the binomial distribution tells us how likely different data outcomes $y$ are.
Likelihood function:
For fixed observed data $y$, we can view $L(\theta) = p(y \mid \theta, n)$ as a function of $\theta$. This tells us which parameter values make the observed data most likely.
The distinction between sampling model (probability of data $y$ given $\theta$) and likelihood (function of $\theta$ for fixed data) is fundamental in both frequentist and Bayesian inference.
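As a small illustration, the sketch below (using scipy.stats, with an illustrative choice of $n = 10$ trials) evaluates the same binomial formula in both roles: across possible outcomes $y$ for a fixed $\theta$, and across possible $\theta$ values for a fixed observed $y$.

```python
import numpy as np
from scipy.stats import binom

n = 10  # number of Bernoulli trials (illustrative choice)

# Sampling model: fix theta, ask how probable each outcome y is.
theta_true = 0.3
y_values = np.arange(n + 1)
sampling_probs = binom.pmf(y_values, n, theta_true)
print(sampling_probs.sum())  # sums to 1 over all possible outcomes

# Likelihood: fix the observed y, evaluate the same formula as a function of theta.
y_obs = 4
theta_grid = np.linspace(0, 1, 101)
likelihood = binom.pmf(y_obs, n, theta_grid)
print(theta_grid[np.argmax(likelihood)])  # maximized at theta = y_obs / n = 0.4
```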
Probability density functions (PDFs) and cumulative distribution functions (CDFs)
For continuous random variables, we work with probability density functions (PDFs) instead of mass functions.
Let $X$ be a continuous random variable (e.g. a parameter such as a proportion $\theta$). Its PDF $p(x)$ must satisfy:
Non-negativity: $p(x) \ge 0$ for all $x$.
Normalization: $\int_{-\infty}^{\infty} p(x)\, dx = 1$.
The associated cumulative distribution function (CDF) is
$$F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\, dt.$$
The expectation (mean) of $X$ is
$$\mathbb{E}[X] = \int_{-\infty}^{\infty} x\, p(x)\, dx,$$
and the variance is
$$\operatorname{Var}(X) = \int_{-\infty}^{\infty} \left(x - \mathbb{E}[X]\right)^{2} p(x)\, dx.$$
For parameters restricted to a smaller range (e.g. $\theta \in [0, 1]$), the integration limits are adapted accordingly.
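As a quick numerical check (a sketch only, using a Beta(2, 5) density purely as an illustrative example), the defining properties above can be verified by quadrature:

```python
import numpy as np
from scipy import integrate
from scipy.stats import beta

a, b = 2.0, 5.0                       # illustrative shape parameters
pdf = lambda x: beta.pdf(x, a, b)

# Normalization: the density integrates to 1 over [0, 1].
total, _ = integrate.quad(pdf, 0, 1)

# Mean and variance via their integral definitions.
mean, _ = integrate.quad(lambda x: x * pdf(x), 0, 1)
var, _ = integrate.quad(lambda x: (x - mean) ** 2 * pdf(x), 0, 1)

# CDF at 0.5 as the integral of the density up to 0.5.
cdf_half, _ = integrate.quad(pdf, 0, 0.5)

print(total, mean, var)                # ~1, ~0.286, ~0.026
print(cdf_half, beta.cdf(0.5, a, b))   # the two values agree
```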
Bayes’ theorem for continuous parameters and marginalisation
Let $\theta$ be a continuous parameter (e.g. a proportion $\theta \in [0, 1]$) with prior density $p(\theta)$, and let $y$ denote observed data with likelihood $p(y \mid \theta)$.
Bayes’ theorem (continuous form) says that the posterior density is
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},$$
where the evidence (or marginal likelihood) is
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$
The denominator $p(y)$ is a marginalisation over all possible parameter values: it averages the sampling probability of the data over the prior distribution of $\theta$.
More generally, if we have a partition (or family) of hypotheses $H_1, \dots, H_K$ with prior probabilities $P(H_k)$, the discrete version of the law of total probability is
$$P(y) = \sum_{k=1}^{K} P(y \mid H_k)\, P(H_k).$$
In the continuous case, sums become integrals:
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$
This marginalisation step is what makes many continuous problems analytically hard — the integral is often not available in closed form.
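To see the marginalisation concretely, here is a small sketch (assuming, purely for illustration, a Beta(2, 2) prior and $y = 7$ successes out of $n = 10$) that computes the evidence $p(y)$ by numerically integrating likelihood times prior over $\theta$:

```python
from scipy import integrate
from scipy.stats import beta, binom

n, y_obs = 10, 7     # illustrative data
a, b = 2.0, 2.0      # illustrative Beta prior

# Evidence p(y) = integral of p(y | theta) * p(theta) over theta in [0, 1].
integrand = lambda th: binom.pmf(y_obs, n, th) * beta.pdf(th, a, b)
evidence, _ = integrate.quad(integrand, 0, 1)

# Dividing the unnormalised posterior by the evidence yields a proper density.
post_mass, _ = integrate.quad(lambda th: integrand(th) / evidence, 0, 1)

print(evidence)      # marginal likelihood of the data
print(post_mass)     # ~1: the posterior density integrates to one
```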
Beta distribution as a prior for proportions
For a probability parameter $\theta \in [0, 1]$ (e.g. a proportion or success probability), a very common continuous prior family is the beta distribution.
A random variable $\theta$ has a beta distribution with shape parameters $\alpha > 0$ and $\beta > 0$, written $\theta \sim \mathrm{Beta}(\alpha, \beta)$, if its PDF is
$$p(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}, \qquad 0 \le \theta \le 1.$$
Here $\Gamma(\cdot)$ is the Gamma function, a continuous generalization of the factorial: $\Gamma(n) = (n - 1)!$ for positive integers $n$.
Important summaries:
Mean: $\mathbb{E}[\theta] = \dfrac{\alpha}{\alpha + \beta}$
Variance: $\operatorname{Var}(\theta) = \dfrac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}$
Interpretation:
$\alpha$ and $\beta$ control the shape and concentration of the prior.
Roughly speaking, $\alpha + \beta$ acts like a prior sample size:
large $\alpha + \beta$ means a more concentrated (strongly informed) prior; smaller values mean a more diffuse (weakly informed) prior, as in the sketch below.
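A small sketch (the parameter choices here are just illustrative) comparing a diffuse and a concentrated beta prior that share the same mean:

```python
from scipy.stats import beta

# Two priors with the same mean 0.4 but different "prior sample sizes" alpha + beta.
weak = beta(2, 3)      # alpha + beta = 5  -> diffuse, weakly informed
strong = beta(20, 30)  # alpha + beta = 50 -> concentrated, strongly informed

for name, prior in [("weak", weak), ("strong", strong)]:
    # Same mean, but the strong prior has a much smaller spread and tighter 95% interval.
    print(name, prior.mean(), prior.std(), prior.interval(0.95))
```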
Beta–binomial conjugate family
Consider the binomial sampling model
$$y \mid \theta \sim \mathrm{Binomial}(n, \theta),$$
and a beta prior for $\theta$,
$$\theta \sim \mathrm{Beta}(\alpha, \beta).$$
The posterior distribution for $\theta$ given data $y$ is again a beta distribution:
$$\theta \mid y \sim \mathrm{Beta}(\alpha + y,\; \beta + n - y).$$
This is the hallmark of a conjugate family: prior and posterior belong to the same parametric family.
The corresponding posterior mean is
$$\mathbb{E}[\theta \mid y] = \frac{\alpha + y}{\alpha + \beta + n}.$$
This can be rewritten as a weighted average of
the prior mean $\dfrac{\alpha}{\alpha + \beta}$ and
the sample proportion $\dfrac{y}{n}$:
$$\mathbb{E}[\theta \mid y] = \frac{\alpha + \beta}{\alpha + \beta + n} \cdot \frac{\alpha}{\alpha + \beta} \;+\; \frac{n}{\alpha + \beta + n} \cdot \frac{y}{n}.$$
So the posterior expectation is a compromise between prior belief and empirical data, where $\alpha + \beta$ and $n$ play the role of weights.
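The update rule is simple enough to write out directly. The following sketch (with an illustrative Beta(2, 2) prior and hypothetical data $y = 7$, $n = 10$) computes the posterior parameters and checks that the posterior mean equals the stated weighted average:

```python
alpha, beta_ = 2.0, 2.0   # illustrative Beta prior
n, y = 10, 7              # hypothetical data: 7 successes in 10 trials

# Conjugate update: Beta(alpha, beta) prior + Binomial(n, theta) data.
alpha_post = alpha + y
beta_post = beta_ + n - y

post_mean = alpha_post / (alpha_post + beta_post)

# The same posterior mean as a weighted average of prior mean and sample proportion.
prior_mean = alpha / (alpha + beta_)
w_prior = (alpha + beta_) / (alpha + beta_ + n)
w_data = n / (alpha + beta_ + n)
weighted = w_prior * prior_mean + w_data * (y / n)

print(alpha_post, beta_post)   # Beta(9, 5) posterior
print(post_mean, weighted)     # both equal 9/14 ≈ 0.643
```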
Principles of Bayesian inference illustrated by the beta–binomial case
The beta–binomial family highlights several general principles of Bayesian inference:
Prior strength vs. data strength
When the prior is weak (small $\alpha + \beta$), the posterior is dominated by the likelihood/data.
When the prior is strong (large $\alpha + \beta$), the posterior remains closer to the prior, and more data are needed to “overcome” it.
Posterior as compromise
The posterior distribution is always a compromise between prior and likelihood. This is reflected both in the posterior mean and in the posterior shape.
Effect of additional data
As $n$ grows larger, the (scaled) likelihood becomes more concentrated (narrower), and
the posterior is pulled more towards the data.
In the limit $n \to \infty$, the posterior is dominated by the data (under mild regularity conditions).
Data order invariance
For independent observations, it does not matter in which order they are processed: sequentially updating the posterior with subsets of data or updating once with all data yields the same posterior.
These properties are not unique to the beta–binomial case, but they are especially easy to see there due to the simple analytic update rule.
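To illustrate data order invariance concretely, the sketch below (using hypothetical 0/1 coin-flip data and an illustrative Beta(2, 2) prior) updates the prior one observation at a time, in two different orders, and all at once; all three routes land on the same posterior:

```python
import numpy as np

def update(alpha, beta, flips):
    """Beta-binomial conjugate update for a sequence of 0/1 observations."""
    flips = np.asarray(flips)
    return alpha + flips.sum(), beta + len(flips) - flips.sum()

data = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]   # hypothetical Bernoulli observations
alpha0, beta0 = 2.0, 2.0                # illustrative prior

# Sequential update, original order.
a, b = alpha0, beta0
for x in data:
    a, b = update(a, b, [x])

# Sequential update, reversed order.
a_rev, b_rev = alpha0, beta0
for x in reversed(data):
    a_rev, b_rev = update(a_rev, b_rev, [x])

# Single batch update with all the data at once.
a_batch, b_batch = update(alpha0, beta0, data)

print((a, b), (a_rev, b_rev), (a_batch, b_batch))  # identical posterior parameters
```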
Conjugate prior–likelihood families
Let $\theta$ be a parameter and $y$ data. A conjugate prior family for a likelihood $p(y \mid \theta)$ is a family of prior distributions $p(\theta \mid \eta)$, parameterized by some hyperparameters $\eta$, such that the posterior belongs to the same family:
$$p(\theta \mid y) = p(\theta \mid \eta') \quad \text{for some updated hyperparameters } \eta'.$$
Informally,
multiplying likelihood and prior produces another distribution of the same functional form as the prior.
Examples of conjugate pairs:
Binomial likelihood + Beta prior → Beta posterior (beta–binomial family)
Poisson likelihood + Gamma prior → Gamma posterior (gamma–Poisson family)
Normal likelihood (for a mean) + Normal prior → Normal posterior (normal–normal family)
Uniform likelihood on $[0, \theta]$ + Pareto prior → Pareto posterior (Pareto–uniform family)
Advantages:
Closed-form update rules for the posterior.
Easy computation of posterior summaries (mean, variance, etc.).
Disadvantages:
Available only for relatively simple models.
Restrict priors to low-dimensional parametric families, which may not always capture realistic prior knowledge.
For many practical models, no convenient conjugate prior exists at all.
In those more complex cases, we use numerical methods (e.g. Markov Chain Monte Carlo) to approximate the posterior.
Posterior simulation via joint sampling
Even when an analytic formula is available (as in the beta–binomial case), it’s useful to think in terms of simulation.
One generic idea is:
Sample parameters $\theta^{(s)}$, $s = 1, \dots, S$, from the prior $p(\theta)$.
For each $\theta^{(s)}$, sample data $y^{(s)}$ from the likelihood $p(y \mid \theta^{(s)})$.
Keep only those $\theta^{(s)}$ for which $y^{(s)}$ equals (or is close to) the observed data $y_{\text{obs}}$.
The retained $\theta^{(s)}$ form a sample from the posterior $p(\theta \mid y_{\text{obs}})$.
In the simple beta–binomial example, this corresponds to:
sampling $\theta^{(s)} \sim \mathrm{Beta}(\alpha, \beta)$,
sampling $y^{(s)} \sim \mathrm{Binomial}(n, \theta^{(s)})$,
retaining only those $\theta^{(s)}$ where $y^{(s)} = y_{\text{obs}}$.
This approach is called posterior simulation via rejection or (in hierarchical settings) ancestral sampling. In practice, more efficient algorithms (such as MCMC) are used for complex models.
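A minimal sketch of this rejection scheme for the beta–binomial case (with an illustrative Beta(2, 2) prior and hypothetical observed data $y = 7$ out of $n = 10$); the retained draws should closely match the exact Beta(9, 5) posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha, beta, n, y_obs = 2.0, 2.0, 10, 7   # illustrative prior and hypothetical data
n_sims = 200_000

# 1. Draw parameters from the prior.
theta = rng.beta(alpha, beta, size=n_sims)

# 2. For each draw, simulate data from the likelihood.
y_sim = rng.binomial(n, theta)

# 3. Keep only the draws that reproduce the observed data exactly.
posterior_draws = theta[y_sim == y_obs]

print(len(posterior_draws))                     # number of accepted draws
print(posterior_draws.mean(), posterior_draws.var())
print((alpha + y_obs) / (alpha + beta + n))     # exact posterior mean 9/14 ≈ 0.643
```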
Poisson distribution for counting processes
Many counting processes (number of arrivals, number of events in a time interval, etc.) are modeled with the Poisson distribution.
A non-negative integer-valued random variable $y$ has a Poisson distribution with rate parameter $\lambda > 0$, denoted $y \sim \mathrm{Poisson}(\lambda)$, if
$$p(y \mid \lambda) = \frac{\lambda^{y} e^{-\lambda}}{y!}, \qquad y = 0, 1, 2, \dots$$
Important properties:
Mean: $\mathbb{E}[y] = \lambda$
Variance: $\operatorname{Var}(y) = \lambda$
For $n$ independent observations $y_1, \dots, y_n$ assumed Poisson with the same rate $\lambda$, the joint likelihood is
$$p(y_1, \dots, y_n \mid \lambda) = \prod_{i=1}^{n} \frac{\lambda^{y_i} e^{-\lambda}}{y_i!}.$$
Up to a constant factor not depending on $\lambda$, the kernel is
$$p(y_1, \dots, y_n \mid \lambda) \;\propto\; \lambda^{\sum_i y_i}\, e^{-n\lambda}.$$
Gamma prior and the gamma–Poisson conjugate family
A common prior for a Poisson rate is the gamma distribution.
A random variable $\lambda$ has a gamma distribution with shape $\alpha > 0$ and rate $\beta > 0$, written $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, if its PDF is
$$p(\lambda \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \lambda^{\alpha - 1} e^{-\beta \lambda}, \qquad \lambda > 0.$$
Summaries:
Mean: $\mathbb{E}[\lambda] = \dfrac{\alpha}{\beta}$
Variance: $\operatorname{Var}(\lambda) = \dfrac{\alpha}{\beta^{2}}$
Combining a Poisson likelihood with a gamma prior yields the gamma–Poisson conjugate family.
Given $n$ independent observations $y_1, \dots, y_n \sim \mathrm{Poisson}(\lambda)$ and prior $\lambda \sim \mathrm{Gamma}(\alpha, \beta)$, the posterior is
$$\lambda \mid y_1, \dots, y_n \sim \mathrm{Gamma}\!\left(\alpha + \sum_{i=1}^{n} y_i,\; \beta + n\right).$$
Thus, the update rule is:
shape: $\alpha \;\to\; \alpha + \sum_{i=1}^{n} y_i$
rate: $\beta \;\to\; \beta + n$
Again, prior and posterior are from the same family, illustrating conjugacy.
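A small sketch of the gamma–Poisson update (the Gamma(2, 1) prior and the count data below are purely illustrative), checking the conjugate formula against the resulting posterior summaries:

```python
import numpy as np
from scipy.stats import gamma

alpha, beta = 2.0, 1.0                 # illustrative Gamma(shape, rate) prior
counts = np.array([3, 5, 2, 4, 6, 1])  # hypothetical Poisson counts

# Conjugate update: the shape gains the total count, the rate gains the sample size.
alpha_post = alpha + counts.sum()
beta_post = beta + len(counts)

# Note: scipy parameterises the gamma by shape and *scale* = 1 / rate.
posterior = gamma(a=alpha_post, scale=1.0 / beta_post)

print(alpha_post, beta_post)              # Gamma(23, 7) posterior
print(posterior.mean(), posterior.var())  # 23/7 ≈ 3.29 and 23/49 ≈ 0.47
```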
Normal–normal conjugate family
The normal distribution plays a central role in statistics, partly due to the central limit theorem and partly because it is a maximum entropy distribution under certain constraints.
The PDF of a normal distribution with mean $\mu$ and variance $\sigma^{2}$ is
$$p(y \mid \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, \exp\!\left(-\frac{(y - \mu)^{2}}{2\sigma^{2}}\right).$$
Consider the following simple model:
Data: $y_1, \dots, y_n$ are i.i.d. $\mathcal{N}(\mu, \sigma^{2})$, with known variance $\sigma^{2}$.
Prior for the mean: $\mu \sim \mathcal{N}(\mu_0, \tau_0^{2})$.
The likelihood of the data given $\mu$ is
$$p(y_1, \dots, y_n \mid \mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^{2}}}\, \exp\!\left(-\frac{(y_i - \mu)^{2}}{2\sigma^{2}}\right).$$
Multiplying likelihood and prior yields a normal posterior for $\mu$:
$$\mu \mid y_1, \dots, y_n \sim \mathcal{N}(\mu_n, \tau_n^{2}),$$
with
posterior variance
$$\tau_n^{2} = \left(\frac{1}{\tau_0^{2}} + \frac{n}{\sigma^{2}}\right)^{-1},$$
posterior mean
$$\mu_n = \tau_n^{2} \left(\frac{\mu_0}{\tau_0^{2}} + \frac{n \bar{y}}{\sigma^{2}}\right), \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$
As in the beta–binomial and gamma–Poisson cases, the posterior mean is a weighted average of the prior mean $\mu_0$ and the sample mean $\bar{y}$; the posterior variance $\tau_n^{2}$ is smaller than both the prior variance $\tau_0^{2}$ and the sampling variance $\sigma^{2}/n$ of the sample mean, reflecting the increased information.
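The same precision-weighting formulas take only a few lines. The sketch below uses an illustrative prior (mean 0, variance 4), an assumed known data variance of 1, and a small set of hypothetical observations:

```python
import numpy as np

mu0, tau0_sq = 0.0, 4.0                  # illustrative prior mean and variance
sigma_sq = 1.0                           # known data variance (assumed)
y = np.array([1.2, 0.8, 1.5, 0.9, 1.1])  # hypothetical observations
n, y_bar = len(y), y.mean()

# Posterior variance combines prior precision and data precision.
tau_n_sq = 1.0 / (1.0 / tau0_sq + n / sigma_sq)

# Posterior mean is the precision-weighted average of prior mean and sample mean.
mu_n = tau_n_sq * (mu0 / tau0_sq + n * y_bar / sigma_sq)

print(mu_n, tau_n_sq)  # mean pulled toward y_bar; variance < tau0_sq and < sigma_sq / n
```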
Perspective on conjugate families
Although conjugate families (beta–binomial, gamma–Poisson, normal–normal, etc.) provide elegant closed-form solutions, their role in modern Bayesian data analysis is more conceptual than practical:
They provide clean, analytic examples that make it easy to understand how priors, likelihoods, and data interact.
They introduce important probability distributions (beta, gamma, normal, Poisson, …) that are also crucial building blocks in more complex models.
They show how the posterior often becomes a compromise between prior and data, with explicit formulas for how prior strength and sample size interact.
For many realistic models, there is no convenient conjugate prior, and closed-form posteriors are unavailable. In those cases, we turn to numerical methods, especially Markov Chain Monte Carlo (MCMC), which can approximate the posterior without relying on analytic conjugacy.
Even then, we typically still use parametric distributions (such as beta, gamma, and normal) for priors and likelihoods, so the intuition developed from conjugate families remains extremely useful.