Numerical Simulations, Markov Chain Monte Carlo
The computational curse of Bayesian statistics¶
In Bayesian inference we want the posterior distribution of parameters $\theta$ given data $y$:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}.$$

The denominator is the evidence (or marginal likelihood)

$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta,$$

or, for multiple parameters $\theta = (\theta_1, \dots, \theta_d)$,

$$p(y) = \int \cdots \int p(y \mid \theta_1, \dots, \theta_d)\, p(\theta_1, \dots, \theta_d)\, d\theta_1 \cdots d\theta_d.$$
This integral is usually not available in closed form; only special conjugate prior–likelihood families admit analytic solutions. Reasons:
The prior or likelihood may be multimodal, non-Gaussian, or non-parametric.
The parameter space is often high-dimensional (many parameters or latent variables).
Realistic models (e.g. hierarchical models) create complicated dependency structures.
Because of this, the central problem of practical Bayesian statistics is numerical approximation of the posterior.
Grid approximation and the curse of dimensionality¶
A simple numerical approach is grid approximation:
Choose a grid of parameter values $\theta^{(1)}, \dots, \theta^{(G)}$.

Evaluate the unnormalized posterior

$$\tilde p\big(\theta^{(g)}\big) = p\big(y \mid \theta^{(g)}\big)\, p\big(\theta^{(g)}\big)$$

at each grid point.

Normalize:

$$p\big(\theta^{(g)} \mid y\big) \approx \frac{\tilde p\big(\theta^{(g)}\big)}{\sum_{g'=1}^{G} \tilde p\big(\theta^{(g')}\big)}.$$
For a single parameter, a grid can work reasonably well.
However, for $d$ parameters, if we use $n$ grid points per dimension, we need

$$n^d$$

grid points in total.
This is the curse of dimensionality:
Work grows exponentially with dimension $d$.
Even modest values (e.g. $n = 100$ grid points per dimension with $d = 10$ parameters) become infeasible: $100^{10} = 10^{20}$ grid points.
Hence, naive grid approximation is only practical for very low-dimensional problems where we roughly know the region of interest in advance.
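As a concrete illustration, here is a minimal grid approximation for a one-parameter model (a Binomial likelihood with a flat prior; the data values are made up for the example):

```python
import numpy as np

# Grid approximation for one parameter: Binomial likelihood, flat prior.
# Illustrative data: k = 7 successes in n = 10 trials.
k, n = 7, 10

theta = np.linspace(0.0, 1.0, 1001)               # grid of parameter values
prior = np.ones_like(theta)                       # flat prior
likelihood = theta ** k * (1.0 - theta) ** (n - k)
unnorm_post = likelihood * prior                  # unnormalized posterior

post = unnorm_post / unnorm_post.sum()            # normalize over the grid

# With a flat prior the exact posterior is Beta(8, 4), whose mean is 2/3,
# so we can check the grid result against the closed-form answer.
post_mean = float(np.sum(theta * post))
```

With 1001 grid points in one dimension this is cheap; the same resolution in ten dimensions would already require $1001^{10}$ evaluations.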
Monte Carlo methods and Monte Carlo integration¶
Monte Carlo methods use random samples to solve deterministic problems such as computing expectations or integrals.
Suppose we want to compute the expectation of a function $f$ under a distribution with PDF $p$:

$$\mathbb{E}_p[f(X)] = \int f(x)\, p(x)\, dx.$$

If we can sample independent draws $x^{(1)}, \dots, x^{(N)} \sim p$, a Monte Carlo estimator is

$$\hat f_N = \frac{1}{N} \sum_{i=1}^{N} f\big(x^{(i)}\big).$$

By the law of large numbers, this converges to the true expectation as $N \to \infty$.
For Bayesian inference, we would love to sample from the posterior $p(\theta \mid y)$ and approximate posterior expectations this way:

$$\mathbb{E}[f(\theta) \mid y] \approx \frac{1}{N} \sum_{i=1}^{N} f\big(\theta^{(i)}\big), \qquad \theta^{(i)} \sim p(\theta \mid y).$$
The problem: we do not know how to sample from $p(\theta \mid y)$ directly. Plain Monte Carlo integration with the prior is often very inefficient, because most prior samples fall into regions of very low likelihood.
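A minimal sketch of plain Monte Carlo integration, using $f(x) = x^2$ under a standard normal (chosen so the exact answer, $\operatorname{Var}(X) = 1$, is known):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(X)] for f(x) = x**2, X ~ N(0, 1).
# The exact value is Var(X) = 1, so we can check the estimator.
N = 100_000
x = rng.standard_normal(N)
estimate = float(np.mean(x ** 2))
```

The Monte Carlo error shrinks like $1/\sqrt{N}$, independent of dimension, which is why sampling methods can succeed where grids fail.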
Markov processes and Markov chains¶
To construct better sampling algorithms, we introduce Markov processes.
A stochastic process $(X_t)_{t = 0, 1, 2, \dots}$ with values in some state space $S$ is a Markov process if it satisfies the Markov property:

$$P(X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)$$

for all $t$ and states $x_0, \dots, x_{t+1}$.
For a discrete state space $S = \{s_1, \dots, s_k\}$, the process is characterized by:

Transition probabilities

$$p_{ij} = P(X_{t+1} = s_j \mid X_t = s_i),$$

collected in a transition matrix $P$ with entries $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$.
A sequence of states generated by such a process is called a Markov chain.
The key idea for MCMC is: construct a Markov chain whose stationary distribution is the posterior $p(\theta \mid y)$. Then, the chain will spend time in different regions of parameter space in proportion to their posterior probability.
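To make the stationary-distribution idea concrete, the following sketch simulates a small discrete chain (the transition matrix is an arbitrary illustrative choice) and compares long-run visit frequencies to the stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# A 3-state Markov chain; each row of the transition matrix sums to 1.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])

# Approximate the stationary distribution pi (satisfying pi @ P = pi)
# by iterating the chain's state distribution until it stops changing.
pi = np.array([1.0, 0.0, 0.0])
for _ in range(1000):
    pi = pi @ P

# Simulate the chain and record long-run visit frequencies.
T = 100_000
state = 0
counts = np.zeros(3)
for _ in range(T):
    state = rng.choice(3, p=P[state])
    counts[state] += 1
freq = counts / T
```

The visit frequencies match `pi` closely; MCMC exploits exactly this property, with the posterior playing the role of `pi`.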
Random walks as simple continuous Markov processes¶
A simple continuous-state Markov process is a random walk.
One-dimensional example:

$$X_{t+1} = X_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),$$

where each $\varepsilon_t$ is an independent Gaussian “step”. The transition density (proposal) is

$$q(x' \mid x) = \mathcal{N}(x' \mid x, \sigma^2).$$

Multivariate example in $\mathbb{R}^d$:

$$X_{t+1} = X_t + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0, \Sigma),$$

with covariance matrix $\Sigma$ controlling the shape and step sizes in different directions.
Random walks are easy to simulate, but by themselves they do not target the posterior. We need to bias the random walk so that it explores the posterior distribution.
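A quick simulation illustrates why an unbiased random walk cannot serve as a posterior sampler on its own: its variance grows linearly with time, so it never settles into a stationary distribution (step size and walk count below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulate many independent Gaussian random walks X_{t+1} = X_t + eps_t,
# eps_t ~ N(0, sigma^2), all started at X_0 = 0.
sigma = 1.0
T, n_walks = 100, 10_000
steps = rng.normal(0.0, sigma, size=(n_walks, T))
paths = np.cumsum(steps, axis=1)       # paths[:, t-1] holds X_t

# An unbiased random walk has no stationary distribution:
# Var(X_t) = t * sigma**2 keeps growing linearly with t.
var_end = float(paths[:, -1].var())
```

Here `var_end` is close to $T \sigma^2 = 100$, confirming the linear growth; biasing the walk toward high-posterior regions is what the Metropolis algorithm adds.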
The Metropolis algorithm¶
The Metropolis algorithm is a classic Markov Chain Monte Carlo (MCMC) method. It constructs a Markov chain whose stationary distribution is the posterior $p(\theta \mid y)$.
We assume we can evaluate the unnormalized posterior

$$\tilde p(\theta) = p(y \mid \theta)\, p(\theta) \propto p(\theta \mid y)$$

up to a constant factor (we do not need the evidence $p(y)$).
Algorithm outline:
Initialization
Choose an initial parameter value $\theta^{(0)}$.

For $t = 0, 1, 2, \dots$:

(a) Propose a new state

Sample a proposal $\theta^\ast \sim q(\theta^\ast \mid \theta^{(t)})$, e.g. a random walk proposal $\theta^\ast = \theta^{(t)} + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2)$.

(b) Compute acceptance probability

For a symmetric proposal density ($q(\theta^\ast \mid \theta^{(t)}) = q(\theta^{(t)} \mid \theta^\ast)$), define

$$\alpha = \min\!\left(1, \frac{\tilde p(\theta^\ast)}{\tilde p(\theta^{(t)})}\right).$$

(c) Accept or reject

Draw $u \sim \mathrm{Uniform}(0, 1)$ and set

$$\theta^{(t+1)} = \begin{cases} \theta^\ast & \text{if } u \le \alpha, \\ \theta^{(t)} & \text{otherwise.} \end{cases}$$
Intuition:
Moves to higher posterior density are always accepted (ratio $\ge 1$).
Moves to lower posterior density are sometimes accepted, with probability equal to the ratio of posterior densities.
The resulting Markov chain visits regions in proportion to their posterior probability mass.
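The outline above can be sketched in a few lines. This toy implementation (function names and the Gaussian target are illustrative choices) only ever evaluates an unnormalized log density, exactly as the algorithm requires:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative unnormalized target: proportional to a N(3, 2^2) density,
# so the true mean (3) and standard deviation (2) are known for checking.
def log_unnorm_post(theta):
    return -(theta - 3.0) ** 2 / (2.0 * 2.0 ** 2)

def metropolis(log_p, theta0, n_samples, step_sd):
    samples = np.empty(n_samples)
    theta = theta0
    for t in range(n_samples):
        proposal = theta + rng.normal(0.0, step_sd)   # symmetric random walk
        # Accept with probability min(1, p(proposal) / p(theta)); comparing
        # on the log scale avoids numerical overflow/underflow.
        if np.log(rng.uniform()) < log_p(proposal) - log_p(theta):
            theta = proposal
        samples[t] = theta
    return samples

samples = metropolis(log_unnorm_post, theta0=0.0, n_samples=50_000, step_sd=2.5)
burned = samples[5_000:]   # discard burn-in
```

After burn-in, the sample mean and standard deviation recover the target's parameters, even though the normalization constant was never computed.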
Acceptance probability and detailed balance¶
Let $\pi(\theta) \propto \tilde p(\theta)$ denote the target posterior (up to normalization). The Metropolis algorithm defines transition probabilities

For $\theta' \ne \theta$:

$$T(\theta' \mid \theta) = q(\theta' \mid \theta)\, \alpha(\theta, \theta').$$

For staying at the same point:

$$T(\theta \mid \theta) = 1 - \int q(\theta' \mid \theta)\, \alpha(\theta, \theta')\, d\theta'.$$

For a symmetric proposal density $q$ and acceptance probability

$$\alpha(\theta, \theta') = \min\!\left(1, \frac{\pi(\theta')}{\pi(\theta)}\right),$$

one can verify the detailed balance condition

$$\pi(\theta)\, T(\theta' \mid \theta) = \pi(\theta')\, T(\theta \mid \theta')$$

for all $\theta, \theta'$. Detailed balance implies that $\pi$ is a stationary distribution of the Markov chain.
This is the mathematical reason why the Metropolis algorithm samples from the posterior in the long run.
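Detailed balance can also be checked numerically. The sketch below builds the Metropolis transition matrix for a tiny discrete target (the three weights are arbitrary) and verifies both detailed balance and stationarity:

```python
import numpy as np

# Tiny discrete target: three states with arbitrary unnormalized weights.
w = np.array([1.0, 3.0, 6.0])
pi = w / w.sum()                    # normalized target distribution

n = len(w)
q = np.full((n, n), 1.0 / n)        # symmetric proposal: pick any state uniformly

# Metropolis transition matrix: T[i, j] = q(j|i) * min(1, w_j / w_i) for j != i,
# with the rejected probability mass staying at state i.
T = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        if i != j:
            T[i, j] = q[i, j] * min(1.0, w[j] / w[i])
    T[i, i] = 1.0 - T[i].sum()

# Detailed balance: pi_i * T[i, j] == pi_j * T[j, i] for all i, j,
# i.e. the matrix of probability flows is symmetric.
flow = pi[:, None] * T
balanced = bool(np.allclose(flow, flow.T))

# Detailed balance implies stationarity: pi @ T == pi.
stationary = bool(np.allclose(pi @ T, pi))
```

Both checks pass, and note that only the unnormalized weights `w` entered the acceptance ratios.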
Hastings extension for non-symmetric proposals¶
Hastings generalized the Metropolis algorithm to allow non-symmetric proposal distributions $q(\theta' \mid \theta)$.

The modified acceptance probability is

$$\alpha(\theta, \theta') = \min\!\left(1, \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)}\right).$$

With this choice, detailed balance still holds:

$$\pi(\theta)\, T(\theta' \mid \theta) = \pi(\theta')\, T(\theta \mid \theta'),$$

where $T(\theta' \mid \theta) = q(\theta' \mid \theta)\, \alpha(\theta, \theta')$.
Special case:
If $q$ is symmetric, i.e. $q(\theta' \mid \theta) = q(\theta \mid \theta')$, the $q$-terms cancel and we recover the standard Metropolis acceptance rule.
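As a worked example of the Hastings correction, consider a multiplicative (lognormal) random-walk proposal for a positive parameter; for this particular proposal the correction ratio $q(\theta \mid \theta') / q(\theta' \mid \theta)$ simplifies to $\theta' / \theta$. The Gamma target and tuning values below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)

# Illustrative target for a positive parameter: Gamma(shape=4, rate=1),
# with unnormalized log density (a - 1) * log(x) - x on x > 0.
a = 4.0
def log_unnorm(x):
    return (a - 1.0) * np.log(x) - x

# Multiplicative random-walk proposal: x' = x * exp(N(0, s^2)).
# This lognormal proposal is NOT symmetric in (x, x'); the Hastings
# correction q(x | x') / q(x' | x) works out to x' / x here.
s = 0.5
x = 1.0
n_samples = 60_000
samples = np.empty(n_samples)
for t in range(n_samples):
    x_prop = x * np.exp(rng.normal(0.0, s))
    # log acceptance ratio: log target ratio + log Hastings correction
    log_alpha = log_unnorm(x_prop) - log_unnorm(x) + np.log(x_prop / x)
    if np.log(rng.uniform()) < log_alpha:
        x = x_prop
    samples[t] = x

burned = samples[5_000:]   # Gamma(4, 1) has mean 4
```

Dropping the `np.log(x_prop / x)` correction term would silently bias the sampler; with it, the chain targets the Gamma density correctly and proposals automatically stay positive.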
Diagnosing MCMC: trace plots and autocorrelation¶
Because MCMC produces correlated samples, we must check whether our chain converged and mixes well.
Two basic diagnostic ideas:
Trace plots
Plot $\theta^{(t)}$ against iteration $t$. A well-mixing chain
explores all relevant regions of the posterior,
does not get “stuck” in a small sub-region,
does not show long-term trends after the burn-in phase.
Autocorrelation function (ACF)
For a (stationary) stochastic process $(X_t)$ with mean $\mu$ and variance $\sigma^2$, the theoretical autocorrelation at lag $k$ is

$$\rho_k = \frac{\operatorname{Cov}(X_t, X_{t+k})}{\sigma^2}.$$

Properties:
$\rho_0 = 1$ (perfect correlation with itself),
for a well-mixing chain, $\rho_k$ should drop to near zero for relatively small lags $k$,
independent (“white noise”) samples have $\rho_k = 0$ for all $k \ge 1$.
In practice, we compute sample autocorrelations from the MCMC output and inspect how fast they decay.
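A minimal sample-ACF computation, applied to white noise and to an AR(1) series that mimics a slowly mixing chain (the helper name and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

def sample_acf(x, max_lag):
    """Sample autocorrelations rho_1 .. rho_max_lag of a 1-D array."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    return np.array([np.dot(xc[:-k], xc[k:]) / denom
                     for k in range(1, max_lag + 1)])

# White noise: autocorrelations should be near zero at every lag.
white = rng.standard_normal(20_000)
acf_white = sample_acf(white, 10)

# AR(1) with phi = 0.9 mimics a slowly mixing chain; its theoretical
# autocorrelation at lag k is phi**k, decaying only gradually.
phi = 0.9
ar = np.empty(20_000)
ar[0] = 0.0
eps = rng.standard_normal(20_000)
for t in range(1, 20_000):
    ar[t] = phi * ar[t - 1] + eps[t]
acf_ar = sample_acf(ar, 10)
```

The white-noise ACF hovers near zero from lag 1 on, while the AR(1) ACF starts near 0.9 and decays slowly; MCMC output closer to the first pattern is preferable.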
Diagnostics: $\hat R$ (Rhat)¶
A single chain can be misleading. A common practice is to run multiple chains from different starting points and compare them.
The Gelman–Rubin statistic (often denoted $\hat R$) summarizes how much the chains agree with each other.

Suppose we run $M$ chains, each of length $N$, and let $\theta_m^{(n)}$ be the $n$-th draw from chain $m$.

Define:

Chain means:

$$\bar\theta_m = \frac{1}{N} \sum_{n=1}^{N} \theta_m^{(n)}$$

Overall mean:

$$\bar\theta = \frac{1}{M} \sum_{m=1}^{M} \bar\theta_m$$

Between-chain variance:

$$B = \frac{N}{M - 1} \sum_{m=1}^{M} \big(\bar\theta_m - \bar\theta\big)^2$$

Within-chain variance:

$$W = \frac{1}{M} \sum_{m=1}^{M} \left[ \frac{1}{N - 1} \sum_{n=1}^{N} \big(\theta_m^{(n)} - \bar\theta_m\big)^2 \right]$$

An estimator of the marginal posterior variance is

$$\widehat{\operatorname{Var}}(\theta \mid y) = \frac{N - 1}{N}\, W + \frac{1}{N}\, B.$$

The potential scale reduction factor is

$$\hat R = \sqrt{\frac{\widehat{\operatorname{Var}}(\theta \mid y)}{W}}.$$
Interpretation:
If all chains have mixed well and explore the same distribution, we expect $\widehat{\operatorname{Var}}(\theta \mid y) \approx W$ and thus $\hat R \approx 1$.
Values substantially larger than 1 indicate disagreement between chains (lack of convergence).
A common rule of thumb: require $\hat R$ close to 1 (e.g. below about 1.01, or below 1.1 in older guidelines) for all monitored parameters.
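The formulas above translate directly into code. This sketch (the `rhat` helper is hypothetical, not a library function) computes $\hat R$ for well-mixed chains and for a deliberately broken set of chains:

```python
import numpy as np

rng = np.random.default_rng(5)

def rhat(chains):
    """Gelman-Rubin potential scale reduction factor.
    `chains` has shape (M, N): M chains with N draws each."""
    M, N = chains.shape
    chain_means = chains.mean(axis=1)
    B = N * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    var_hat = (N - 1) / N * W + B / N         # pooled variance estimate
    return float(np.sqrt(var_hat / W))

# Well-mixed case: four chains drawn from the same distribution.
good = rng.standard_normal((4, 2_000))

# Broken case: one chain is stuck in a different region.
bad = good.copy()
bad[0] += 5.0
```

`rhat(good)` lands very close to 1, while `rhat(bad)` is far above 1, flagging the disagreement between chains.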
Diagnostics: effective sample size (ESS)¶
Because MCMC draws are autocorrelated, $N$ draws from a chain contain less information than $N$ independent samples.
The effective sample size (ESS) quantifies how many independent samples the correlated draws are “worth”.
For a single chain of length $N$ with (theoretical) autocorrelations $\rho_k$, the ESS for a scalar parameter can be approximated by

$$\mathrm{ESS} = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k}.$$

If the chain is nearly independent (autocorrelations near zero), then $\mathrm{ESS} \approx N$.

If the chain mixes slowly (large positive autocorrelations), $\mathrm{ESS}$ can be much smaller than $N$.
A practical rule of thumb from the slides:
Aim for a ratio

$$\frac{\mathrm{ESS}}{N} > 0.1$$

(effective sample size larger than about 10% of the nominal number of draws).
In practice, modern software (PyMC, Stan, etc.) computes $\hat R$ and ESS automatically for each parameter.
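A crude ESS estimator based on the formula above (truncating the autocorrelation sum at the first negative estimate, a common practical simplification; the `ess` helper is hypothetical and much simpler than what ArviZ or Stan implement):

```python
import numpy as np

rng = np.random.default_rng(6)

def ess(x):
    """Crude ESS: N / (1 + 2 * sum of autocorrelations), truncating
    the sum at the first negative autocorrelation estimate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.dot(xc, xc)
    rho_sum = 0.0
    for k in range(1, n):
        rho = np.dot(xc[:-k], xc[k:]) / denom
        if rho < 0:
            break
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)

# Nearly independent draws: ESS close to N.
iid = rng.standard_normal(10_000)

# Strongly autocorrelated AR(1) draws: ESS much smaller than N.
phi = 0.95
ar = np.empty(10_000)
ar[0] = 0.0
eps = rng.standard_normal(10_000)
for t in range(1, 10_000):
    ar[t] = phi * ar[t - 1] + eps[t]
```

For the independent draws `ess(iid)` is close to 10,000, while for the sticky AR(1) series `ess(ar)` collapses to a few hundred, well below the 10% rule of thumb.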
Advanced MCMC: Hamiltonian Monte Carlo (HMC)¶
The Metropolis algorithm is simple but can be very inefficient for complex or high-dimensional posteriors:
Proposals are uninformed about the shape of the posterior.
Large proposed steps are often rejected; small steps lead to strong autocorrelation and slow exploration.
Manual tuning of the proposal scale is required.
Hamiltonian Monte Carlo (HMC) addresses these issues by borrowing ideas from classical mechanics.
We introduce:
Position: $\theta$ (the parameter vector),

Momentum: $r$ (an auxiliary variable),

Potential energy:

$$U(\theta) = -\log \tilde p(\theta),$$

Kinetic energy (for a mass matrix $M$):

$$K(r) = \tfrac{1}{2}\, r^\top M^{-1} r,$$

Hamiltonian (total energy):

$$H(\theta, r) = U(\theta) + K(r).$$

The joint target distribution over $(\theta, r)$ is

$$p(\theta, r) \propto \exp\big(-H(\theta, r)\big) = \tilde p(\theta)\, \exp\big(-K(r)\big).$$

Hamilton’s equations of motion are

$$\frac{d\theta}{dt} = \frac{\partial H}{\partial r} = M^{-1} r, \qquad \frac{dr}{dt} = -\frac{\partial H}{\partial \theta} = -\nabla U(\theta).$$
By approximately integrating these equations in time, HMC produces proposals that move along contours of constant energy, exploring the posterior in long, coherent trajectories with relatively low rejection rates.
HMC algorithm: leapfrog integration and Metropolis correction¶
A sketch of one HMC update step:
Current state: $\theta^{(t)}$.

Sample a fresh momentum

$$r \sim \mathcal{N}(0, M).$$

Simulate Hamiltonian dynamics

Use a leapfrog integrator with step size $\epsilon$ for $L$ steps to map $(\theta^{(t)}, r)$ to a proposal $(\theta^\ast, r^\ast)$.

The leapfrog integrator approximately conserves the Hamiltonian while being reversible and volume-preserving.

Metropolis acceptance step

Accept or reject the proposal with probability

$$\alpha = \min\Big(1, \exp\big(H(\theta^{(t)}, r) - H(\theta^\ast, r^\ast)\big)\Big).$$

If accepted, set $\theta^{(t+1)} = \theta^\ast$.
If rejected, set $\theta^{(t+1)} = \theta^{(t)}$.
Advantages over simple random-walk Metropolis:
Much longer steps in parameter space with relatively high acceptance rate.
Strongly reduced autocorrelation between successive samples (higher ESS for the same $N$).
Better performance in higher dimensions and for correlated parameters.
However, HMC still requires tuning of the step size $\epsilon$, the number of steps $L$, and the mass matrix $M$.
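A compact HMC sketch for a two-dimensional standard normal target (chosen so that $U$ and $\nabla U$ are trivial; the step size, trajectory length, and $M = I$ are illustrative tuning choices):

```python
import numpy as np

rng = np.random.default_rng(8)

# Illustrative target: 2-D standard normal, so that
# U(theta) = 0.5 * ||theta||^2 and grad U(theta) = theta, with M = I.
def U(theta):
    return 0.5 * np.dot(theta, theta)

def grad_U(theta):
    return theta

def hmc_step(theta, eps, L):
    r = rng.standard_normal(theta.shape)           # fresh momentum ~ N(0, I)
    H_start = U(theta) + 0.5 * np.dot(r, r)
    # Leapfrog integration: half momentum step, L position/momentum steps,
    # with the final momentum update reduced back to a half step.
    th = theta.copy()
    rr = r - 0.5 * eps * grad_U(th)
    for _ in range(L):
        th = th + eps * rr
        rr = rr - eps * grad_U(th)
    rr = rr + 0.5 * eps * grad_U(th)               # undo the extra half step
    H_end = U(th) + 0.5 * np.dot(rr, rr)
    # Metropolis correction for the numerical integration error.
    if np.log(rng.uniform()) < H_start - H_end:
        return th
    return theta

n_samples = 5_000
theta = np.zeros(2)
samples = np.empty((n_samples, 2))
for t in range(n_samples):
    theta = hmc_step(theta, eps=0.2, L=10)
    samples[t] = theta
```

Because the leapfrog integrator nearly conserves $H$, almost every proposal is accepted despite the long trajectories, and successive samples are only weakly correlated.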
NUTS: the No-U-Turn Sampler¶
The No-U-Turn Sampler (NUTS) is an extension of HMC designed to automate much of the tuning:
It adaptively chooses the number of leapfrog steps so that trajectories do not “turn back” on themselves (no U-turns).
It also adapts the step size during a warm-up phase.
In practice, NUTS is a state-of-the-art general-purpose MCMC algorithm for many Bayesian models.
Most modern probabilistic programming frameworks (PyMC, Stan, etc.) use variants of HMC / NUTS as their default posterior sampling algorithms.
Probabilistic programming with PyMC¶
Probabilistic programming libraries automate many of the steps in Bayesian analysis:
You specify:
priors for parameters,
a likelihood model for the data,
the library handles:
construction of the joint and posterior distributions,
running MCMC (typically NUTS / HMC),
producing diagnostics (trace plots, $\hat R$, ESS),
summarizing posterior distributions.
In PyMC, the workflow is roughly:

Define a model using `with pm.Model():` and PyMC’s random variable primitives.

Call `pm.sample()` to run MCMC and obtain a trace (posterior samples).

Use tools like:

`pm.plot_trace(...)` for trace plots,
`pm.plot_autocorr(...)` for autocorrelation,
`pm.rhat(...)` for $\hat R$,
`pm.ess(...)` for effective sample size,
`az.summary(...)` (from ArviZ) for tabular summaries.
This allows you to focus on modeling rather than on implementing MCMC algorithms and diagnostics by hand.