Numerical Simulations, Markov Chain Monte Carlo

The computational curse of Bayesian statistics

In Bayesian inference we want the posterior distribution of parameters θ given data d:

p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}.

The denominator is the evidence (or marginal likelihood)

p(d) = \int p(d \mid \theta)\, p(\theta)\, d\theta,

or, for multiple parameters θ = (θ_1, …, θ_d),

p(d) = \int_{\mathbb{R}^d} p(d \mid \theta_1,\dots,\theta_d)\, p(\theta_1,\dots,\theta_d)\, d\theta_1 \cdots d\theta_d.

This integral is often not available in closed form except for special conjugate families. Reasons:

  • The prior p(θ) or likelihood p(d | θ) may be multimodal, non-Gaussian, or non-parametric.

  • The parameter space is often high-dimensional (many parameters or latent variables).

  • Realistic models (e.g. hierarchical models) create complicated dependency structures.

Because of this, the central problem of practical Bayesian statistics is numerical approximation of the posterior.

Grid approximation and the curse of dimensionality

A simple numerical approach is grid approximation:

  1. Choose a grid of parameter values {θ^(1), …, θ^(N)}.

  2. Evaluate the unnormalized posterior

    \tilde p(\theta^{(i)} \mid d) = p(d \mid \theta^{(i)})\, p(\theta^{(i)})

    at each grid point.

  3. Normalize:

    p(\theta^{(i)} \mid d) = \frac{\tilde p(\theta^{(i)} \mid d)}{\sum_{j=1}^N \tilde p(\theta^{(j)} \mid d)}.

For a single parameter, a grid can work reasonably well.
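
As an illustration, here is a minimal sketch of the three steps for a hypothetical beta-binomial model (a Beta(2, 2) prior on a coin bias θ and 7 heads in 10 flips; the numbers are made up for this example):

```python
import numpy as np
from scipy import stats

# Hypothetical example: Beta(2, 2) prior on a coin bias, 7 heads in 10 flips
heads, flips = 7, 10

theta_grid = np.linspace(0.0, 1.0, 1000)          # 1. grid of parameter values
prior = stats.beta.pdf(theta_grid, 2, 2)
likelihood = stats.binom.pmf(heads, flips, theta_grid)

unnormalized = likelihood * prior                 # 2. unnormalized posterior
posterior = unnormalized / unnormalized.sum()     # 3. normalize over the grid

print("posterior mean:", np.sum(theta_grid * posterior))
```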

However, for d parameters, if we use N grid points per dimension, we need

N^d

grid points in total.

This is the curse of dimensionality:

  • Work grows exponentially with dimension d.

  • Even modest values (e.g. N = 100, d = 6) become infeasible: 100^6 = 10^12 grid points.

Hence, naive grid approximation is only practical for very low-dimensional problems where we roughly know the region of interest in advance.

Monte Carlo methods and Monte Carlo integration

Monte Carlo methods use random samples to solve deterministic problems such as computing expectations or integrals.

Suppose we want to compute the expectation of a function f(θ) under a distribution with PDF p(θ):

\mathbb{E}_{p}[f(\theta)] = \int f(\theta)\, p(\theta)\, d\theta.

If we can sample independent draws θ^(1), …, θ^(N) ~ p(θ), a Monte Carlo estimator is

\widehat{\mathbb{E}}_{p}[f(\theta)] = \frac{1}{N} \sum_{i=1}^{N} f\big(\theta^{(i)}\big).

By the law of large numbers, this converges to the true expectation as N → ∞.
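
A quick sketch of this estimator: approximate E[θ²] under a standard normal, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
samples = rng.normal(0.0, 1.0, size=N)   # theta^(i) ~ N(0, 1)
estimate = np.mean(samples**2)           # (1/N) sum_i f(theta^(i)) with f(x) = x^2

print(estimate)                          # close to the exact value 1
```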

For Bayesian inference, we would love to sample from the posterior and approximate posterior expectations this way:

\mathbb{E}[f(\theta) \mid d] = \int f(\theta)\, p(\theta \mid d)\, d\theta \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f\big(\theta^{(i)}\big), \quad \theta^{(i)} \sim p(\theta \mid d).

The problem: we do not know how to sample from p(θ | d) directly. Plain Monte Carlo integration with the prior p(θ) is often very inefficient, because most prior samples fall into regions of very low likelihood.

Markov processes and Markov chains

To construct better sampling algorithms, we introduce Markov processes.

A stochastic process {X_t}, t = 0, 1, 2, …, with values in some state space Ω is a Markov process if it satisfies the Markov property:

P(X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)

for all t and states x_0, …, x_{t+1}.

For a discrete state space Ω = {1, …, K}, the process is characterized by:

  • Transition probabilities

    P_{ij} = P(X_{t+1} = j \mid X_t = i),

    collected in a transition matrix P with entries P_ij ≥ 0 and ∑_j P_ij = 1.

A sequence of states X_0, X_1, X_2, … generated by such a process is called a Markov chain.

The key idea of MCMC is to construct a Markov chain whose stationary distribution is the posterior p(θ | d). The chain then spends time in different regions of parameter space in proportion to their posterior probability.
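
As a small illustration of a stationary distribution (a hypothetical 3-state chain, unrelated to any posterior), the long-run fraction of time the chain spends in each state matches the eigenvector-based stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows sum to 1: P[i, j] = P(X_{t+1} = j | X_t = i)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# Simulate the chain and record how often each state is visited
T, x = 100_000, 0
visits = np.zeros(3)
for _ in range(T):
    x = rng.choice(3, p=P[x])
    visits[x] += 1
print("empirical occupancy:    ", visits / T)

# Stationary distribution: left eigenvector of P for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("stationary distribution:", pi / pi.sum())
```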

Random walks as simple continuous Markov processes

A simple continuous-state Markov process is a random walk.

One-dimensional example:

\theta_{t+1} = \theta_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),

where each ε_t is an independent Gaussian “step”. The transition density (proposal) is

q(\theta' \mid \theta) = \mathcal{N}(\theta'; \theta, \sigma^2).

Multivariate example in ℝ^d:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \Sigma),

with covariance matrix Σ controlling the shape and step sizes in different directions.

Random walks are easy to simulate, but by themselves they do not target the posterior. We need to bias the random walk so that it explores the posterior distribution.
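
Simulating the unbiased walk itself takes only a couple of lines (a sketch with an arbitrary step size σ):

```python
import numpy as np

rng = np.random.default_rng(2)

T, sigma = 1_000, 0.5
steps = rng.normal(0.0, sigma, size=T)   # eps_t ~ N(0, sigma^2)
theta = np.cumsum(steps)                 # theta_{t+1} = theta_t + eps_t
```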

The Metropolis algorithm

The Metropolis algorithm is a classic Markov Chain Monte Carlo (MCMC) method. It constructs a Markov chain {θ_t} whose stationary distribution is the posterior p(θ | d).

We assume we can evaluate the unnormalized posterior

\tilde p(\theta \mid d) = p(d \mid \theta)\, p(\theta),

up to a constant factor (we do not need p(d)).

Algorithm outline:

  1. Initialization
    Choose an initial parameter value θ_0.

  2. For t = 0, 1, 2, …:

    (a) Propose a new state
    Sample a proposal

    \theta' \sim q(\theta' \mid \theta_t),

    e.g. a random walk proposal q(θ′ | θ_t) = N(θ′; θ_t, σ²).

    (b) Compute acceptance probability
    For a symmetric proposal density (q(θ′ | θ) = q(θ | θ′)), define

    \alpha(\theta_t, \theta') = \min\!\left( 1,\, \frac{\tilde p(\theta' \mid d)}{\tilde p(\theta_t \mid d)} \right) = \min\!\left( 1,\, \frac{p(d \mid \theta')\, p(\theta')}{p(d \mid \theta_t)\, p(\theta_t)} \right).

    (c) Accept or reject
    Draw u ~ Uniform(0, 1) and set

    \theta_{t+1} = \begin{cases} \theta', & \text{if } u \le \alpha(\theta_t, \theta'), \\ \theta_t, & \text{otherwise.} \end{cases}

Intuition:

  • Moves to higher posterior density are always accepted (ratio ≥ 1).

  • Moves to lower posterior density are sometimes accepted, with probability equal to the ratio of posterior densities.

  • The resulting Markov chain visits regions in proportion to their posterior probability mass.
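
Putting the steps together, here is a minimal NumPy sketch of random-walk Metropolis. It works with the log of the unnormalized posterior for numerical stability; log_post is any user-supplied callable (an assumption of this sketch, not a fixed interface):

```python
import numpy as np

def metropolis(log_post, theta0, sigma, n_samples, seed=0):
    """Random-walk Metropolis for a scalar parameter.

    log_post(theta) returns log p(d | theta) + log p(theta)
    (the unnormalized log posterior; p(d) is never needed).
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = rng.normal(theta, sigma)               # (a) symmetric proposal
        log_alpha = log_post(proposal) - log_post(theta)  # (b) log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:             # (c) accept or reject
            theta = proposal
        samples[t] = theta
    return samples

# Usage sketch: target a standard normal "posterior"
draws = metropolis(lambda th: -0.5 * th**2, theta0=0.0, sigma=1.0, n_samples=10_000)
```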

Acceptance probability and detailed balance

Let π(θ) = p(θ | d) denote the target posterior (up to normalization). The Metropolis algorithm defines transition probabilities

  • For θ′ ≠ θ:

    T(\theta \to \theta') = q(\theta' \mid \theta)\, \alpha(\theta, \theta'),

  • For staying at the same point:

    T(\theta \to \theta) = 1 - \int T(\theta \to \theta')\, d\theta'.

For a symmetric proposal density q(θ′ | θ) = q(θ | θ′) and acceptance probability

\alpha(\theta, \theta') = \min\!\left( 1,\, \frac{\pi(\theta')}{\pi(\theta)} \right),

one can verify the detailed balance condition

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta')\, T(\theta' \to \theta)

for all θ, θ′. Detailed balance implies that π(θ) is a stationary distribution of the Markov chain.
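
To see why detailed balance holds, suppose without loss of generality that π(θ′) ≤ π(θ). Then α(θ, θ′) = π(θ′)/π(θ) and α(θ′, θ) = 1, so by the symmetry of q,

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta)\, q(\theta' \mid \theta)\, \frac{\pi(\theta')}{\pi(\theta)} = \pi(\theta')\, q(\theta \mid \theta') = \pi(\theta')\, T(\theta' \to \theta).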

This is the mathematical reason why the Metropolis algorithm samples from the posterior in the long run.

Hastings extension for non-symmetric proposals

Hastings generalized the Metropolis algorithm to allow non-symmetric proposal distributions q(θ′ | θ).

The modified acceptance probability is

\alpha(\theta, \theta') = \min\!\left( 1,\, \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} \right) = \min\!\left( 1,\, \frac{p(d \mid \theta')\, p(\theta')\, q(\theta \mid \theta')}{p(d \mid \theta)\, p(\theta)\, q(\theta' \mid \theta)} \right).

With this choice, detailed balance still holds:

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta')\, T(\theta' \to \theta),

where T(θ → θ′) = q(θ′ | θ) α(θ, θ′).

Special case:

  • If q is symmetric, i.e. q(θ′ | θ) = q(θ | θ′), the q-terms cancel and we recover the standard Metropolis acceptance rule.

Diagnosing MCMC: trace plots and autocorrelation

Because MCMC produces correlated samples, we must check whether our chain converged and mixes well.

Two basic diagnostic ideas:

  1. Trace plots
    Plot θ_t against iteration t. A well-mixing chain

    • explores all relevant regions of the posterior,

    • does not get “stuck” in a small sub-region,

    • does not show long-term trends after the burn-in phase.

  2. Autocorrelation function (ACF)
    For a (stationary) stochastic process {X_t} with mean μ and variance σ², the theoretical autocorrelation at lag τ is

    \rho(\tau) = \frac{\operatorname{Cov}(X_t, X_{t+\tau})}{\operatorname{Var}(X_t)} = \frac{\mathbb{E}\big[(X_t - \mu)(X_{t+\tau} - \mu)\big]}{\sigma^2}.

    Properties:

    • ρ(0) = 1 (perfect correlation with itself),

    • for a well-mixing chain, ρ(τ) should drop to near zero for relatively small lags τ,

    • independent (“white noise”) samples have ρ(τ) = 0 for all τ > 0.

In practice, we compute sample autocorrelations from the MCMC output and inspect how fast they decay.
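
A minimal sketch of such a sample-ACF computation (a hand-rolled helper for illustration; libraries like ArviZ provide tuned versions):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag) of a 1-D chain."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x**2)
    return np.array([1.0] + [np.mean(x[:-tau] * x[tau:]) / var
                             for tau in range(1, max_lag + 1)])

# Usage sketch on the Metropolis draws from the earlier example:
# acf = sample_acf(draws, max_lag=50)
```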

Diagnostics: R̂ (Rhat)

A single chain can be misleading. A common practice is to run multiple chains from different starting points and compare them.

The Gelman–Rubin statistic (often denoted R̂) summarizes how much the chains agree with each other.

Suppose we run m chains, each of length n, and let θ_jt be the t-th draw from chain j.

Define:

  • Chain means:

    \bar{\theta}_{j\cdot} = \frac{1}{n} \sum_{t=1}^{n} \theta_{jt},

  • Overall mean:

    \bar{\theta}_{\cdot\cdot} = \frac{1}{m} \sum_{j=1}^{m} \bar{\theta}_{j\cdot},

  • Between-chain variance:

    B = \frac{n}{m-1} \sum_{j=1}^{m} \big(\bar{\theta}_{j\cdot} - \bar{\theta}_{\cdot\cdot}\big)^2,

  • Within-chain variance:

    W = \frac{1}{m} \sum_{j=1}^{m} \left[ \frac{1}{n-1} \sum_{t=1}^{n} \big(\theta_{jt} - \bar{\theta}_{j\cdot}\big)^2 \right].

An estimator of the marginal posterior variance is

\widehat{\operatorname{Var}}(\theta) = \frac{n-1}{n}\, W + \frac{1}{n}\, B.

The potential scale reduction factor is

\hat R = \sqrt{ \frac{\widehat{\operatorname{Var}}(\theta)}{W} }.

Interpretation:

  • If all chains have mixed well and explore the same distribution, we expect B ≈ W and thus R̂ ≈ 1.

  • Values substantially larger than 1 indicate disagreement between chains (lack of convergence).

  • A common rule of thumb: require R̂ < 1.05 for all monitored parameters.
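
The formulas above translate almost line by line into code; a sketch for an array of shape (m, n) holding the m chains:

```python
import numpy as np

def rhat(chains):
    """Gelman–Rubin statistic for an (m, n) array: m chains of n draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)         # theta-bar_{j.}
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)
```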

Diagnostics: effective sample size (ESS)

Because MCMC draws are autocorrelated, N draws from a chain contain less information than N independent samples.

The effective sample size (ESS) quantifies how many independent samples the correlated draws are “worth”.

For a single chain of length N with (theoretical) autocorrelations ρ(τ), the ESS for a scalar parameter can be approximated by

N_{\text{eff}} \approx \frac{N}{ 1 + 2 \sum_{\tau=1}^{\infty} \rho(\tau) }.

  • If the chain is nearly independent (autocorrelations near zero), then N_eff ≈ N.

  • If the chain mixes slowly (large positive autocorrelations), N_eff can be much smaller than N.

A practical rule of thumb from the slides:

  • Aim for a ratio

    \frac{N_{\text{eff}}}{N} > 0.1

    (effective sample size larger than about 10% of the nominal number of draws).
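
A rough sketch of this estimate, reusing the sample_acf helper from above and truncating the sum at the first non-positive autocorrelation (real implementations, e.g. in ArviZ, use more careful truncation rules):

```python
import numpy as np

def ess(x, max_lag=1000):
    """Crude effective-sample-size estimate for a 1-D chain."""
    rho = sample_acf(x, max_lag)[1:]          # rho(1), rho(2), ...
    # Truncate at the first non-positive sample autocorrelation
    cutoff = np.argmax(rho <= 0) if np.any(rho <= 0) else len(rho)
    return len(x) / (1.0 + 2.0 * rho[:cutoff].sum())
```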

In practice, modern software (PyMC, Stan, etc.) computes R̂ and ESS automatically for each parameter.

Advanced MCMC: Hamiltonian Monte Carlo (HMC)

The Metropolis algorithm is simple but can be very inefficient for complex or high-dimensional posteriors:

  • Proposals are uninformed about the shape of the posterior.

  • Large proposed steps are often rejected; small steps lead to strong autocorrelation and slow exploration.

  • Manual tuning of the proposal scale is required.

Hamiltonian Monte Carlo (HMC) addresses these issues by borrowing ideas from classical mechanics.

We introduce:

  • Position: θ (the parameter vector),

  • Momentum: p (an auxiliary variable),

  • Potential energy:

    U(\theta) = -\log p(\theta \mid d),

  • Kinetic energy (for a mass matrix M):

    K(p) = \frac{1}{2}\, p^\top M^{-1} p,

  • Hamiltonian (total energy):

    H(\theta, p) = U(\theta) + K(p).

The joint target distribution over (θ, p) is

\pi(\theta, p) \propto \exp\big(-H(\theta, p)\big) = p(\theta \mid d)\, \exp\!\left(-\tfrac{1}{2}\, p^\top M^{-1} p\right).

Hamilton’s equations of motion are

\frac{d\theta}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial \theta} = -\nabla_\theta U(\theta).

By approximately integrating these equations in time, HMC produces proposals that move along contours of constant energy, exploring the posterior in long, coherent trajectories with relatively low rejection rates.

HMC algorithm: leapfrog integration and Metropolis correction

A sketch of one HMC update step:

  1. Current state: θ_t.

  2. Sample a fresh momentum

    p_t \sim \mathcal{N}(\mathbf{0}, M).

  3. Simulate Hamiltonian dynamics

    • Use a leapfrog integrator with step size ε for s steps to map (θ_t, p_t) to a proposal (θ′, p′).

    • The leapfrog integrator approximately conserves the Hamiltonian H(θ, p) while being reversible and volume-preserving.

  4. Metropolis acceptance step
    Accept or reject the proposal (θ′, p′) with probability

    \alpha = \min\!\left( 1,\, \exp\big(-H(\theta', p') + H(\theta_t, p_t)\big) \right).

    • If accepted, set θ_{t+1} = θ′.

    • If rejected, set θ_{t+1} = θ_t.
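
A compact sketch of one such update, assuming an identity mass matrix M = I and user-supplied callables for the log posterior and its gradient (both hypothetical; θ is a NumPy parameter vector):

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps, s):
    """Leapfrog integration of Hamilton's equations (M = I)."""
    p = p - 0.5 * eps * grad_U(theta)       # initial half step for momentum
    for _ in range(s - 1):
        theta = theta + eps * p             # full step for position
        p = p - eps * grad_U(theta)         # full step for momentum
    theta = theta + eps * p
    p = p - 0.5 * eps * grad_U(theta)       # final half step for momentum
    return theta, p

def hmc_step(theta, log_post, grad_log_post, eps, s, rng):
    """One HMC update; U(theta) = -log_post(theta)."""
    grad_U = lambda th: -grad_log_post(th)
    p = rng.normal(size=theta.shape)                        # 2. fresh momentum
    theta_new, p_new = leapfrog(theta, p, grad_U, eps, s)   # 3. simulate dynamics
    H_old = -log_post(theta) + 0.5 * p @ p
    H_new = -log_post(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < H_old - H_new:               # 4. Metropolis step
        return theta_new
    return theta
```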

Advantages over simple random-walk Metropolis:

  • Much longer steps in parameter space with relatively high acceptance rate.

  • Strongly reduced autocorrelation between successive samples (higher ESS for the same N).

  • Better performance in higher dimensions and for correlated parameters.

However, HMC still requires tuning of step size ε, number of steps s, and the mass matrix M.

NUTS: the No-U-Turn Sampler

The No-U-Turn Sampler (NUTS) is an extension of HMC designed to automate much of the tuning:

  • It adaptively chooses the number of leapfrog steps s so that trajectories do not “turn back” on themselves (no U-turns).

  • It also adapts the step size ε during a warm-up phase.

  • In practice, NUTS is a state-of-the-art general-purpose MCMC algorithm for many Bayesian models.

Most modern probabilistic programming frameworks (PyMC, Stan, etc.) use variants of HMC / NUTS as their default posterior sampling algorithms.

Probabilistic programming with PyMC

Probabilistic programming libraries automate many of the steps in Bayesian analysis:

  • You specify:

    • priors for parameters,

    • a likelihood model for the data,

  • the library handles:

    • construction of the joint and posterior distributions,

    • running MCMC (typically NUTS / HMC),

    • producing diagnostics (trace plots, R̂, ESS),

    • summarizing posterior distributions.

In PyMC, the workflow is roughly:

  1. Define a model using with pm.Model(): and PyMC’s random variable primitives.

  2. Call pm.sample() to run MCMC and obtain a trace (posterior samples).

  3. Use tools like:

    • pm.plot_trace(...) for trace plots,

    • pm.plot_autocorr(...) for autocorrelation,

    • pm.rhat(...) for R̂,

    • pm.ess(...) for effective sample size,

    • az.summary(...) (from ArviZ) for tabular summaries.

This allows you to focus on modeling rather than on implementing MCMC algorithms and diagnostics by hand.
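
As a concrete illustration, a minimal sketch of this workflow for a simple normal model with simulated data (the model and variable names are made up for this example):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated data: 50 observations from a normal with unknown mean and scale
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=50)

with pm.Model():
    # 1. Priors and likelihood
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # 2. Run MCMC (NUTS by default) with several chains
    idata = pm.sample(draws=2000, tune=1000, chains=4)

# 3. Diagnostics and summaries (r_hat and ESS columns included)
print(az.summary(idata, var_names=["mu", "sigma"]))
az.plot_trace(idata)
```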