Numerical Simulations, Markov Chain Monte Carlo

The computational curse of Bayesian statistics

In Bayesian inference we want the posterior distribution of parameters θ given data d:

p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}.

The denominator is the evidence (or marginal likelihood)

p(d) = \int p(d \mid \theta)\, p(\theta)\, d\theta,

or, for multiple parameters θ = (θ_1, …, θ_d),

p(d) = \int_{\mathbb{R}^d} p(d \mid \theta_1,\dots,\theta_d)\, p(\theta_1,\dots,\theta_d)\, d\theta_1 \cdots d\theta_d.

This integral is often not available in closed form except for special conjugate families. Reasons:

  • The prior p(θ) or likelihood p(d | θ) may be multimodal, non-Gaussian, or non-parametric.

  • The parameter space is often high-dimensional (many parameters or latent variables).

  • Realistic models (e.g. hierarchical models) create complicated dependency structures.

Because of this, the central problem of practical Bayesian statistics is numerical approximation of the posterior.

Grid approximation and the curse of dimensionality

A simple numerical approach is grid approximation:

  1. Choose a grid of parameter values {θ^(1), …, θ^(N)}.

  2. Evaluate the unnormalized posterior

    \tilde p(\theta^{(i)} \mid d) = p(d \mid \theta^{(i)})\, p(\theta^{(i)})

    at each grid point.

  3. Normalize:

    p(\theta^{(i)} \mid d) = \frac{\tilde p(\theta^{(i)} \mid d)}{\sum_{j=1}^N \tilde p(\theta^{(j)} \mid d)}.

For a single parameter, a grid can work reasonably well.
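
As an illustration, here is a minimal sketch of the three steps for a hypothetical beta-binomial model (a Beta(2, 2) prior on a coin bias θ and 7 heads in 10 flips; the numbers are made up for this example):

```python
import numpy as np
from scipy import stats

# Hypothetical example: Beta(2, 2) prior on a coin bias, 7 heads in 10 flips
heads, flips = 7, 10

theta_grid = np.linspace(0.0, 1.0, 1000)          # 1. grid of parameter values
prior = stats.beta.pdf(theta_grid, 2, 2)
likelihood = stats.binom.pmf(heads, flips, theta_grid)

unnormalized = likelihood * prior                 # 2. unnormalized posterior
posterior = unnormalized / unnormalized.sum()     # 3. normalize over the grid

print("posterior mean:", np.sum(theta_grid * posterior))
```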

However, for d parameters, if we use N grid points per dimension, we need

N^d

grid points in total.

This is the curse of dimensionality:

  • Work grows exponentially with dimension d.

  • Even modest values (e.g. N = 100, d = 6) become infeasible: 100^6 = 10^12 grid points.

Hence, naive grid approximation is only practical for very low-dimensional problems where we roughly know the region of interest in advance.

Monte Carlo methods and Monte Carlo integration

Monte Carlo methods use random samples to solve deterministic problems such as computing expectations or integrals.

Suppose we want to compute the expectation of a function f(θ) under a distribution with PDF p(θ):

\mathbb{E}_{p}[f(\theta)] = \int f(\theta)\, p(\theta)\, d\theta.

If we can sample independent draws θ^(1), …, θ^(N) ~ p(θ), a Monte Carlo estimator is

\widehat{\mathbb{E}}_{p}[f(\theta)] = \frac{1}{N} \sum_{i=1}^{N} f\big(\theta^{(i)}\big).

By the law of large numbers, this converges to the true expectation as N → ∞.
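
A quick sketch of this estimator: approximate E[θ²] under a standard normal, whose exact value is 1:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000
samples = rng.normal(0.0, 1.0, size=N)   # theta^(i) ~ N(0, 1)
estimate = np.mean(samples**2)           # (1/N) sum_i f(theta^(i)) with f(x) = x^2

print(estimate)                          # close to the exact value 1
```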

For Bayesian inference, we would love to sample from the posterior and approximate posterior expectations this way:

\mathbb{E}[f(\theta) \mid d] = \int f(\theta)\, p(\theta \mid d)\, d\theta \;\approx\; \frac{1}{N} \sum_{i=1}^{N} f\big(\theta^{(i)}\big), \quad \theta^{(i)} \sim p(\theta \mid d).

The problem: we do not know how to sample from p(θ | d) directly. Plain Monte Carlo integration with the prior p(θ) is often very inefficient, because most prior samples fall into regions of very low likelihood.

Markov processes and Markov chains

To construct better sampling algorithms, we introduce Markov processes.

A stochastic process {X_t}, t = 0, 1, 2, …, with values in some state space Ω is a Markov process if it satisfies the Markov property:

P(X_{t+1} = x_{t+1} \mid X_t = x_t, X_{t-1} = x_{t-1}, \dots, X_0 = x_0) = P(X_{t+1} = x_{t+1} \mid X_t = x_t)

for all t and states x_0, …, x_{t+1}.

For a discrete state space Ω = {1, …, K}, the process is characterized by:

  • Transition probabilities

    P_{ij} = P(X_{t+1} = j \mid X_t = i),

    collected in a transition matrix P with entries P_ij ≥ 0 and ∑_j P_ij = 1.

A sequence of states X_0, X_1, X_2, … generated by such a process is called a Markov chain.

The key idea of MCMC is to construct a Markov chain whose stationary distribution is the posterior p(θ | d). The chain then spends time in different regions of parameter space in proportion to their posterior probability.
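
As a small illustration of a stationary distribution (a hypothetical 3-state chain, unrelated to any posterior), the long-run fraction of time the chain spends in each state matches the eigenvector-based stationary distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

# Rows sum to 1: P[i, j] = P(X_{t+1} = j | X_t = i)
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

# Simulate the chain and record how often each state is visited
T, x = 100_000, 0
visits = np.zeros(3)
for _ in range(T):
    x = rng.choice(3, p=P[x])
    visits[x] += 1
print("empirical occupancy:    ", visits / T)

# Stationary distribution: left eigenvector of P for eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("stationary distribution:", pi / pi.sum())
```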

Random walks as simple continuous Markov processes

A simple continuous-state Markov process is a random walk.

One-dimensional example:

\theta_{t+1} = \theta_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma^2),

where each ε_t is an independent Gaussian “step”. The transition density (proposal) is

q(\theta' \mid \theta) = \mathcal{N}(\theta'; \theta, \sigma^2).

Multivariate example in ℝ^d:

\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \Sigma),

with covariance matrix Σ controlling the shape and step sizes in different directions.

Random walks are easy to simulate, but by themselves they do not target the posterior. We need to bias the random walk so that it explores the posterior distribution.
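
Simulating the unbiased walk itself takes only a couple of lines (a sketch with an arbitrary step size σ):

```python
import numpy as np

rng = np.random.default_rng(2)

T, sigma = 1_000, 0.5
steps = rng.normal(0.0, sigma, size=T)   # eps_t ~ N(0, sigma^2)
theta = np.cumsum(steps)                 # theta_{t+1} = theta_t + eps_t
```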

The Metropolis algorithm

The Metropolis algorithm is a classic Markov Chain Monte Carlo (MCMC) method. It constructs a Markov chain {θ_t} whose stationary distribution is the posterior p(θ | d).

We assume we can evaluate the unnormalized posterior

\tilde p(\theta \mid d) = p(d \mid \theta)\, p(\theta),

up to a constant factor (we do not need p(d)).

Algorithm outline:

  1. Initialization
    Choose an initial parameter value θ_0.

  2. For t = 0, 1, 2, …:

    (a) Propose a new state
    Sample a proposal

    \theta' \sim q(\theta' \mid \theta_t),

    e.g. a random walk proposal q(θ′ | θ_t) = N(θ′; θ_t, σ²).

    (b) Compute acceptance probability
    For a symmetric proposal density (q(θ′ | θ) = q(θ | θ′)), define

    \alpha(\theta_t, \theta') = \min\!\left( 1,\, \frac{\tilde p(\theta' \mid d)}{\tilde p(\theta_t \mid d)} \right) = \min\!\left( 1,\, \frac{p(d \mid \theta')\, p(\theta')}{p(d \mid \theta_t)\, p(\theta_t)} \right).

    (c) Accept or reject
    Draw u ~ Uniform(0, 1) and set

    \theta_{t+1} = \begin{cases} \theta', & \text{if } u \le \alpha(\theta_t, \theta'), \\ \theta_t, & \text{otherwise.} \end{cases}

Intuition:

  • Moves to higher posterior density are always accepted (ratio ≥ 1).

  • Moves to lower posterior density are sometimes accepted, with probability equal to the ratio of posterior densities.

  • The resulting Markov chain visits regions in proportion to their posterior probability mass.
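
Putting the steps together, here is a minimal NumPy sketch of random-walk Metropolis. It works with the log of the unnormalized posterior for numerical stability; log_post is any user-supplied callable (an assumption of this sketch, not a fixed interface):

```python
import numpy as np

def metropolis(log_post, theta0, sigma, n_samples, seed=0):
    """Random-walk Metropolis for a scalar parameter.

    log_post(theta) returns log p(d | theta) + log p(theta)
    (the unnormalized log posterior; p(d) is never needed).
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(n_samples)
    for t in range(n_samples):
        proposal = rng.normal(theta, sigma)               # (a) symmetric proposal
        log_alpha = log_post(proposal) - log_post(theta)  # (b) log acceptance ratio
        if np.log(rng.uniform()) < log_alpha:             # (c) accept or reject
            theta = proposal
        samples[t] = theta
    return samples

# Usage sketch: target a standard normal "posterior"
draws = metropolis(lambda th: -0.5 * th**2, theta0=0.0, sigma=1.0, n_samples=10_000)
```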

Acceptance probability and detailed balance

Let π(θ) = p(θ | d) denote the target posterior (up to normalization). The Metropolis algorithm defines transition probabilities

  • For θ′ ≠ θ:

    T(\theta \to \theta') = q(\theta' \mid \theta)\, \alpha(\theta, \theta'),

  • For staying at the same point:

    T(\theta \to \theta) = 1 - \int T(\theta \to \theta')\, d\theta'.

For a symmetric proposal density q(θ′ | θ) = q(θ | θ′) and acceptance probability

\alpha(\theta, \theta') = \min\!\left( 1,\, \frac{\pi(\theta')}{\pi(\theta)} \right),

one can verify the detailed balance condition

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta')\, T(\theta' \to \theta)

for all θ, θ′. Detailed balance implies that π(θ) is a stationary distribution of the Markov chain.
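
To see why detailed balance holds, suppose without loss of generality that π(θ′) ≤ π(θ). Then α(θ, θ′) = π(θ′)/π(θ) and α(θ′, θ) = 1, so by the symmetry of q,

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta)\, q(\theta' \mid \theta)\, \frac{\pi(\theta')}{\pi(\theta)} = \pi(\theta')\, q(\theta \mid \theta') = \pi(\theta')\, T(\theta' \to \theta).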

This is the mathematical reason why the Metropolis algorithm samples from the posterior in the long run.

Hastings extension for non-symmetric proposals

Hastings generalized the Metropolis algorithm to allow non-symmetric proposal distributions q(θ′ | θ).

The modified acceptance probability is

\alpha(\theta, \theta') = \min\!\left( 1,\, \frac{\pi(\theta')\, q(\theta \mid \theta')}{\pi(\theta)\, q(\theta' \mid \theta)} \right) = \min\!\left( 1,\, \frac{p(d \mid \theta')\, p(\theta')\, q(\theta \mid \theta')}{p(d \mid \theta)\, p(\theta)\, q(\theta' \mid \theta)} \right).

With this choice, detailed balance still holds:

\pi(\theta)\, T(\theta \to \theta') = \pi(\theta')\, T(\theta' \to \theta),

where T(θ → θ′) = q(θ′ | θ) α(θ, θ′).

Special case:

  • If q is symmetric, i.e. q(θ′ | θ) = q(θ | θ′), the q-terms cancel and we recover the standard Metropolis acceptance rule.

Diagnosing MCMC: trace plots and autocorrelation

Because MCMC produces correlated samples, we must check whether our chain converged and mixes well.

Two basic diagnostic ideas:

  1. Trace plots
    Plot θ_t against iteration t. A well-mixing chain

    • explores all relevant regions of the posterior,

    • does not get “stuck” in a small sub-region,

    • does not show long-term trends after the burn-in phase.

  2. Autocorrelation function (ACF)
    For a (stationary) stochastic process {X_t} with mean μ and variance σ², the theoretical autocorrelation at lag τ is

    \rho(\tau) = \frac{\operatorname{Cov}(X_t, X_{t+\tau})}{\operatorname{Var}(X_t)} = \frac{\mathbb{E}\big[(X_t - \mu)(X_{t+\tau} - \mu)\big]}{\sigma^2}.

    Properties:

    • ρ(0) = 1 (perfect correlation with itself),

    • for a well-mixing chain, ρ(τ) should drop to near zero for relatively small lags τ,

    • independent (“white noise”) samples have ρ(τ) = 0 for all τ > 0.

In practice, we compute sample autocorrelations from the MCMC output and inspect how fast they decay.
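
A minimal sketch of such a sample-ACF computation (a hand-rolled helper for illustration; libraries like ArviZ provide tuned versions):

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelations rho(0), ..., rho(max_lag) of a 1-D chain."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.mean(x**2)
    return np.array([1.0] + [np.mean(x[:-tau] * x[tau:]) / var
                             for tau in range(1, max_lag + 1)])

# Usage sketch on the Metropolis draws from the earlier example:
# acf = sample_acf(draws, max_lag=50)
```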

Diagnostics: R̂ (Rhat)

A single chain can be misleading. A common practice is to run multiple chains from different starting points and compare them.

The Gelman–Rubin statistic (often denoted R̂) summarizes how much the chains agree with each other.

Suppose we run m chains, each of length n, and let θ_jt be the t-th draw from chain j.

Define:

  • Chain means:

    \bar{\theta}_{j\cdot} = \frac{1}{n} \sum_{t=1}^{n} \theta_{jt},

  • Overall mean:

    \bar{\theta}_{\cdot\cdot} = \frac{1}{m} \sum_{j=1}^{m} \bar{\theta}_{j\cdot},

  • Between-chain variance:

    B = \frac{n}{m-1} \sum_{j=1}^{m} \big(\bar{\theta}_{j\cdot} - \bar{\theta}_{\cdot\cdot}\big)^2,

  • Within-chain variance:

    W = \frac{1}{m} \sum_{j=1}^{m} \left[ \frac{1}{n-1} \sum_{t=1}^{n} \big(\theta_{jt} - \bar{\theta}_{j\cdot}\big)^2 \right].

An estimator of the marginal posterior variance is

\widehat{\operatorname{Var}}(\theta) = \frac{n-1}{n}\, W + \frac{1}{n}\, B.

The potential scale reduction factor is

\hat R = \sqrt{ \frac{\widehat{\operatorname{Var}}(\theta)}{W} }.

Interpretation:

  • If all chains have mixed well and explore the same distribution, we expect B ≈ W and thus R̂ ≈ 1.

  • Values substantially larger than 1 indicate disagreement between chains (lack of convergence).

  • A common rule of thumb: require R̂ < 1.05 for all monitored parameters.
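
The formulas above translate almost line by line into code; a sketch for an array of shape (m, n) holding the m chains:

```python
import numpy as np

def rhat(chains):
    """Gelman–Rubin statistic for an (m, n) array: m chains of n draws."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)         # theta-bar_{j.}
    B = n * chain_means.var(ddof=1)           # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()     # within-chain variance
    var_hat = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_hat / W)
```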

Diagnostics: effective sample size (ESS)

Because MCMC draws are autocorrelated, N draws from a chain contain less information than N independent samples.

The effective sample size (ESS) quantifies how many independent samples the correlated draws are “worth”.

For a single chain of length N with (theoretical) autocorrelations ρ(τ), the ESS for a scalar parameter can be approximated by

N_{\text{eff}} \approx \frac{N}{ 1 + 2 \sum_{\tau=1}^{\infty} \rho(\tau) }.

  • If the chain is nearly independent (autocorrelations near zero), then N_eff ≈ N.

  • If the chain mixes slowly (large positive autocorrelations), N_eff can be much smaller than N.

A practical rule of thumb from the slides:

  • Aim for a ratio

    \frac{N_{\text{eff}}}{N} > 0.1

    (effective sample size larger than about 10% of the nominal number of draws).
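
A rough sketch of this estimate, reusing the sample_acf helper from above and truncating the sum at the first non-positive autocorrelation (real implementations, e.g. in ArviZ, use more careful truncation rules):

```python
import numpy as np

def ess(x, max_lag=1000):
    """Crude effective-sample-size estimate for a 1-D chain."""
    rho = sample_acf(x, max_lag)[1:]          # rho(1), rho(2), ...
    # Truncate at the first non-positive sample autocorrelation
    cutoff = np.argmax(rho <= 0) if np.any(rho <= 0) else len(rho)
    return len(x) / (1.0 + 2.0 * rho[:cutoff].sum())
```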

In practice, modern software (PyMC, Stan, etc.) computes R̂ and ESS automatically for each parameter.

Advanced MCMC: Hamiltonian Monte Carlo (HMC)

The Metropolis algorithm is simple but can be very inefficient for complex or high-dimensional posteriors:

  • Proposals are uninformed about the shape of the posterior.

  • Large proposed steps are often rejected; small steps lead to strong autocorrelation and slow exploration.

  • Manual tuning of the proposal scale is required.

Hamiltonian Monte Carlo (HMC) addresses these issues by borrowing ideas from classical mechanics.

We introduce:

  • Position: θ (the parameter vector),

  • Momentum: p (an auxiliary variable),

  • Potential energy:

    U(\theta) = -\log p(\theta \mid d),

  • Kinetic energy (for a mass matrix M):

    K(p) = \frac{1}{2}\, p^\top M^{-1} p,

  • Hamiltonian (total energy):

    H(\theta, p) = U(\theta) + K(p).

The joint target distribution over (θ, p) is

\pi(\theta, p) \propto \exp\big(-H(\theta, p)\big) = p(\theta \mid d)\, \exp\!\left(-\tfrac{1}{2}\, p^\top M^{-1} p\right).

Hamilton’s equations of motion are

\frac{d\theta}{dt} = \frac{\partial H}{\partial p} = M^{-1} p, \qquad \frac{dp}{dt} = -\frac{\partial H}{\partial \theta} = -\nabla_\theta U(\theta).

By approximately integrating these equations in time, HMC produces proposals that move along contours of constant energy, exploring the posterior in long, coherent trajectories with relatively low rejection rates.

HMC algorithm: leapfrog integration and Metropolis correction

A sketch of one HMC update step:

  1. Current state: θ_t.

  2. Sample a fresh momentum

    p_t \sim \mathcal{N}(\mathbf{0}, M).

  3. Simulate Hamiltonian dynamics

    • Use a leapfrog integrator with step size ε for s steps to map (θ_t, p_t) to a proposal (θ′, p′).

    • The leapfrog integrator approximately conserves the Hamiltonian H(θ, p) while being reversible and volume-preserving.

  4. Metropolis acceptance step
    Accept or reject the proposal (θ′, p′) with probability

    \alpha = \min\!\left( 1,\, \exp\big(-H(\theta', p') + H(\theta_t, p_t)\big) \right).

    • If accepted, set θ_{t+1} = θ′.

    • If rejected, set θ_{t+1} = θ_t.
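
A compact sketch of one such update, assuming an identity mass matrix M = I and user-supplied callables for the log posterior and its gradient (both hypothetical; θ is a NumPy parameter vector):

```python
import numpy as np

def leapfrog(theta, p, grad_U, eps, s):
    """Leapfrog integration of Hamilton's equations (M = I)."""
    p = p - 0.5 * eps * grad_U(theta)       # initial half step for momentum
    for _ in range(s - 1):
        theta = theta + eps * p             # full step for position
        p = p - eps * grad_U(theta)         # full step for momentum
    theta = theta + eps * p
    p = p - 0.5 * eps * grad_U(theta)       # final half step for momentum
    return theta, p

def hmc_step(theta, log_post, grad_log_post, eps, s, rng):
    """One HMC update; U(theta) = -log_post(theta)."""
    grad_U = lambda th: -grad_log_post(th)
    p = rng.normal(size=theta.shape)                        # 2. fresh momentum
    theta_new, p_new = leapfrog(theta, p, grad_U, eps, s)   # 3. simulate dynamics
    H_old = -log_post(theta) + 0.5 * p @ p
    H_new = -log_post(theta_new) + 0.5 * p_new @ p_new
    if np.log(rng.uniform()) < H_old - H_new:               # 4. Metropolis step
        return theta_new
    return theta
```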

Advantages over simple random-walk Metropolis:

  • Much longer steps in parameter space with relatively high acceptance rate.

  • Strongly reduced autocorrelation between successive samples (higher ESS for the same N).

  • Better performance in higher dimensions and for correlated parameters.

However, HMC still requires tuning of step size ε, number of steps s, and the mass matrix M.

NUTS: the No-U-Turn Sampler

The No-U-Turn Sampler (NUTS) is an extension of HMC designed to automate much of the tuning:

  • It adaptively chooses the number of leapfrog steps s so that trajectories do not “turn back” on themselves (no U-turns).

  • It also adapts the step size ε during a warm-up phase.

  • In practice, NUTS is a state-of-the-art general-purpose MCMC algorithm for many Bayesian models.

Most modern probabilistic programming frameworks (PyMC, Stan, etc.) use variants of HMC / NUTS as their default posterior sampling algorithms.

Probabilistic programming with PyMC

Probabilistic programming libraries automate many of the steps in Bayesian analysis:

  • You specify:

    • priors for parameters,

    • a likelihood model for the data,

  • the library handles:

    • construction of the joint and posterior distributions,

    • running MCMC (typically NUTS / HMC),

    • producing diagnostics (trace plots, R̂, ESS),

    • summarizing posterior distributions.

In PyMC, the workflow is roughly:

  1. Define a model using with pm.Model(): and PyMC’s random variable primitives.

  2. Call pm.sample() to run MCMC and obtain a trace (posterior samples).

  3. Use tools like:

    • pm.plot_trace(...) for trace plots,

    • pm.plot_autocorr(...) for autocorrelation,

    • pm.rhat(...) for R̂,

    • pm.ess(...) for effective sample size,

    • az.summary(...) (from ArviZ) for tabular summaries.

This allows you to focus on modeling rather than on implementing MCMC algorithms and diagnostics by hand.
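
As a concrete illustration, a minimal sketch of this workflow for a simple normal model with simulated data (the model and variable names are made up for this example):

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated data: 50 observations from a normal with unknown mean and scale
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=50)

with pm.Model():
    # 1. Priors and likelihood
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # 2. Run MCMC (NUTS by default) with several chains
    idata = pm.sample(draws=2000, tune=1000, chains=4)

# 3. Diagnostics and summaries (r_hat and ESS columns included)
print(az.summary(idata, var_names=["mu", "sigma"]))
az.plot_trace(idata)
```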