
Generative Modelling

What generative modelling aims to do

Typical uses: density estimation, outlier detection, representation learning, and as a foundation for conditional generation $p(x \mid y)$.

Generative modelling sits within unsupervised learning: we observe data $x$ without labels and learn hidden structure and dependencies.

Taxonomy of generative models

Explicit density models

Explicit density models define a parametric form for $p_\theta(x)$:

Tractable density

Autoregressive: factorize with the chain rule:

$$p(x) = \prod_{i=1}^{D} p(x_i \mid x_{<i}).$$

The likelihood is exact and tractable.

Approximate density

  1. Variational inference: optimize a lower bound (ELBO) when $p_\theta(x)$ is intractable to integrate.

  2. Flow-based: use an invertible $f_\theta$ to map $x \leftrightarrow z$ so that

    $$\log p_X(x) = \log p_Z(f_\theta(x)) + \log\!\left|\det \frac{\partial f_\theta}{\partial x}\right|.$$
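The change-of-variables formula can be checked numerically. Below is a minimal sketch (not from the notes) using a 1-D affine flow with hypothetical parameters `a`, `b` and a standard-normal base distribution:

```python
import numpy as np

# 1-D affine flow z = f(x) = (x - b) / a, so df/dx = 1/a.
a, b = 2.0, 1.0  # hypothetical flow parameters

def log_normal(z):
    # Log-density of the standard-normal base p_Z.
    return -0.5 * (z**2 + np.log(2 * np.pi))

def flow_log_density(x):
    # Change of variables: log p_X(x) = log p_Z(f(x)) + log |df/dx|.
    z = (x - b) / a
    log_det = -np.log(abs(a))
    return log_normal(z) + log_det

# Sanity check: this must equal the density of N(b, a^2) computed directly.
x = 0.3
direct = -0.5 * (((x - b) / a)**2 + np.log(2 * np.pi * a**2))
assert np.isclose(flow_log_density(x), direct)
```

The log-determinant term accounts for how the flow stretches or compresses volume; without it the result would not be a normalized density.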

Implicit density models

Implicit density models don’t give $p_\theta(x)$ explicitly but can generate samples:

  1. GANs learn a sampler $x = g_\theta(z)$ via adversarial training.

  2. Diffusion models learn to reverse a noising process; great sample quality but slow sampling.
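The noising process that diffusion models learn to reverse has a simple closed form. A minimal sketch, assuming a standard variance-preserving formulation with a hypothetical linear noise schedule (illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Forward (noising) process in closed form:
#   x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I)
betas = np.linspace(1e-4, 0.02, 1000)   # hypothetical linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def noise(x0, t):
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x0 = np.ones(5)
assert alpha_bar[0] > 0.999    # early steps keep almost all signal
assert alpha_bar[-1] < 1e-4    # by the last step the signal is nearly destroyed
```

Sampling must invert this chain one step at a time, which is why generation is slow even though the forward process is cheap.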

Rules of Thumb

  • Variational Autoencoders (VAE): quick to sample, easy to train, produce blurry images

  • GANs: quick to sample, hard to train, produce high-quality images

  • Diffusion models: expensive to sample, easy to train, produce high-quality images

Discriminative vs generative vs conditional generative

  • Discriminative models learn $p_\theta(y \mid x)$ (classification, detection, segmentation). Labels compete for probability mass.

  • Generative models learn $p_\theta(x)$, where all possible inputs compete for mass (lets us reject unlikely inputs).

  • Conditional generative models learn $p_\theta(x \mid y)$; for each label $y$, images compete for mass within that class.

Bayes Formula

$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)}$$

  • Prior: $P(\theta)$, initial probability of the model

  • Likelihood: $P(x \mid \theta)$, probability of the data given the model

  • Posterior: $P(\theta \mid x)$, probability of the model given the data

  • Marginal Likelihood / Evidence: $P(x)$, total probability of the data under all possible models
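As a quick numerical check of the formula, a sketch with two candidate models (all numbers are illustrative, not from the notes):

```python
# Two hypothetical models with priors and likelihoods P(x | theta).
prior = {"theta1": 0.5, "theta2": 0.5}
likelihood = {"theta1": 0.8, "theta2": 0.2}

# Evidence: total probability of the data under all models.
evidence = sum(prior[t] * likelihood[t] for t in prior)

# Posterior via Bayes' formula.
posterior = {t: likelihood[t] * prior[t] / evidence for t in prior}

# Posteriors form a valid distribution, and the better-fitting
# model gains mass: P(theta1 | x) = 0.4 / 0.5 = 0.8.
assert abs(sum(posterior.values()) - 1.0) < 1e-12
assert abs(posterior["theta1"] - 0.8) < 1e-12
```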

Bayes Formula for Generative Models

$$P(x \mid y) = \frac{P(y \mid x)\,P(x)}{P(y)}$$

  • Prior, i.e. generative model: $P(x)$ (e.g. a GAN, VAE, ...)

  • Likelihood, i.e. discriminative model: $P(y \mid x)$, how likely a label is given the data

  • Posterior, i.e. conditional generative model: $P(x \mid y)$, probability of the data given a certain class

  • Marginal Likelihood / Evidence: $P(y)$, frequency of occurrence of the label

Shannon Entropy

The Shannon entropy of a discrete distribution $p$ measures its inherent uncertainty:

$$H(p) = -\sum_{x} p(x)\,\log p(x).$$

  • Surprise (self-information) of an outcome: $I(x) = -\log p(x)$

    • rare outcomes ($p(x)$ small, blue regions) are high-surprise

    • common outcomes ($p(x)$ large, red regions) are low-surprise

  • Entropy (average uncertainty) of the distribution measures typical unpredictability across all outcomes: $H(p) = \mathbb{E}_{x\sim p}\big[-\log p(x)\big]$

Left plot — High entropy: The distribution is broad/multimodal, so average unpredictability is high. A box over a blue, low-density region marks a high-surprise event; a box over a red, high-density region marks a low-surprise event.

Right plot — Low entropy: The distribution is sharply peaked; most mass sits near the center, so average unpredictability is low. Tail events would still be surprising, but they’re rare and contribute little to the average.
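The broad-versus-peaked contrast in the plots can be made concrete with two small discrete distributions (illustrative, not the ones plotted):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    p = np.asarray(p)
    p = p[p > 0]  # 0 log 0 is taken as 0
    return -np.sum(p * np.log(p))

broad = np.full(8, 1 / 8)               # broad: all 8 outcomes equally likely
peaked = np.array([0.93] + [0.01] * 7)  # peaked: most mass on one outcome

# A uniform distribution is maximally unpredictable: H = log 8.
assert np.isclose(entropy(broad), np.log(8))
# The peaked distribution has much lower average uncertainty.
assert entropy(peaked) < entropy(broad)
```

Note that a tail outcome of the peaked distribution is individually very surprising ($-\log 0.01 \approx 4.6$ nats), yet contributes little to the average, exactly as described for the right plot.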

Cross-entropy

Cross-entropy between the true distribution $p$ and an estimate $q$:

$$H(p, q) = -\sum_x p(x)\,\log q(x).$$

It is commonly used as a classification loss (negative log-likelihood).

KL divergence

The KL divergence (not a metric):

$$D_{\mathrm{KL}}(p \parallel q) = \sum_x p(x)\,\log\frac{p(x)}{q(x)} \;\ge 0,$$

is asymmetric and violates the triangle inequality. Minimizing $D_{\mathrm{KL}}(p \parallel q)$ fits $q$ to $p$.
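The three quantities above are tied together by the identity $H(p,q) = H(p) + D_{\mathrm{KL}}(p \parallel q)$, which is easy to verify numerically (distributions are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])  # "true" distribution
q = np.array([0.4, 0.4, 0.2])  # model estimate

entropy = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

# Cross-entropy decomposes as entropy plus KL divergence.
assert np.isclose(cross_entropy, entropy + kl_pq)
# KL is non-negative and asymmetric.
assert kl_pq >= 0
assert not np.isclose(kl_pq, kl_qp)
```

This identity is why minimizing cross-entropy over $q$ is equivalent to minimizing $D_{\mathrm{KL}}(p \parallel q)$: the $H(p)$ term does not depend on $q$.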

Likelihood Function

Gaussian model and maximum likelihood

Assume $x_1,\dots,x_N$ are i.i.d. from $\mathcal{N}(\mu,\sigma^2)$.

  • Likelihood and log-likelihood:

    $$\mathcal{L}(\mu,\sigma^2) = \prod_{n=1}^N \mathcal{N}(x_n \mid \mu,\sigma^2),\qquad \ell(\mu,\sigma^2)=\sum_{n=1}^N \log \mathcal{N}(x_n \mid \mu,\sigma^2).$$

  • Maximum Likelihood Estimate (MLE) (set derivatives to zero):

    $$\mu_{\text{ML}} = \frac{1}{N}\sum_{n=1}^N x_n,\qquad \sigma^2_{\text{ML}} = \frac{1}{N}\sum_{n=1}^N (x_n-\mu_{\text{ML}})^2.$$
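The closed-form estimates can be checked on synthetic data; a minimal sketch with illustrative true parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # i.i.d. draws from N(2, 1.5^2)

# Gaussian MLE: sample mean and (biased, 1/N) sample variance.
mu_ml = x.mean()
sigma2_ml = np.mean((x - mu_ml) ** 2)

# With many samples the estimates approach the true parameters.
assert abs(mu_ml - 2.0) < 0.1
assert abs(sigma2_ml - 1.5**2) < 0.2
```

Note that the MLE variance divides by $N$, not $N-1$; it is the maximizer of the likelihood, which is slightly biased for finite samples.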

From single Gaussians to mixtures

A single Gaussian fails on multi-modal data. Increase flexibility with a mixture of $K$ Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x \mid \mu_k,\Sigma_k),\qquad \sum_k \pi_k = 1,\ \pi_k \ge 0.$$

Direct MLE is hard because of the log-sum in $\ell(\theta)=\sum_n \log \sum_k \pi_k\,\mathcal{N}(x_n \mid \mu_k,\Sigma_k)$.

Latent variables and EM for GMMs

Introduce latent assignments $z_n \in \{1,\dots,K\}$ and work with responsibilities

$$\gamma_{nk} \equiv p(z_n=k \mid x_n) = \frac{\pi_k\,\mathcal{N}(x_n \mid \mu_k,\Sigma_k)}{\sum_{j=1}^K \pi_j\,\mathcal{N}(x_n \mid \mu_j,\Sigma_j)}.$$

Expectation–Maximization (EM) iterates:

  • E-step: compute $\gamma_{nk}$ for all $n,k$ using current parameters.

  • M-step: update parameters using soft counts $N_k=\sum_n \gamma_{nk}$:

    $$\mu_k \leftarrow \frac{1}{N_k}\sum_{n} \gamma_{nk} x_n,\quad \Sigma_k \leftarrow \frac{1}{N_k}\sum_{n} \gamma_{nk}(x_n-\mu_k)(x_n-\mu_k)^\top,\quad \pi_k \leftarrow \frac{N_k}{N}.$$

    Each iteration does not decrease the data log-likelihood and converges to a local optimum.
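The E- and M-steps above can be sketched for a 1-D, two-component mixture; data and initialization are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two well-separated Gaussians.
x = np.concatenate([rng.normal(-3.0, 1.0, 300), rng.normal(3.0, 1.0, 300)])
N, K = len(x), 2

# Hypothetical starting parameters.
pi = np.full(K, 1 / K)
mu = np.array([-1.0, 1.0])
sigma2 = np.ones(K)

def component_densities():
    # N(x_n | mu_k, sigma_k^2) weighted by pi_k, shape (N, K).
    return pi * np.exp(-0.5 * (x[:, None] - mu)**2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

prev = np.sum(np.log(component_densities().sum(axis=1)))
for _ in range(50):
    # E-step: responsibilities gamma_nk = p(z_n = k | x_n).
    dens = component_densities()
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update with soft counts N_k = sum_n gamma_nk.
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    sigma2 = (gamma * (x[:, None] - mu)**2).sum(axis=0) / Nk
    pi = Nk / N

    # Monotonicity: the log-likelihood never decreases.
    cur = np.sum(np.log(component_densities().sum(axis=1)))
    assert cur >= prev - 1e-9
    prev = cur

# The estimated means recover the true modes (up to label order).
lo, hi = sorted(mu)
assert abs(lo - (-3.0)) < 0.3 and abs(hi - 3.0) < 0.3
```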

Variational perspective and the ELBO

For latent-variable models, the marginal likelihood $\log p_\theta(x)=\log\int p_\theta(x,z)\,dz$ is intractable. Introduce a tractable $q_\phi(z \mid x)$ and use:

$$\log p_\theta(x) = \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x,z)-\log q_\phi(z\mid x)\right]}_{\text{ELBO } \mathcal{L}(\theta,\phi)} + \underbrace{D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\Vert\,p_\theta(z\mid x)\right)}_{\ge 0}.$$

Thus $\mathcal{L}(\theta,\phi)\le \log p_\theta(x)$; maximizing the ELBO tightens the bound and implicitly makes $q_\phi$ approximate the true posterior.
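The decomposition can be verified exactly for a toy model with a binary latent, where the integral is a two-term sum (all probabilities below are made up for illustration):

```python
import numpy as np

# Toy joint p(x, z) for one observed x and a binary latent z.
p_xz = np.array([0.3, 0.1])      # p(x, z=0), p(x, z=1)
log_px = np.log(p_xz.sum())      # exact marginal log-likelihood

# An arbitrary variational distribution q(z | x).
q = np.array([0.6, 0.4])

# ELBO = E_q[log p(x, z) - log q(z | x)].
elbo = np.sum(q * (np.log(p_xz) - np.log(q)))

# KL between q and the true posterior p(z | x).
true_post = p_xz / p_xz.sum()
kl = np.sum(q * np.log(q / true_post))

# ELBO + KL(q || posterior) = log p(x), and the ELBO is a lower bound.
assert np.isclose(elbo + kl, log_px)
assert elbo <= log_px
```

Setting `q = true_post` makes the KL term vanish and the bound tight, which is what maximizing the ELBO over $\phi$ works toward.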

Autoregressive density models (PixelRNN)

Use the chain rule to make the likelihood tractable:

$$p(x) = \prod_{i=1}^D p(x_i \mid x_{<i}).$$

  • Architecture idea: an RNN maintains a hidden state that summarizes the context $x_{<i}$.

  • Training: maximize log-likelihood (equivalently, minimize cross-entropy). In practice use teacher forcing: feed the ground-truth prefix $x_{<i}$ when predicting $x_i$ to avoid error accumulation.

  • Inference: sample sequentially from $p(x_i \mid x_{<i})$ (stochastic; slower than parallel decoders).
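Sequential sampling and the exact likelihood can be sketched with a toy conditional in place of a learned RNN (the transition probabilities are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy autoregressive model over binary pixels: p(x_i = 1 | x_{<i}) depends
# only on the previous pixel (a stand-in for an RNN's hidden state).
def p_next_is_one(prev):
    return 0.9 if prev == 1 else 0.1  # hypothetical learned probabilities

def sample(D=16):
    x = [int(rng.random() < 0.5)]  # p(x_1 = 1) = 0.5
    for _ in range(D - 1):
        # Inference is inherently sequential: each pixel needs the prefix.
        x.append(int(rng.random() < p_next_is_one(x[-1])))
    return x

def log_likelihood(x):
    # Exact log p(x) = sum_i log p(x_i | x_{<i}) via the chain rule.
    ll = np.log(0.5)
    for prev, cur in zip(x, x[1:]):
        p1 = p_next_is_one(prev)
        ll += np.log(p1 if cur == 1 else 1 - p1)
    return ll

x = sample()
assert len(x) == 16
assert log_likelihood(x) <= 0  # exact, tractable log-likelihood
```

The loop in `sample` is the reason sampling is slow: each step waits for the previous output, whereas training can evaluate all conditionals in parallel given the ground-truth prefix.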

Pros

  • Exact likelihood (good for model comparison),

  • High-quality samples with sufficient capacity.

Cons

  • Limited parallelism; slow sampling,

  • Potential bias from raster order; capturing global context can be challenging.