
Hierarchical Models

Probabilistic graphical model notation

Probabilistic graphical models (PGMs) are graphs that encode conditional dependence structure between variables.

Basic notation from the slides:

  • Observed variable: usually drawn as a shaded node.

  • Latent (unobserved) variable: drawn as an unshaded node.

  • Deterministic variable: drawn as a node with a different border (function of parents).

  • Repeated structure (“plate”): a box with an index indicating repetition, e.g. over observations or groups.

Rule of thumb:

Every unobserved variable that has no incoming arrows needs a prior.

Bayesian inference then provides posterior distributions for all unobserved variables.

Canonical examples:

  • Beta–binomial model:

    • Prior:

      $$\pi \sim \operatorname{Beta}(\alpha,\beta)$$

    • Likelihood:

      $$y \mid \pi \sim \operatorname{Bin}(n,\pi)$$

  • Gamma–Poisson model:

    • Prior:

      $$\lambda \sim \operatorname{Gamma}(s,r)$$

    • Likelihood:

      $$y \mid \lambda \sim \operatorname{Pois}(\lambda)$$

  • Normal likelihood with unknown mean and variance:

    • Priors:

      $$\mu \sim \text{some prior}, \qquad \sigma \sim \text{some positive prior},$$

    • Likelihood (for data points $y_i$):

      $$y_i \mid \mu,\sigma \sim \mathcal{N}(\mu,\sigma^2).$$

  • Simple linear regression:

    • Parameters: intercept $\beta_0$, slope $\beta_1$, noise $\sigma$.

    • Mean function:

      $$\mu_i = \beta_0 + \beta_1 x_i.$$

    • Likelihood:

      $$y_i \mid \beta_0,\beta_1,\sigma,x_i \sim \mathcal{N}(\mu_i,\sigma^2).$$

In all cases, the PGM makes explicit which variables depend on which, and where priors must be specified.
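As a concrete illustration, here is a minimal PyMC sketch of the beta–binomial model; the observed count, the number of trials, and the prior hyperparameters are made-up values chosen purely for illustration.

```python
import pymc as pm

# Beta-binomial model: y successes out of n trials (illustrative values).
y_obs, n = 6, 20

with pm.Model() as beta_binomial:
    # Prior on the success probability: an unobserved root node, so it needs a prior.
    pi = pm.Beta("pi", alpha=2, beta=2)
    # Likelihood: the shaded (observed) node in the PGM, conditioned on pi.
    y = pm.Binomial("y", n=n, p=pi, observed=y_obs)
    idata = pm.sample()
```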

Grouped data and hierarchical structure

Many real datasets have a grouped or multilevel structure.

Examples from the slides:

  • Cancer rates:

    • Counties nested within states, states nested within the USA.

  • Medical data:

    • Repeated measurements nested within patients, patients nested within a population.

Such data naturally form hierarchies:

  • Level 1: individual observations (e.g. county, weekly measurement),

  • Level 2: groups (e.g. state, patient),

  • Level 3: higher-level population (e.g. country, disease level).

We would like models that:

  • Respect the fact that observations within the same group are related,

  • Allow information sharing across groups,

  • Give reasonable predictions for groups with few data and even for new groups.

This motivates hierarchical models.

Modelling strategies for grouped data

For grouped data (e.g. cancer rates per county, grouped by state) the slides discuss three approaches:

  • Complete pooling

  • No pooling

  • Partial pooling (hierarchical modelling)

Each approach corresponds to a different assumption about how group means are related.

Complete pooling

Complete pooling ignores group structure and models all observations as if they came from the same distribution.

Example: cancer rates $y_i$ for all counties in the US (ignoring states):

$$
y_i \mid \mu,\sigma_y \sim \mathcal{N}(\mu,\sigma_y^2), \quad i = 1,\dots,n.
$$

We put priors on $\mu$ and $\sigma_y$ and infer their posterior distributions.

Properties:

  • Very simple model (few parameters, here just $\mu$ and $\sigma_y$).

  • Can estimate the overall mean (e.g. average cancer rate in the US).

  • Cannot say anything about differences between groups (states) because it ignores them.

  • Predictions for individual states are essentially the same (up to noise).

This is sometimes called a complete pooling model because it pools all groups into one.
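A minimal PyMC sketch of the complete pooling model; the synthetic `rates` array and the prior scales are purely illustrative stand-ins for real county-level data.

```python
import numpy as np
import pymc as pm

# Synthetic stand-in for county-level cancer rates (illustrative only).
rng = np.random.default_rng(0)
rates = rng.normal(5.0, 1.5, size=200)

# Complete pooling: one mean and one noise level for all counties,
# ignoring which state each county belongs to.
with pm.Model() as complete_pooling:
    mu = pm.Normal("mu", mu=0, sigma=10)          # prior on the overall mean
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)  # prior on the observation noise
    y = pm.Normal("y", mu=mu, sigma=sigma_y, observed=rates)
    idata = pm.sample()
```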

No pooling

No pooling fits an independent model per group, treating each group as if it had nothing to do with other groups.

Example: cancer rates for counties in state $j$:

$$
y_{ij} \mid \mu_j,\sigma_j \sim \mathcal{N}(\mu_j,\sigma_j^2), \quad i = 1,\dots,n_j,
$$

with priors

$$
\mu_j \sim \text{some prior}, \qquad \sigma_j \sim \text{some prior}, \quad j = 1,\dots,J.
$$

Properties:

  • Very flexible (separate parameters for each group).

  • For $J$ states and two parameters per state, we have roughly $2J$ parameters (e.g. $46 \times 2$ in the slides).

  • Allows state-specific inference but

    • can overfit groups with few observations,

    • gives no clear estimate of the overall mean (country-level),

    • cannot predict for new groups (states with no data), since there is no shared structure.

No pooling ignores that groups are part of a larger population.
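A corresponding PyMC sketch of the no pooling model, reusing the synthetic `rates` from the previous sketch and adding an illustrative `state_idx` array that maps each county to one of $J$ states.

```python
import numpy as np
import pymc as pm

# Assign each synthetic county to one of J illustrative states.
rng = np.random.default_rng(1)
J = 10
state_idx = rng.integers(0, J, size=len(rates))

# No pooling: an independent mean and noise level per state.
with pm.Model() as no_pooling:
    mu_j = pm.Normal("mu_j", mu=0, sigma=10, shape=J)        # one mean per state
    sigma_j = pm.HalfNormal("sigma_j", sigma=10, shape=J)    # one noise level per state
    y = pm.Normal("y", mu=mu_j[state_idx], sigma=sigma_j[state_idx], observed=rates)
    idata = pm.sample()
```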

Partial pooling and hierarchical models

Partial pooling sits between complete pooling and no pooling.

Key idea:

  • Each group has its own parameter (e.g. state mean $\mu_j$),

  • These parameters are themselves assumed to come from a population distribution with its own hyperparameters.

Cancer example: a hierarchical normal model for state means.

Model:

  • County-level data (observations within states):

    $$
    y_{ij} \mid \mu_j,\sigma_y \sim \mathcal{N}(\mu_j,\sigma_y^2), \quad i = 1,\dots,n_j,\; j = 1,\dots,J.
    $$

  • State-level means:

    $$
    \mu_j \mid \mu,\sigma_\mu \sim \mathcal{N}(\mu,\sigma_\mu^2), \quad j = 1,\dots,J.
    $$

  • Hyperpriors (country-level):

    $$
    \mu \sim \text{some prior}, \qquad \sigma_\mu \sim \text{some prior}, \qquad \sigma_y \sim \text{some prior}.
    $$

This is a hierarchical model (also called a multilevel model).
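A PyMC sketch of this partial pooling model, reusing the illustrative `rates`, `state_idx`, and `J` from the sketches above; the prior scales are again arbitrary.

```python
import pymc as pm

# Partial pooling: state means share a common population distribution.
with pm.Model() as partial_pooling:
    # Country-level hyperpriors.
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma_mu = pm.HalfNormal("sigma_mu", sigma=10)
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)
    # State-level means drawn from the shared population distribution.
    mu_j = pm.Normal("mu_j", mu=mu, sigma=sigma_mu, shape=J)
    # County-level observations.
    y = pm.Normal("y", mu=mu_j[state_idx], sigma=sigma_y, observed=rates)
    idata_partial = pm.sample()
```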

Information flow:

  • County-level data inform their state-specific means $\mu_j$.

  • All state means jointly inform the hyperparameters $(\mu,\sigma_\mu)$.

  • Hyperparameters “feed back” to group means, especially for groups with few data.

This leads to the phenomenon of shrinkage.

Shrinkage in hierarchical models

In a hierarchical normal model, posterior estimates of group means $\mu_j$ are shrunk toward the overall mean $\mu$.

For a simple case with known $\sigma_y$ and $\sigma_\mu$, and $n_j$ observations in group $j$, the posterior mean of $\mu_j$ has the form

$$
\hat{\mu}_j^{\text{post}} = w_j \,\bar{y}_j + (1 - w_j)\,\mu,
$$

where

  • $\bar{y}_j$ is the sample mean of group $j$,

  • $\mu$ is the global mean (hyperparameter),

  • $w_j \in (0,1)$ is a weight given by

    $$
    w_j = \frac{n_j / \sigma_y^2}{n_j / \sigma_y^2 + 1 / \sigma_\mu^2} = \frac{n_j}{n_j + \sigma_y^2 / \sigma_\mu^2}.
    $$

Interpretation:

  • If $n_j$ is large (many observations) or $\sigma_y^2$ is small, then $w_j \approx 1$ and $\hat{\mu}_j^{\text{post}} \approx \bar{y}_j$.

    The group mean relies mostly on its own data.

  • If $n_j$ is small or $\sigma_\mu^2$ is small (strong hyperprior), then $w_j$ is smaller and $\hat{\mu}_j^{\text{post}}$ is closer to the global mean $\mu$.

    The group mean is strongly shrunk toward the global mean.

Shrinkage is stronger when:

  • There are few observations in a group,

  • The group mean is far from the global mean,

  • The hyperprior variance $\sigma_\mu^2$ is small (more belief in a tight global distribution).
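A small numerical illustration of the weight formula above, with made-up values for $\sigma_y$, $\sigma_\mu$, and the group sizes.

```python
def shrinkage_weight(n_j, sigma_y, sigma_mu):
    """Weight w_j on the group sample mean in the posterior mean of mu_j."""
    return n_j / (n_j + sigma_y**2 / sigma_mu**2)

# Illustrative values: fixed noise levels, varying group size.
sigma_y, sigma_mu = 2.0, 1.0
for n_j in [1, 5, 20, 100]:
    w = shrinkage_weight(n_j, sigma_y, sigma_mu)
    # Posterior mean: w * group sample mean + (1 - w) * global mean.
    print(f"n_j = {n_j:3d}  ->  w_j = {w:.2f}")
```

Larger groups get weights close to 1 (little shrinkage), while sparsely observed groups are pulled strongly toward the global mean.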

Between-group vs within-group variability

In hierarchical models, it is useful to distinguish:

  • Within-group variability: how much observations vary around the group mean,

  • Between-group variability: how much group means vary around the global mean.

In the hierarchical normal model:

  • Within-group variance (county-level noise):

    $$
    \operatorname{Var}(Y_{ij} \mid \mu_j) = \sigma_y^2.
    $$

  • Between-group variance (variability of state means):

    $$
    \operatorname{Var}(\mu_j) = \sigma_\mu^2.
    $$

Using the law of total variance, the total variance of $Y_{ij}$ (marginally over groups) can be decomposed as

$$
\operatorname{Var}(Y_{ij}) = \mathbb{E}[\operatorname{Var}(Y_{ij} \mid \mu_j)] + \operatorname{Var}(\mathbb{E}[Y_{ij} \mid \mu_j]) = \sigma_y^2 + \sigma_\mu^2.
$$

Interpretation:

  • $\sigma_y^2$ is the typical variability within states (across counties).

  • $\sigma_\mu^2$ is the variability between state means.

  • The relative sizes of these variances indicate how much of the total variability is due to differences between states versus differences within states.

The slides illustrate this decomposition visually using histograms and density plots from Bayes Rules!.
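A quick simulation, with illustrative values for $\mu$, $\sigma_\mu$, and $\sigma_y$, confirming the decomposition and showing the share of total variance attributable to between-group differences.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_mu, sigma_y = 5.0, 1.0, 2.0   # illustrative global mean and SDs

# Simulate many group means and one observation per group (marginal over groups).
mu_j = rng.normal(mu, sigma_mu, size=100_000)   # state-level means
y = rng.normal(mu_j, sigma_y)                   # county-level observations

print(y.var())                                   # close to sigma_y**2 + sigma_mu**2 = 5.0
print(sigma_mu**2 / (sigma_mu**2 + sigma_y**2))  # between-group share of total variance (0.2)
```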

Predictions for new groups in hierarchical models

One advantage of hierarchical models is the ability to make predictions for new groups (e.g. a state with no data).

For a new state jj^\ast with no observed data:

  • The prior for its mean is the population distribution:

    $$
    \mu_{j^\ast} \mid \mu,\sigma_\mu \sim \mathcal{N}(\mu,\sigma_\mu^2).
    $$

  • For a new county in this new state, the predictive distribution is

    $$
    Y_{\text{new}} \mid \mu,\sigma_\mu,\sigma_y \sim \mathcal{N}(\mu,\; \sigma_\mu^2 + \sigma_y^2).
    $$

The variance is larger because we are uncertain both about:

  • The state-level mean $\mu_{j^\ast}$ (epistemic uncertainty at group level),

  • The county-level noise $\sigma_y^2$ (aleatoric uncertainty within groups).

This explains why predictions for states like “Kansas” (not in the dataset) have visibly wider uncertainty bands in the slides.
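In code, such predictions can be obtained by pushing posterior draws of the hyperparameters through the hierarchy; the sketch below assumes the `idata_partial` object from the partial pooling sketch earlier.

```python
import numpy as np

# Posterior-predictive draws for a county in a *new* state, using posterior
# samples of the hyperparameters from the partial pooling sketch above.
post = idata_partial.posterior
rng = np.random.default_rng(2)

mu = post["mu"].values.ravel()
sigma_mu = post["sigma_mu"].values.ravel()
sigma_y = post["sigma_y"].values.ravel()

mu_new_state = rng.normal(mu, sigma_mu)           # a mean for the unseen state
y_new_county = rng.normal(mu_new_state, sigma_y)  # a county observation in that state

# The spread of y_new_county combines sigma_mu (uncertainty about the new
# state's mean) and sigma_y (within-state noise), plus posterior uncertainty
# about the hyperparameters themselves.
print(y_new_county.mean(), y_new_county.std())
```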

Hierarchical linear regression

The same hierarchical ideas extend naturally to regression.

Motivating example from the slides: pulmonary fibrosis progression.

  • Repeated lung volume measurements $y_{ij}$ for patient $j$ at time $x_{ij}$.

  • We expect approximately linear decline per patient, but

    • each patient has their own baseline lung volume,

    • each patient has their own progression rate (slope).

We therefore build a random intercept and slope model:

  • Observation model:

    $$
    y_{ij} \mid \beta_{0j},\beta_{1j},\sigma_y,x_{ij} \sim \mathcal{N}\big(\beta_{0j} + \beta_{1j} x_{ij},\; \sigma_y^2\big),
    $$

    for $i = 1,\dots,n_j$, $j = 1,\dots,J$.

  • Patient-level parameters:

    $$
    \beta_{0j} \mid \beta_0,\sigma_0 \sim \mathcal{N}(\beta_0,\sigma_0^2), \qquad \beta_{1j} \mid \beta_1,\sigma_1 \sim \mathcal{N}(\beta_1,\sigma_1^2).
    $$

  • Hyperpriors:

    $$
    \beta_0,\beta_1,\sigma_0,\sigma_1,\sigma_y \sim \text{priors on appropriate supports}.
    $$

Here:

  • $\beta_0$ and $\beta_1$ describe the global disease level:

    • typical baseline lung volume,

    • typical decline rate.

  • $\beta_{0j}$ and $\beta_{1j}$ describe patient-level deviations around these global averages.

This is a hierarchical linear regression model.
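A PyMC sketch of this random intercept and slope model; the synthetic data, column names, and prior scales are assumptions made purely for illustration, not the dataset used in the slides.

```python
import numpy as np
import pymc as pm

# Illustrative synthetic data: J patients, 8 FVC measurements each.
rng = np.random.default_rng(3)
J = 20
patient_idx = np.repeat(np.arange(J), 8)         # patient index per measurement
weeks = np.tile(np.arange(0, 40, 5), J)          # measurement times (weeks)
true_b0 = rng.normal(3000, 400, size=J)          # baseline lung volume (ml)
true_b1 = rng.normal(-8, 3, size=J)              # decline per week
fvc = rng.normal(true_b0[patient_idx] + true_b1[patient_idx] * weeks, 100)

with pm.Model() as hierarchical_regression:
    # Global (disease-level) intercept and slope, with population spreads.
    beta0 = pm.Normal("beta0", mu=3000, sigma=1000)
    beta1 = pm.Normal("beta1", mu=0, sigma=50)
    sigma0 = pm.HalfNormal("sigma0", sigma=500)
    sigma1 = pm.HalfNormal("sigma1", sigma=20)
    sigma_y = pm.HalfNormal("sigma_y", sigma=200)

    # Patient-level intercepts and slopes drawn around the global values.
    beta0_j = pm.Normal("beta0_j", mu=beta0, sigma=sigma0, shape=J)
    beta1_j = pm.Normal("beta1_j", mu=beta1, sigma=sigma1, shape=J)

    # Observation model: per-patient regression line plus shared noise.
    mu_ij = beta0_j[patient_idx] + beta1_j[patient_idx] * weeks
    y = pm.Normal("y", mu=mu_ij, sigma=sigma_y, observed=fvc)
    idata_reg = pm.sample(target_accept=0.9)
```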

Shrinkage in random intercept and slope models

As in simpler hierarchical models, the random intercept and slope model exhibits shrinkage:

  • Intercepts $\beta_{0j}$ are shrunk toward the global intercept $\beta_0$.

  • Slopes $\beta_{1j}$ are shrunk toward the global slope $\beta_1$.

Intuitively:

  • Patients with many measurements and clear trends have patient-specific estimates dominated by their own data.

  • Patients with few measurements or noisy data have intercepts and slopes that are pulled more strongly toward the global means.

This gives:

  • More bias for poorly observed patients (we borrow strength from the population),

  • But less variance across patient-specific estimates compared to fitting separate regression lines per patient.

The slides illustrate this with a subsample of patients: the more uncertain the individual slope, the more it is shrunk toward the global mean slope.

Hierarchical regression with Bambi

Bambi provides a convenient interface for fitting random intercept and slope models.

Conceptually, a Bambi model like

```python
import bambi as bmb

model = bmb.Model("FVC ~ weeks + (weeks | patient_id)", data=data)
```

implements the hierarchical structure:

  • Fixed (global) effects: overall intercept and slope (disease level),

  • Random (group-specific) effects: patient-specific intercepts and slopes.

Under the hood, Bambi builds a PyMC model with priors on:

  • Global coefficients,

  • Group-level standard deviations (for intercepts and slopes),

  • Residual standard deviation $\sigma_y$ of the observations.

Fitting such a model can be numerically challenging:

  • More parameters and more complex posterior geometries,

  • Potential issues like divergences or low effective sample size,

  • Often need more tuning samples or higher target acceptance rates.

This is why the slides emphasize careful diagnostics when using such models.
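A hedged example of fitting the Bambi model defined above with a larger tuning budget and a higher target acceptance rate; the specific values are illustrative, not prescriptive.

```python
import arviz as az

# More tuning iterations and a higher target acceptance rate often reduce
# divergences in hierarchical posteriors; the exact values are illustrative.
idata = model.fit(draws=2000, tune=2000, target_accept=0.95)

# Standard diagnostics afterwards: r_hat, effective sample size, divergences.
print(az.summary(idata))
```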

Adding group-level predictors

Hierarchical models can include group-level predictors to explain variation in intercepts and slopes.

Example (pulmonary fibrosis):

  • Patient-level covariates: age, sex, smoking status.

  • These can be used to explain differences in initial lung volume (intercept) and progression rate (slope).

One way to write this is:

  • Intercept model:

    $$
    \beta_{0j} = \gamma_{00} + \gamma_{01}\,\text{age}_j + \gamma_{02}\,\text{male}_j + \gamma_{03}\,\text{smoker}_j + u_{0j},
    $$

  • Slope model:

    $$
    \beta_{1j} = \gamma_{10} + \gamma_{11}\,\text{age}_j + \gamma_{12}\,\text{male}_j + \gamma_{13}\,\text{smoker}_j + u_{1j},
    $$

with random effects $u_{0j}$ and $u_{1j}$ having their own prior distributions.

Interpretation:

  • The $\gamma$’s describe how group-level covariates (e.g. age, sex, smoking) affect baseline and trend.

  • The $u$’s capture remaining unexplained patient-level variation.

The slides show that some predictors (e.g. sex, age) may affect intercepts strongly, but not slopes, and warn that these relationships can be distorted by collider effects.
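One way to express such group-level predictors in a Bambi formula is via main effects (shifting intercepts) and interactions with `weeks` (shifting slopes); the column names `age`, `sex`, and `smoking_status` are assumed for illustration.

```python
import bambi as bmb

# Patient-level covariates enter as main effects (intercept model) and as
# interactions with weeks (slope model); column names are illustrative.
model = bmb.Model(
    "FVC ~ weeks + age + sex + smoking_status"
    " + weeks:age + weeks:sex + weeks:smoking_status"
    " + (weeks | patient_id)",
    data=data,
)
idata = model.fit()
```

The main effects play the role of the $\gamma_{0\cdot}$ coefficients above, the interactions the role of the $\gamma_{1\cdot}$ coefficients, and the `(weeks | patient_id)` term supplies the residual random effects $u_{0j}$ and $u_{1j}$.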

Collider effect and caution in causal interpretation

The slides end with an important caveat: hierarchical models fitted to observational data can show spurious associations due to the collider effect.

Example causal diagram:

$$
\text{Smoking} \longrightarrow \text{Pulmonary Fibrosis} \longleftarrow \text{Genetic predisposition}.
$$

Here, pulmonary fibrosis is a collider: it has two incoming arrows.

If we condition on having pulmonary fibrosis (i.e. restrict the dataset to patients with the disease), then:

  • People who do not smoke but still have pulmonary fibrosis are more likely to have a strong genetic predisposition.

  • Among patients with the disease, “not smoking” may appear to be associated with “worse genetics”.

This can create the illusion that smoking is protective, even though it is not.
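A small simulation sketch of this collider effect: smoking and genetic predisposition are generated independently, yet become negatively associated once we condition on having the disease (all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Smoking and genetic predisposition are generated independently.
smoking = rng.binomial(1, 0.3, size=n)
genetics = rng.normal(0, 1, size=n)

# Disease risk increases with both smoking and genetics (disease is the collider).
p_disease = 1 / (1 + np.exp(-(-3 + 2 * smoking + 2 * genetics)))
disease = rng.binomial(1, p_disease)

# Marginally, smoking and genetics are (near) uncorrelated.
print(np.corrcoef(smoking, genetics)[0, 1])                               # ~ 0
# Conditioning on the collider (disease == 1) induces a negative association:
# non-smokers in the diseased subset tend to carry higher genetic risk.
print(np.corrcoef(smoking[disease == 1], genetics[disease == 1])[0, 1])   # < 0
```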

Key message:

  • Hierarchical models help share information and manage uncertainty, but

  • Causal interpretation requires explicit causal reasoning and careful attention to colliders, confounders, and selection bias.

This motivates the topic of causal modelling / Bayesian networks, which is introduced in the following weeks.

Back to the Week 1 motivation

The final slides briefly revisit the motivation for Bayesian methods:

Situations where the Bayesian view is particularly useful:

  • Quantifying uncertainty is central to the problem.

  • Only limited data are available.

  • Prior knowledge needs to be formally incorporated.

  • The model has a graphical / network structure (as in hierarchical models).

Situations where the frequentist view may be perfectly adequate:

  • Abundant data and simple models.

  • Prior information is weak, controversial, or not crucial.

  • Computational simplicity and speed outweigh the benefits of full posterior inference.

In the words (quoted in the slides) of Richard McElreath:

You don’t have to use a chainsaw to cut the birthday cake.

Bayesian hierarchical models are powerful tools—but they should be used where their additional complexity and richness actually help answer the scientific questions at hand.