
Hierarchical Models

Probabilistic graphical model notation

Probabilistic graphical models (PGMs) are graphs that encode conditional dependence structure between variables.

Basic notation from the slides:

  • Observed variable: usually drawn as a shaded node.

  • Latent (unobserved) variable: drawn as an unshaded node.

  • Deterministic variable: drawn as a node with a different border (function of parents).

  • Repeated structure (“plate”): a box with an index indicating repetition, e.g. over observations or groups.

Rule of thumb:

Every unobserved variable that has no incoming arrows needs a prior.

Bayesian inference then provides posterior distributions for all unobserved variables.

Canonical examples:

  • Beta–binomial model:

    • Prior:

      $$\pi \sim \operatorname{Beta}(\alpha,\beta)$$

    • Likelihood:

      $$y \mid \pi \sim \operatorname{Bin}(n,\pi)$$

  • Gamma–Poisson model:

    • Prior:

      $$\lambda \sim \operatorname{Gamma}(s,r)$$

    • Likelihood:

      $$y \mid \lambda \sim \operatorname{Pois}(\lambda)$$

  • Normal likelihood with unknown mean and variance:

    • Priors:

      $$\mu \sim \text{some prior}, \qquad \sigma \sim \text{some positive prior},$$

    • Likelihood (for data points $y_i$):

      $$y_i \mid \mu,\sigma \sim \mathcal{N}(\mu,\sigma^2).$$

  • Simple linear regression:

    • Parameters: intercept $\beta_0$, slope $\beta_1$, noise $\sigma$.

    • Mean function:

      $$\mu_i = \beta_0 + \beta_1 x_i.$$

    • Likelihood:

      $$y_i \mid \beta_0,\beta_1,\sigma,x_i \sim \mathcal{N}(\mu_i,\sigma^2).$$

In all cases, the PGM makes explicit which variables depend on which, and where priors must be specified.
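As a concrete illustration, here is a minimal PyMC sketch of the beta–binomial model; the observed count, the number of trials, and the prior hyperparameters are made-up values chosen purely for illustration.

```python
import pymc as pm

# Beta-binomial model: y successes out of n trials (illustrative values).
y_obs, n = 6, 20

with pm.Model() as beta_binomial:
    # Prior on the success probability: an unobserved root node, so it needs a prior.
    pi = pm.Beta("pi", alpha=2, beta=2)
    # Likelihood: the shaded (observed) node in the PGM, conditioned on pi.
    y = pm.Binomial("y", n=n, p=pi, observed=y_obs)
    idata = pm.sample()
```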

Grouped data and hierarchical structure

Many real datasets have a grouped or multilevel structure.

Examples from the slides:

  • Cancer rates:

    • Counties nested within states, states nested within the USA.

  • Medical data:

    • Repeated measurements nested within patients, patients nested within a population.

Such data naturally form hierarchies:

  • Level 1: individual observations (e.g. county, weekly measurement),

  • Level 2: groups (e.g. state, patient),

  • Level 3: higher-level population (e.g. country, disease level).

We would like models that:

  • Respect the fact that observations within the same group are related,

  • Allow information sharing across groups,

  • Give reasonable predictions for groups with few data and even for new groups.

This motivates hierarchical models.

Modelling strategies for grouped data

For grouped data (e.g. cancer rates per county, grouped by state) the slides discuss three approaches:

  • Complete pooling

  • No pooling

  • Partial pooling (hierarchical modelling)

Each approach corresponds to a different assumption about how group means are related.

Complete pooling

Complete pooling ignores group structure and models all observations as if they came from the same distribution.

Example: cancer rates $y_i$ for all counties in the US (ignoring states):

$$
y_i \mid \mu,\sigma_y \sim \mathcal{N}(\mu,\sigma_y^2), \quad i = 1,\dots,n.
$$

We put priors on $\mu$ and $\sigma_y$ and infer their posterior distributions.

Properties:

  • Very simple model (few parameters, here just $\mu$ and $\sigma_y$).

  • Can estimate the overall mean (e.g. average cancer rate in the US).

  • Cannot say anything about differences between groups (states) because it ignores them.

  • Predictions for individual states are essentially the same (up to noise).

This is sometimes called a complete pooling model because it pools all groups into one.
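A minimal PyMC sketch of the complete pooling model; the synthetic `rates` array and the prior scales are purely illustrative stand-ins for real county-level data.

```python
import numpy as np
import pymc as pm

# Synthetic stand-in for county-level cancer rates (illustrative only).
rng = np.random.default_rng(0)
rates = rng.normal(5.0, 1.5, size=200)

# Complete pooling: one mean and one noise level for all counties,
# ignoring which state each county belongs to.
with pm.Model() as complete_pooling:
    mu = pm.Normal("mu", mu=0, sigma=10)          # prior on the overall mean
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)  # prior on the observation noise
    y = pm.Normal("y", mu=mu, sigma=sigma_y, observed=rates)
    idata = pm.sample()
```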

No pooling

No pooling fits an independent model per group, treating each group as if it had nothing to do with other groups.

Example: cancer rates for counties in state $j$:

$$
y_{ij} \mid \mu_j,\sigma_j \sim \mathcal{N}(\mu_j,\sigma_j^2), \quad i = 1,\dots,n_j,
$$

with priors

$$
\mu_j \sim \text{some prior}, \qquad \sigma_j \sim \text{some prior}, \quad j = 1,\dots,J.
$$

Properties:

  • Very flexible (separate parameters for each group).

  • For $J$ states and two parameters per state, we have roughly $2J$ parameters (e.g. $46 \times 2$ in the slides).

  • Allows state-specific inference but

    • can overfit groups with few observations,

    • gives no clear estimate of the overall mean (country-level),

    • cannot predict for new groups (states with no data), since there is no shared structure.

No pooling ignores that groups are part of a larger population.
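A corresponding PyMC sketch of the no pooling model, reusing the synthetic `rates` from the previous sketch and adding an illustrative `state_idx` array that maps each county to one of $J$ states.

```python
import numpy as np
import pymc as pm

# Assign each synthetic county to one of J illustrative states.
rng = np.random.default_rng(1)
J = 10
state_idx = rng.integers(0, J, size=len(rates))

# No pooling: an independent mean and noise level per state.
with pm.Model() as no_pooling:
    mu_j = pm.Normal("mu_j", mu=0, sigma=10, shape=J)        # one mean per state
    sigma_j = pm.HalfNormal("sigma_j", sigma=10, shape=J)    # one noise level per state
    y = pm.Normal("y", mu=mu_j[state_idx], sigma=sigma_j[state_idx], observed=rates)
    idata = pm.sample()
```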

Partial pooling and hierarchical models

Partial pooling sits between complete pooling and no pooling.

Key idea:

  • Each group has its own parameter (e.g. state mean $\mu_j$),

  • These parameters are themselves assumed to come from a population distribution with its own hyperparameters.

Cancer example: a hierarchical normal model for state means.

Model:

  • County-level data (observations within states):

    $$
    y_{ij} \mid \mu_j,\sigma_y \sim \mathcal{N}(\mu_j,\sigma_y^2), \quad i = 1,\dots,n_j,\; j = 1,\dots,J.
    $$

  • State-level means:

    $$
    \mu_j \mid \mu,\sigma_\mu \sim \mathcal{N}(\mu,\sigma_\mu^2), \quad j = 1,\dots,J.
    $$

  • Hyperpriors (country-level):

    $$
    \mu \sim \text{some prior}, \qquad \sigma_\mu \sim \text{some prior}, \qquad \sigma_y \sim \text{some prior}.
    $$

This is a hierarchical model (also called a multilevel model).
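A PyMC sketch of this partial pooling model, reusing the illustrative `rates`, `state_idx`, and `J` from the sketches above; the prior scales are again arbitrary.

```python
import pymc as pm

# Partial pooling: state means share a common population distribution.
with pm.Model() as partial_pooling:
    # Country-level hyperpriors.
    mu = pm.Normal("mu", mu=0, sigma=10)
    sigma_mu = pm.HalfNormal("sigma_mu", sigma=10)
    sigma_y = pm.HalfNormal("sigma_y", sigma=10)
    # State-level means drawn from the shared population distribution.
    mu_j = pm.Normal("mu_j", mu=mu, sigma=sigma_mu, shape=J)
    # County-level observations.
    y = pm.Normal("y", mu=mu_j[state_idx], sigma=sigma_y, observed=rates)
    idata_partial = pm.sample()
```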

Information flow:

  • County-level data inform their state-specific means $\mu_j$.

  • All state means jointly inform the hyperparameters $(\mu,\sigma_\mu)$.

  • Hyperparameters “feed back” to group means, especially for groups with few data.

This leads to the phenomenon of shrinkage.

Shrinkage in hierarchical models

In a hierarchical normal model, posterior estimates of group means $\mu_j$ are shrunk toward the overall mean $\mu$.

For a simple case with known $\sigma_y$ and $\sigma_\mu$, and $n_j$ observations in group $j$, the posterior mean of $\mu_j$ has the form

$$
\hat{\mu}_j^{\text{post}} = w_j \,\bar{y}_j + (1 - w_j)\,\mu,
$$

where

  • $\bar{y}_j$ is the sample mean of group $j$,

  • $\mu$ is the global mean (hyperparameter),

  • $w_j \in (0,1)$ is a weight given by

    $$
    w_j = \frac{n_j / \sigma_y^2}{n_j / \sigma_y^2 + 1 / \sigma_\mu^2} = \frac{n_j}{n_j + \sigma_y^2 / \sigma_\mu^2}.
    $$

Interpretation:

  • If $n_j$ is large (many observations) or $\sigma_y^2$ is small, then $w_j \approx 1$ and $\hat{\mu}_j^{\text{post}} \approx \bar{y}_j$.

    The group mean relies mostly on its own data.

  • If $n_j$ is small or $\sigma_\mu^2$ is small (strong hyperprior), then $w_j$ is smaller and $\hat{\mu}_j^{\text{post}}$ is closer to the global mean $\mu$.

    The group mean is strongly shrunk toward the global mean.

Shrinkage is stronger when:

  • There are few observations in a group,

  • The group mean is far from the global mean,

  • The hyperprior variance $\sigma_\mu^2$ is small (more belief in a tight global distribution).
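A small numerical illustration of the weight formula above, with made-up values for $\sigma_y$, $\sigma_\mu$, and the group sizes.

```python
def shrinkage_weight(n_j, sigma_y, sigma_mu):
    """Weight w_j on the group sample mean in the posterior mean of mu_j."""
    return n_j / (n_j + sigma_y**2 / sigma_mu**2)

# Illustrative values: fixed noise levels, varying group size.
sigma_y, sigma_mu = 2.0, 1.0
for n_j in [1, 5, 20, 100]:
    w = shrinkage_weight(n_j, sigma_y, sigma_mu)
    # Posterior mean: w * group sample mean + (1 - w) * global mean.
    print(f"n_j = {n_j:3d}  ->  w_j = {w:.2f}")
```

Larger groups get weights close to 1 (little shrinkage), while sparsely observed groups are pulled strongly toward the global mean.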

Between-group vs within-group variability

In hierarchical models, it is useful to distinguish:

  • Within-group variability: how much observations vary around the group mean,

  • Between-group variability: how much group means vary around the global mean.

In the hierarchical normal model:

  • Within-group variance (county-level noise):

    $$
    \operatorname{Var}(Y_{ij} \mid \mu_j) = \sigma_y^2.
    $$

  • Between-group variance (variability of state means):

    $$
    \operatorname{Var}(\mu_j) = \sigma_\mu^2.
    $$

Using the law of total variance, the total variance of $Y_{ij}$ (marginally over groups) can be decomposed as

$$
\operatorname{Var}(Y_{ij}) = \mathbb{E}[\operatorname{Var}(Y_{ij} \mid \mu_j)] + \operatorname{Var}(\mathbb{E}[Y_{ij} \mid \mu_j]) = \sigma_y^2 + \sigma_\mu^2.
$$

Interpretation:

  • $\sigma_y^2$ is the typical variability within states (across counties).

  • $\sigma_\mu^2$ is the variability between state means.

  • The relative sizes of these variances indicate how much of the total variability is due to differences between states versus differences within states.

The slides illustrate this decomposition visually using histograms and density plots from Bayes Rules!.
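A quick simulation, with illustrative values for $\mu$, $\sigma_\mu$, and $\sigma_y$, confirming the decomposition and showing the share of total variance attributable to between-group differences.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma_mu, sigma_y = 5.0, 1.0, 2.0   # illustrative global mean and SDs

# Simulate many group means and one observation per group (marginal over groups).
mu_j = rng.normal(mu, sigma_mu, size=100_000)   # state-level means
y = rng.normal(mu_j, sigma_y)                   # county-level observations

print(y.var())                                   # close to sigma_y**2 + sigma_mu**2 = 5.0
print(sigma_mu**2 / (sigma_mu**2 + sigma_y**2))  # between-group share of total variance (0.2)
```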

Predictions for new groups in hierarchical models

One advantage of hierarchical models is the ability to make predictions for new groups (e.g. a state with no data).

For a new state jj^\ast with no observed data:

  • The prior for its mean is the population distribution:

    $$
    \mu_{j^\ast} \mid \mu,\sigma_\mu \sim \mathcal{N}(\mu,\sigma_\mu^2).
    $$

  • For a new county in this new state, the predictive distribution is

    $$
    Y_{\text{new}} \mid \mu,\sigma_\mu,\sigma_y \sim \mathcal{N}(\mu,\; \sigma_\mu^2 + \sigma_y^2).
    $$

The variance is larger because we are uncertain both about:

  • The state-level mean $\mu_{j^\ast}$ (epistemic uncertainty at group level),

  • The county-level noise $\sigma_y^2$ (aleatoric uncertainty within groups).

This explains why predictions for states like “Kansas” (not in the dataset) have visibly wider uncertainty bands in the slides.
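In code, such predictions can be obtained by pushing posterior draws of the hyperparameters through the hierarchy; the sketch below assumes the `idata_partial` object from the partial pooling sketch earlier.

```python
import numpy as np

# Posterior-predictive draws for a county in a *new* state, using posterior
# samples of the hyperparameters from the partial pooling sketch above.
post = idata_partial.posterior
rng = np.random.default_rng(2)

mu = post["mu"].values.ravel()
sigma_mu = post["sigma_mu"].values.ravel()
sigma_y = post["sigma_y"].values.ravel()

mu_new_state = rng.normal(mu, sigma_mu)           # a mean for the unseen state
y_new_county = rng.normal(mu_new_state, sigma_y)  # a county observation in that state

# The spread of y_new_county combines sigma_mu (uncertainty about the new
# state's mean) and sigma_y (within-state noise), plus posterior uncertainty
# about the hyperparameters themselves.
print(y_new_county.mean(), y_new_county.std())
```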

Hierarchical linear regression

The same hierarchical ideas extend naturally to regression.

Motivating example from the slides: pulmonary fibrosis progression.

  • Repeated lung volume measurements $y_{ij}$ for patient $j$ at time $x_{ij}$.

  • We expect approximately linear decline per patient, but

    • each patient has their own baseline lung volume,

    • each patient has their own progression rate (slope).

We therefore build a random intercept and slope model:

  • Observation model:

    $$
    y_{ij} \mid \beta_{0j},\beta_{1j},\sigma_y,x_{ij} \sim \mathcal{N}\big(\beta_{0j} + \beta_{1j} x_{ij},\; \sigma_y^2\big),
    $$

    for $i = 1,\dots,n_j$, $j = 1,\dots,J$.

  • Patient-level parameters:

    $$
    \beta_{0j} \mid \beta_0,\sigma_0 \sim \mathcal{N}(\beta_0,\sigma_0^2), \qquad \beta_{1j} \mid \beta_1,\sigma_1 \sim \mathcal{N}(\beta_1,\sigma_1^2).
    $$

  • Hyperpriors:

    $$
    \beta_0,\beta_1,\sigma_0,\sigma_1,\sigma_y \sim \text{priors on appropriate supports}.
    $$

Here:

  • $\beta_0$ and $\beta_1$ describe the global disease level:

    • typical baseline lung volume,

    • typical decline rate.

  • $\beta_{0j}$ and $\beta_{1j}$ describe patient-level deviations around these global averages.

This is a hierarchical linear regression model.
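A PyMC sketch of this random intercept and slope model; the synthetic data, column names, and prior scales are assumptions made purely for illustration, not the dataset used in the slides.

```python
import numpy as np
import pymc as pm

# Illustrative synthetic data: J patients, 8 FVC measurements each.
rng = np.random.default_rng(3)
J = 20
patient_idx = np.repeat(np.arange(J), 8)         # patient index per measurement
weeks = np.tile(np.arange(0, 40, 5), J)          # measurement times (weeks)
true_b0 = rng.normal(3000, 400, size=J)          # baseline lung volume (ml)
true_b1 = rng.normal(-8, 3, size=J)              # decline per week
fvc = rng.normal(true_b0[patient_idx] + true_b1[patient_idx] * weeks, 100)

with pm.Model() as hierarchical_regression:
    # Global (disease-level) intercept and slope, with population spreads.
    beta0 = pm.Normal("beta0", mu=3000, sigma=1000)
    beta1 = pm.Normal("beta1", mu=0, sigma=50)
    sigma0 = pm.HalfNormal("sigma0", sigma=500)
    sigma1 = pm.HalfNormal("sigma1", sigma=20)
    sigma_y = pm.HalfNormal("sigma_y", sigma=200)

    # Patient-level intercepts and slopes drawn around the global values.
    beta0_j = pm.Normal("beta0_j", mu=beta0, sigma=sigma0, shape=J)
    beta1_j = pm.Normal("beta1_j", mu=beta1, sigma=sigma1, shape=J)

    # Observation model: per-patient regression line plus shared noise.
    mu_ij = beta0_j[patient_idx] + beta1_j[patient_idx] * weeks
    y = pm.Normal("y", mu=mu_ij, sigma=sigma_y, observed=fvc)
    idata_reg = pm.sample(target_accept=0.9)
```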

Shrinkage in random intercept and slope models

As in simpler hierarchical models, the random intercept and slope model exhibits shrinkage:

  • Intercepts $\beta_{0j}$ are shrunk toward the global intercept $\beta_0$.

  • Slopes $\beta_{1j}$ are shrunk toward the global slope $\beta_1$.

Intuitively:

  • Patients with many measurements and clear trends have patient-specific estimates dominated by their own data.

  • Patients with few measurements or noisy data have intercepts and slopes that are pulled more strongly toward the global means.

This gives:

  • More bias for poorly observed patients (we borrow strength from the population),

  • But less variance across patient-specific estimates compared to fitting separate regression lines per patient.

The slides illustrate this with a subsample of patients: the more uncertain the individual slope, the more it is shrunk toward the global mean slope.

Hierarchical regression with Bambi

Bambi provides a convenient interface for fitting random intercept and slope models.

Conceptually, a Bambi model like

```python
import bambi as bmb

model = bmb.Model("FVC ~ weeks + (weeks | patient_id)", data=data)
```

implements the hierarchical structure:

  • Fixed (global) effects: overall intercept and slope (disease level),

  • Random (group-specific) effects: patient-specific intercepts and slopes.

Under the hood, Bambi builds a PyMC model with priors on:

  • Global coefficients,

  • Group-level standard deviations (for intercepts and slopes),

  • Residual standard deviation $\sigma_y$ of the observations.

Fitting such a model can be numerically challenging:

  • More parameters and more complex posterior geometries,

  • Potential issues like divergences or low effective sample size,

  • Often need more tuning samples or higher target acceptance rates.

This is why the slides emphasize careful diagnostics when using such models.
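A hedged example of fitting the Bambi model defined above with a larger tuning budget and a higher target acceptance rate; the specific values are illustrative, not prescriptive.

```python
import arviz as az

# More tuning iterations and a higher target acceptance rate often reduce
# divergences in hierarchical posteriors; the exact values are illustrative.
idata = model.fit(draws=2000, tune=2000, target_accept=0.95)

# Standard diagnostics afterwards: r_hat, effective sample size, divergences.
print(az.summary(idata))
```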

Adding group-level predictors

Hierarchical models can include group-level predictors to explain variation in intercepts and slopes.

Example (pulmonary fibrosis):

  • Patient-level covariates: age, sex, smoking status.

  • These can be used to explain differences in initial lung volume (intercept) and progression rate (slope).

One way to write this is:

  • Intercept model:

    $$
    \beta_{0j} = \gamma_{00} + \gamma_{01}\,\text{age}_j + \gamma_{02}\,\text{male}_j + \gamma_{03}\,\text{smoker}_j + u_{0j},
    $$

  • Slope model:

    $$
    \beta_{1j} = \gamma_{10} + \gamma_{11}\,\text{age}_j + \gamma_{12}\,\text{male}_j + \gamma_{13}\,\text{smoker}_j + u_{1j},
    $$

with random effects $u_{0j}$ and $u_{1j}$ having their own prior distributions.

Interpretation:

  • The $\gamma$’s describe how group-level covariates (e.g. age, sex, smoking) affect baseline and trend.

  • The $u$’s capture remaining unexplained patient-level variation.

The slides show that some predictors (e.g. sex, age) may affect intercepts strongly, but not slopes, and warn that these relationships can be distorted by collider effects.
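One way to express such group-level predictors in a Bambi formula is via main effects (shifting intercepts) and interactions with `weeks` (shifting slopes); the column names `age`, `sex`, and `smoking_status` are assumed for illustration.

```python
import bambi as bmb

# Patient-level covariates enter as main effects (intercept model) and as
# interactions with weeks (slope model); column names are illustrative.
model = bmb.Model(
    "FVC ~ weeks + age + sex + smoking_status"
    " + weeks:age + weeks:sex + weeks:smoking_status"
    " + (weeks | patient_id)",
    data=data,
)
idata = model.fit()
```

The main effects play the role of the $\gamma_{0\cdot}$ coefficients above, the interactions the role of the $\gamma_{1\cdot}$ coefficients, and the `(weeks | patient_id)` term supplies the residual random effects $u_{0j}$ and $u_{1j}$.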

Collider effect and caution in causal interpretation

The slides end with an important caveat: hierarchical models fitted to observational data can show spurious associations due to the collider effect.

Example causal diagram:

$$
\text{Smoking} \longrightarrow \text{Pulmonary Fibrosis} \longleftarrow \text{Genetic predisposition}.
$$

Here, pulmonary fibrosis is a collider: it has two incoming arrows.

If we condition on having pulmonary fibrosis (i.e. restrict the dataset to patients with the disease), then:

  • People who do not smoke but still have pulmonary fibrosis are more likely to have a strong genetic predisposition.

  • Among patients with the disease, “not smoking” may appear to be associated with “worse genetics”.

This can create the illusion that smoking is protective, even though it is not.
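A small simulation sketch of this collider effect: smoking and genetic predisposition are generated independently, yet become negatively associated once we condition on having the disease (all parameter values are made up).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Smoking and genetic predisposition are generated independently.
smoking = rng.binomial(1, 0.3, size=n)
genetics = rng.normal(0, 1, size=n)

# Disease risk increases with both smoking and genetics (disease is the collider).
p_disease = 1 / (1 + np.exp(-(-3 + 2 * smoking + 2 * genetics)))
disease = rng.binomial(1, p_disease)

# Marginally, smoking and genetics are (near) uncorrelated.
print(np.corrcoef(smoking, genetics)[0, 1])                               # ~ 0
# Conditioning on the collider (disease == 1) induces a negative association:
# non-smokers in the diseased subset tend to carry higher genetic risk.
print(np.corrcoef(smoking[disease == 1], genetics[disease == 1])[0, 1])   # < 0
```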

Key message:

  • Hierarchical models help share information and manage uncertainty, but

  • Causal interpretation requires explicit causal reasoning and careful attention to colliders, confounders, and selection bias.

This motivates the topic of causal modelling / Bayesian networks, which is introduced in the following weeks.

Back to the Week 1 motivation

The final slides briefly revisit the motivation for Bayesian methods:

Situations where the Bayesian view is particularly useful:

  • Quantifying uncertainty is central to the problem.

  • Only limited data are available.

  • Prior knowledge needs to be formally incorporated.

  • The model has a graphical / network structure (as in hierarchical models).

Situations where the frequentist view may be perfectly adequate:

  • Abundant data and simple models.

  • Prior information is weak, controversial, or not crucial.

  • Computational simplicity and speed outweigh the benefits of full posterior inference.

In the words (quoted in the slides) of Richard McElreath:

You don’t have to use a chainsaw to cut the birthday cake.

Bayesian hierarchical models are powerful tools—but they should be used where their additional complexity and richness actually help answer the scientific questions at hand.