Bayesian Networks - Causal Graphs
Causal AI and the limits of pure prediction¶
Many impressive machine learning systems (including large language models) excel at prediction:
Given an input $X$, they can accurately predict an output $Y$.
However, prediction alone is not sufficient when we care about:
Interventions: What happens if we change $X$?
Policy decisions: Should we give a drug? Launch a campaign? Change a workflow?
Fairness and bias: Is a model discriminating against a particular group?
Explanation and trust: Why did this happen? What are the causes?
A central message of the slides:
Causality is the missing link between pattern recognition and reasoning.
Causal questions require knowledge about how variables influence each other, not just how they correlate in data.
Causality vs. correlation and Simpson’s paradox¶
The slides use examples like a drug study and exercise–cholesterol relationships to show:
Purely observational statistics can lead to paradoxes.
Aggregated data can suggest one conclusion, while stratified data suggest the opposite.
A classic example is Simpson’s paradox:
A treatment seems worse overall, but better for each subgroup (e.g. men and women separately).
The paradox is resolved when we realize that a confounder (e.g. gender) influences both:
treatment assignment (who gets the drug), and
outcome (recovery).
Key idea:
A confounder is a variable that is a common cause of both treatment and outcome.
To correctly assess the causal effect of a treatment, we must control for confounders.
Informally:
Causality is about understanding which variables produce changes in others, not just which ones move together.
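The confounder story above can be made concrete with a small numerical sketch. The counts below are purely illustrative (they are the classic treatment/subgroup numbers often used to demonstrate the reversal, not data from the slides), and the adjustment at the end is the stratified, "control for the confounder" estimate:

```python
# Hypothetical (recovered, total) counts illustrating Simpson's paradox.
data = {
    "men":   {"drug": (81, 87),   "no_drug": (234, 270)},
    "women": {"drug": (192, 263), "no_drug": (55, 80)},
}

def rate(rec_total):
    rec, total = rec_total
    return rec / total

# Within each subgroup, the drug looks better ...
for group, arms in data.items():
    assert rate(arms["drug"]) > rate(arms["no_drug"])

# ... but pooled over subgroups, it looks worse.
def pooled(arm):
    rec = sum(data[g][arm][0] for g in data)
    total = sum(data[g][arm][1] for g in data)
    return rec / total

pooled_drug, pooled_no = pooled("drug"), pooled("no_drug")

# Controlling for the confounder: average the subgroup-specific rates,
# weighted by the marginal prevalence of each subgroup.
n_total = sum(data[g][a][1] for g in data for a in data[g])
weights = {g: sum(data[g][a][1] for a in data[g]) / n_total for g in data}
adjusted_drug = sum(weights[g] * rate(data[g]["drug"]) for g in data)
adjusted_no = sum(weights[g] * rate(data[g]["no_drug"]) for g in data)
```

With these numbers the pooled comparison favors "no drug" while the confounder-adjusted comparison favors the drug, which is exactly the reversal the text describes.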
Informal definition of causation¶
The slides give an intuitive definition:
A variable $X$ is a direct cause of a variable $Y$ if $X$ appears in the function that assigns $Y$'s value.
A variable $X$ is a cause of $Y$ if it is a direct cause of $Y$, or a cause of some other variable that causes $Y$.
In other words:
If changing $X$ (while holding all other relevant inputs fixed) would change $Y$, then $X$ is a cause of $Y$.
Causal relations are fundamentally about what would happen under interventions, not just observed co-occurrence.
This motivates a more formal framework: structural causal models.
Structural Causal Models (SCMs)¶
A structural causal model (SCM) describes how variables in a system are generated from exogenous noise and other variables via structural equations.
Formal definition:
A structural causal model $M = (V, U, F, P(U))$ consists of:
$V = \{V_1, \dots, V_n\}$: endogenous variables
Determined within the model.
Each $V_i$ has at least one cause (either exogenous or endogenous).
$U = \{U_1, \dots, U_m\}$: exogenous variables
External to the model; we do not model their causes.
Represent background factors and noise.
$F = \{f_1, \dots, f_n\}$: structural functions
Each $V_i$ is determined by a function
$$V_i = f_i(\mathrm{PA}_i, U_i),$$
where:
$\mathrm{PA}_i \subseteq V$ are the endogenous parents (causes) of $V_i$,
$U_i \subseteq U$ are the exogenous parents of $V_i$.
$P(U)$: joint probability distribution over the exogenous variables $U$.
Assumption in the slides:
The model is recursive (or acyclic): there are no feedback loops or cycles such that a variable is ultimately a cause of itself.
Example of structural equations¶
Example from the slides: exercise, diet, and fitness level.
Variables:
$X$: amount of exercise,
$Y$: diet quality,
$Z$: fitness level.
Exogenous variables:
$U_X$: external motivation affecting exercise,
$U_Y$: education / culture affecting diet,
$U_Z$: genetic predisposition and other unobserved factors affecting fitness.
Structural equations:
$$X = f_X(U_X), \qquad Y = f_Y(U_Y), \qquad Z = f_Z(X, Y, U_Z).$$
Interpretation:
$X$ and $Y$ are causes of $Z$ because they appear in the structural function $f_Z$.
$U_X$, $U_Y$, and $U_Z$ are exogenous and are not caused by any variable in $V$.
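The exercise/diet/fitness SCM can be sampled directly from its structural equations. The functional forms and coefficients below are assumptions for illustration (the slides only specify which variables enter each function):

```python
import random

# Minimal sketch of the exercise/diet/fitness SCM with assumed
# linear functional forms and standard-normal exogenous noise.
def sample_scm(rng):
    # Exogenous variables: background factors, drawn from P(U).
    u_x = rng.gauss(0, 1)  # motivation
    u_y = rng.gauss(0, 1)  # education / culture
    u_z = rng.gauss(0, 1)  # genetics and other unobserved factors
    # Structural equations: each endogenous variable is a
    # deterministic function of its parents and its noise term.
    x = u_x                      # exercise
    y = u_y                      # diet
    z = 2.0 * x + 1.0 * y + u_z  # fitness, caused by exercise and diet
    return x, y, z

rng = random.Random(0)
samples = [sample_scm(rng) for _ in range(10_000)]
```

Once the structural equations are fixed, every sample is generated by first drawing the exogenous variables and then evaluating the functions in causal order.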
Structural Equation Models (SEMs)¶
A structural equation model (SEM) is a special case of an SCM with:
A fixed causal ordering of variables.
Linear structural functions.
Typically Gaussian exogenous noise.
For example, with variables $X_1, X_2, X_3$ and exogenous errors $\varepsilon_1, \varepsilon_2, \varepsilon_3$:
$$X_1 = \varepsilon_1, \qquad X_2 = \alpha_{21} X_1 + \varepsilon_2, \qquad X_3 = \alpha_{31} X_1 + \alpha_{32} X_2 + \varepsilon_3.$$
Assumptions:
Exogenous errors are often taken to be independent Gaussians:
$$\varepsilon = (\varepsilon_1, \varepsilon_2, \varepsilon_3) \sim \mathcal{N}(0, \Sigma),$$
where $\Sigma$ is typically diagonal.
SEMs are widely used in fields like psychology and economics, where linear relationships and Gaussian noise are reasonable approximations.
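A linear-Gaussian SEM of this shape is easy to simulate, because the fixed causal ordering lets us evaluate the equations one after another. The coefficient values below are assumptions for illustration; the sketch also checks one implied second moment (with unit-variance errors and zero means, $\mathrm{Cov}(X_1, X_2) = \alpha_{21}$):

```python
import random

# Sketch of a recursive linear-Gaussian SEM X1 -> X2 -> X3
# (coefficients chosen for illustration only).
a21, a31, a32 = 0.8, 0.3, 0.5

def sample_sem(rng):
    e1, e2, e3 = (rng.gauss(0, 1) for _ in range(3))
    x1 = e1
    x2 = a21 * x1 + e2
    x3 = a31 * x1 + a32 * x2 + e3
    return x1, x2, x3

rng = random.Random(1)
n = 50_000
xs = [sample_sem(rng) for _ in range(n)]

# With standard-normal errors, Cov(X1, X2) = a21 = 0.8.
cov12 = sum(x1 * x2 for x1, x2, _ in xs) / n
```

The same recursive evaluation works for any acyclic SEM: topologically order the variables, then substitute each equation in turn.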
Graphical causal models¶
Every SCM can be associated with a graphical causal model (a directed graph):
Nodes: variables in $V$ and $U$.
Directed edges: from each parent to its child, according to the structural equations.
Graph construction:
For each structural equation
$$V_i = f_i(\mathrm{PA}_i, U_i),$$
draw directed edges from each variable in $\mathrm{PA}_i$ and $U_i$ to $V_i$.
Properties:
Exogenous variables ($U$) appear as root nodes (no parents).
Endogenous variables ($V$) are descendants of at least one exogenous variable.
In models considered here, the graph is a directed acyclic graph (DAG):
No cycles or self-loops.
You cannot follow directed edges and return to the same node.
Graphical definition of causation:
If there is a directed edge $X \to Y$, then $X$ is a direct cause of $Y$.
If there is a directed path from $X$ to $Y$, then $X$ is a potential cause of $Y$.
Basics of graph terminology¶
Some basic concepts from graph theory used in the slides:
A graph consists of nodes (vertices) and edges (links).
Edges can be:
Undirected: no arrowheads, just a line between nodes.
Directed: arrows indicating direction.
In a directed graph:
If there is a directed edge $X \to Y$:
$X$ is a parent of $Y$,
$Y$ is a child of $X$.
If there is a directed path $X \to \dots \to Y$:
$X$ is an ancestor of $Y$,
$Y$ is a descendant of $X$.
A Directed Acyclic Graph (DAG) is a directed graph with no cycles:
You cannot start at a node, follow directed edges, and come back to the same node.
DAGs are the graph type used in Bayesian networks and many causal models.
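The parent/child and ancestor/descendant relations are straightforward to compute on a DAG stored as an adjacency map. The four-node graph below is an assumed example, not one from the slides:

```python
# A small example DAG with edges A->B, A->C, B->D, C->D,
# stored as a node -> list-of-children map.
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def descendants(g, node):
    """All nodes reachable from `node` by following directed edges."""
    seen = set()
    stack = list(g[node])
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(g[n])
    return seen

def ancestors(g, node):
    """All nodes from which `node` is reachable."""
    return {n for n in g if node in descendants(g, n)}
```

Acyclicity means no node is ever its own descendant, which is exactly the "cannot return to the same node" condition above.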
Why use graphical models instead of raw SCM equations?¶
SCMs specify exact functional relationships $V_i = f_i(\mathrm{PA}_i, U_i)$, which can be difficult to elicit and manipulate directly.
Graphical models offer:
A qualitative description of causal structure (who causes whom),
A compact way to encode conditional independencies between variables,
A way to factorize joint probability distributions into simpler pieces,
An intuitive visual language for communicating assumptions.
Typically:
We first think in terms of graphs / DAGs,
Then attach local probability distributions (Bayesian networks),
And sometimes further refine to structural equations when functional forms are needed.
Bayesian networks: definition¶
Let $X_1, \dots, X_n$ be random variables.
A Bayesian network is:
A DAG whose nodes are the variables $X_1, \dots, X_n$.
For each node $X_i$, a local conditional distribution
$$P(X_i \mid \mathrm{Pa}(X_i)),$$
where $\mathrm{Pa}(X_i)$ are the parents of $X_i$ in the DAG.
The Bayesian network defines a joint distribution over all variables by the product
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)).$$
Key points:
For nodes with no parents, $P(X_i \mid \mathrm{Pa}(X_i))$ is just a marginal $P(X_i)$.
For nodes with parents, we specify conditional probabilities given the parent configuration.
The arrows in a probabilistic Bayesian network encode conditional dependence structure (not automatically causality).
The slides distinguish later between probabilistic Bayesian networks and causal Bayesian networks.
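The product-rule definition can be illustrated on the smallest possible network, a single edge Rain → WetGrass. The CPT numbers below are assumed for illustration:

```python
# Two-node Bayesian network Rain -> WetGrass, with assumed CPTs,
# illustrating the factorization P(R, W) = P(R) * P(W | R).
p_rain = {True: 0.2, False: 0.8}
p_wet_given_rain = {
    True:  {True: 0.9, False: 0.1},  # P(W | R = true)
    False: {True: 0.2, False: 0.8},  # P(W | R = false)
}

def joint(r, w):
    # The joint is the product of the local distributions.
    return p_rain[r] * p_wet_given_rain[r][w]

# Sanity check: the joint sums to 1 over all assignments.
total = sum(joint(r, w) for r in (True, False) for w in (True, False))
```

Every Bayesian network works this way: multiply one local factor per node, each conditioned only on that node's parents.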
Explaining away with Bayesian networks¶
A classic structure from the slides:
$B$: burglary (0/1),
$E$: earthquake (0/1),
$A$: alarm (0/1).
Graph: $B \to A \leftarrow E$.
Local distributions:
$P(B)$ and $P(E)$ for burglary and earthquake.
$P(A \mid B, E)$ for the alarm, which depends on both $B$ and $E$.
Joint factorization:
$$P(B, E, A) = P(B)\, P(E)\, P(A \mid B, E).$$
Explaining away:
Suppose we know $A = 1$ (alarm goes off).
Burglary and earthquake are both possible causes.
If we now learn that $E = 1$ (there is an earthquake), the probability of burglary decreases:
$$P(B = 1 \mid A = 1, E = 1) < P(B = 1 \mid A = 1).$$
Intuition:
Before knowing $E$, a burglary was a plausible explanation for the alarm.
Once we know there was an earthquake, the alarm is largely “explained” by $E$, so a burglary becomes less likely.
This explaining-away effect is typical for converging connections (called colliders).
Remarks on explaining away¶
Important details emphasized in the slides:
Without conditioning on the effect $A$, the causes $B$ and $E$ can be independent:
$$P(B, E) = P(B)\, P(E).$$
Once we condition on the effect $A$, $B$ and $E$ become dependent:
$$P(B, E \mid A) \neq P(B \mid A)\, P(E \mid A).$$
This induced dependence through a common effect is central to understanding many probabilistic puzzles.
Takeaway:
In Bayesian networks, conditioning on a collider can create dependencies between its parents (explaining away).
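Explaining away can be verified numerically by enumerating the joint of the alarm network. The CPT values below are assumed for illustration (the slides' exact numbers may differ), but the qualitative effect does not depend on them:

```python
# Explaining away in the burglary/earthquake/alarm network,
# with assumed illustrative CPT numbers.
p_b = 0.01  # P(burglary = 1)
p_e = 0.02  # P(earthquake = 1)

def p_alarm(b, e):
    # P(alarm = 1 | b, e): either cause makes the alarm likely.
    return {(0, 0): 0.001, (0, 1): 0.3, (1, 0): 0.9, (1, 1): 0.95}[(b, e)]

def joint(b, e, a):
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_alarm(b, e) if a else 1 - p_alarm(b, e)
    return pb * pe * pa

def conditional_b(evidence):
    """P(B = 1 | evidence), where evidence is a dict over {'e', 'a'}."""
    num = den = 0.0
    for b in (0, 1):
        for e in (0, 1):
            for a in (0, 1):
                if any(v != {"e": e, "a": a}[k] for k, v in evidence.items()):
                    continue
                p = joint(b, e, a)
                den += p
                if b == 1:
                    num += p
    return num / den

p_b_given_a = conditional_b({"a": 1})            # alarm only
p_b_given_ae = conditional_b({"a": 1, "e": 1})   # alarm and earthquake
```

With these numbers, learning about the earthquake drops the burglary probability sharply, even though $B$ and $E$ are marginally independent.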
Medical diagnosis example: cold or allergies?¶
The slides present a small Bayesian network for a toy medical diagnosis problem.
Variables:
$C$: cold,
$A$: allergies,
$H$: cough,
$I$: itchy eyes.
Graph structure:
$C \to H$,
$A \to H$,
$A \to I$.
Factorization of the joint distribution:
$$P(C, A, H, I) = P(C)\, P(A)\, P(H \mid C, A)\, P(I \mid A).$$
Given this model, we can ask questions like:
$P(C = 1 \mid H = 1)$: probability of a cold given a cough.
$P(C = 1 \mid H = 1, I = 1)$: probability of a cold given both cough and itchy eyes.
Typically, we observe an explaining away effect:
Observing itchy eyes makes allergies more likely,
Which in turn reduces the probability that the cough is due to a cold.
Flu virus example and conditional independence¶
Another example in the slides:
Variables:
$T$: ambient temperature,
$V$: presence of flu virus,
$F$: having flu.
Graph: $T \to V \to F$.
Interpretation:
Cold temperature $T$ influences the prevalence of the flu virus $V$,
The presence of the virus influences whether a person gets sick ($F$).
Key conditional independence:
When the virus status $V$ is known, temperature and sickness are conditionally independent:
$$P(F \mid V, T) = P(F \mid V).$$
The chain rule factorization (using the graph) is:
$$P(T, V, F) = P(T)\, P(V \mid T)\, P(F \mid V).$$
From this factorization we can compute:
Joint distributions over $(T, V, F)$,
Marginals like $P(F)$,
Conditionals like $P(T \mid F)$ using marginalization and Bayes’ rule.
For example,
$$P(F = 1) = \sum_{t} \sum_{v} P(T = t)\, P(V = v \mid T = t)\, P(F = 1 \mid V = v).$$
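The chain factorization makes such marginalizations a few lines of code. The distribution numbers below are assumptions for illustration, not values from the slides:

```python
# Temperature -> virus -> flu chain, with assumed illustrative numbers.
p_t = {"cold": 0.4, "warm": 0.6}  # P(T)
p_v = {"cold": 0.5, "warm": 0.1}  # P(V = 1 | T)
p_f = {1: 0.6, 0: 0.01}           # P(F = 1 | V)

def joint(t, v, f):
    # Chain factorization: P(T) * P(V | T) * P(F | V).
    pv = p_v[t] if v else 1 - p_v[t]
    pf = p_f[v] if f else 1 - p_f[v]
    return p_t[t] * pv * pf

# Marginal P(F = 1), summing out T and V.
p_flu = sum(joint(t, v, 1) for t in p_t for v in (0, 1))

def p_f_given_vt(t, v):
    # P(F = 1 | V = v, T = t); should not depend on t once v is known.
    return joint(t, v, 1) / sum(joint(t, v, f) for f in (0, 1))
```

Note that `p_f_given_vt` returns the same value for both temperatures once the virus status is fixed, which is the conditional independence stated above.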
Compactness of Bayesian networks¶
A full joint distribution over $n$ discrete variables typically requires specifying probabilities for all combinations of values.
If each variable has $k$ states, we need about $k^n$ numbers.
Bayesian networks exploit the graph structure to express the joint as
$$P(X_1, \dots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{Pa}(X_i)).$$
Each local distribution involves only a small subset of variables, greatly reducing the number of parameters needed.
Flu example:
$$P(T, V, F) = P(T)\, P(V \mid T)\, P(F \mid V),$$
so we only need:
A table for $P(T)$,
A table for $P(V \mid T)$,
A table for $P(F \mid V)$,
instead of a full table for $P(T, V, F)$.
Benefits:
More compact representation,
Easier elicitation from domain experts,
More efficient computation and inference.
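The parameter savings are easy to quantify. The counts below follow the standard convention that a distribution over $k$ states needs $k - 1$ free parameters per parent configuration; the 10-variable chain is an assumed example showing how the gap grows:

```python
# Parameter counting: full joint vs. factorized chain T -> V -> F,
# all three variables binary.
k_t, k_v, k_f = 2, 2, 2

# Full joint table over (T, V, F): all cells minus the sum-to-one constraint.
full_params = k_t * k_v * k_f - 1

# Factorized: P(T) needs k_t - 1 numbers, P(V | T) needs k_t * (k_v - 1),
# P(F | V) needs k_v * (k_f - 1).
factored_params = (k_t - 1) + k_t * (k_v - 1) + k_v * (k_f - 1)

# For a longer chain X1 -> X2 -> ... -> Xn of binary variables,
# the gap becomes exponential vs. linear.
n = 10
full_chain = 2 ** n - 1        # full joint
factored_chain = 1 + (n - 1) * 2  # one root table + (n-1) conditionals
```

For the flu chain this is 5 numbers instead of 7; for the 10-variable chain it is 19 instead of 1023.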
Probabilistic inference in Bayesian networks¶
General inference problem:
Given:
A Bayesian network defining $P(X_1, \dots, X_n)$,
Evidence $E = e$ (observed values for a subset of variables $E$),
A query set $Q$,
Compute:
$$P(Q \mid E = e),$$
often represented as a probability table over all values of $Q$.
Example query:
In the medical diagnosis network:
$P(C = 1 \mid H = 1, I = 1)$: the probability of a cold given a cough and itchy eyes.
In principle, inference can be done by:
Forming the joint distribution via the product rule.
Summing out (marginalizing) non-query, non-evidence variables.
Normalizing to get conditional probabilities.
However:
Exact inference can be computationally expensive in large networks.
Many specialized algorithms (variable elimination, message passing, approximate methods) have been developed.
The slides focus more on the conceptual structure rather than algorithms, but the key idea is:
Once the graph and local conditional distributions are specified, any probability over variables can, in principle, be computed.
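The three steps above (form the joint, sum out hidden variables, normalize) can be written as one generic enumeration routine. The network representation and the CPT numbers for the cold/allergies example are assumptions for illustration:

```python
from itertools import product

# A network is a dict: var -> (parents, cpt), where cpt maps a tuple of
# parent values to P(var = 1 | parents). All variables are binary (0/1).
def joint_prob(net, assignment):
    p = 1.0
    for var, (parents, cpt) in net.items():
        p1 = cpt[tuple(assignment[pa] for pa in parents)]
        p *= p1 if assignment[var] else 1 - p1
    return p

def query(net, q, evidence):
    """P(q = 1 | evidence), by summing the joint over hidden variables."""
    num = den = 0.0
    hidden = [v for v in net if v != q and v not in evidence]
    for values in product((0, 1), repeat=len(hidden)):
        for qval in (0, 1):
            a = dict(zip(hidden, values), **evidence, **{q: qval})
            p = joint_prob(net, a)
            den += p
            if qval:
                num += p
    return num / den  # normalization step

# Cold/allergies network (C -> H <- A -> I) with assumed CPT numbers.
cold_net = {
    "C": ((), {(): 0.2}),
    "A": ((), {(): 0.3}),
    "H": (("C", "A"), {(0, 0): 0.05, (0, 1): 0.6,
                       (1, 0): 0.7, (1, 1): 0.9}),
    "I": (("A",), {(0,): 0.02, (1,): 0.7}),
}
```

With these numbers, `query(cold_net, "C", {"H": 1, "I": 1})` comes out lower than `query(cold_net, "C", {"H": 1})`: observing itchy eyes explains the cough via allergies, just as the medical example describes. Enumeration is exponential in the number of hidden variables, which is why the specialized algorithms mentioned above exist.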
Probabilistic vs causal Bayesian networks vs SCMs¶
The slides conclude by contrasting three related but distinct notions:
Probabilistic Bayesian network
Arrows into a node $X_i$ indicate that the probability of $X_i$ is described by $P(X_i \mid \mathrm{Pa}(X_i))$.
The network defines a joint distribution purely as a factorizable probability model.
Arrows encode conditional dependencies, but not necessarily causal relations.
Causal Bayesian network
Same graphical structure, but arrows are interpreted causally.
Conditional probabilities represent the distribution of $X_i$ under interventions on its parents:
$$P(X_i \mid \mathrm{Pa}(X_i) = pa_i) = P(X_i \mid \mathrm{do}(\mathrm{Pa}(X_i) = pa_i)).$$
Still specifies probabilities, but with an explicit causal interpretation.
Structural causal model (SCM)
Consists of structural functions $f_i$ and a distribution $P(U)$ over the exogenous variables $U$.
No explicit conditional probability tables at the level of the structural functions.
Causal semantics arise directly from the structural equations, and interventions correspond to modifying them.
In short:
SCMs provide the most detailed causal mechanism.
Causal Bayesian networks provide a graphical abstraction of an SCM.
Probabilistic Bayesian networks may have the same graphical structure but do not automatically encode causality.
Interventions (using Pearl’s do-calculus) and causal effects are treated in more detail in later lectures (Bayesian Networks II/III).
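The difference between the probabilistic and causal readings of the arrows can be previewed numerically. On a confounded network $Z \to X$, $Z \to Y$, $X \to Y$ (all numbers below assumed for illustration), conditioning on $X = 1$ and intervening to set $X = 1$ give different answers:

```python
# Conditioning vs. intervening on a confounded network
# Z -> X, Z -> Y, X -> Y, with assumed illustrative CPTs.
p_z = 0.5                              # P(Z = 1)
p_x_given_z = {0: 0.2, 1: 0.8}         # P(X = 1 | Z)
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5,
                (1, 0): 0.3, (1, 1): 0.7}  # P(Y = 1 | X, Z)

def p_y1_given_x1_observational():
    # Condition on X = 1: Z is reweighted by Bayes' rule,
    # because seeing X = 1 makes Z = 1 more likely.
    num = den = 0.0
    for z in (0, 1):
        pz = p_z if z else 1 - p_z
        pxz = pz * p_x_given_z[z]
        den += pxz
        num += pxz * p_y_given_xz[(1, z)]
    return num / den

def p_y1_do_x1():
    # Intervene: cut the Z -> X edge, keep P(Z) unchanged, set X = 1.
    # This is the backdoor adjustment over the confounder Z.
    return sum((p_z if z else 1 - p_z) * p_y_given_xz[(1, z)]
               for z in (0, 1))
```

With these numbers the observational quantity is 0.62 while the interventional one is 0.5: the observed association overstates the causal effect because $Z$ confounds $X$ and $Y$. The do-calculus treated in the later lectures formalizes when and how such adjustments are possible.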
Exercises¶
This notebook contains the text of Series 1 – Bayesian Networks exercises, converted from the PDF into
Markdown with LaTeX formulas using the $...$ / $$...$$ notation.
You can use this as a starting point to write your own solutions in additional cells.
Problem 1.1 — Simpson’s paradox in batting averages¶
For baseball fans: the table below gives hits and at-bats for two players (David Justice and Derek Jeter) in three seasons.
| Player | 1995 | 1996 | 1997 | All three years |
|---|---|---|---|---|
| David Justice | 104/411 (.253) | 45/140 (.321) | 163/495 (.329) | 312/1046 (.298) |
| Derek Jeter | 12/48 (.250) | 183/582 (.314) | 190/654 (.291) | 385/1284 (.300) |
Note the paradoxical pattern: Justice has the higher batting average in each of the three seasons (1995, 1996, 1997), yet Jeter has the higher batting average when the three seasons are combined. In this exercise you will explain qualitatively how this Simpson phenomenon can arise and (optionally) analyze the data quantitatively with Bayesian models.
(a) How can one player be a worse hitter than the other in 1995, 1996, and 1997 but better over the three-year period? Dissolve the paradox qualitatively.
(b) (Optional) Propose a hierarchical (exchangeable) Beta–Binomial model that
models season-level batting probabilities $\theta_{p,s}$ for player $p$ in season $s$,
allows pooling across seasons (and possibly across players), and
has weakly informative hyperpriors.
Compare the conclusions (probability statements, intervals) from the hierarchical model to the independent per-season and aggregated analyses. Which analysis do you find most persuasive for estimating future batting performance and why?
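The counts below are the commonly cited Justice/Jeter numbers for this example (check them against the table in your exercise sheet); the sketch verifies the paradoxical pattern directly:

```python
# Commonly cited (hits, at-bats) counts for the Justice/Jeter example.
counts = {
    "Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
    "Jeter":   {1995: (12, 48),   1996: (183, 582), 1997: (190, 654)},
}

def avg(hits_ab):
    hits, at_bats = hits_ab
    return hits / at_bats

# Justice has the higher average in every single season ...
for year in (1995, 1996, 1997):
    assert avg(counts["Justice"][year]) > avg(counts["Jeter"][year])

# ... but Jeter has the higher average over the combined three years.
def combined(player):
    hits = sum(counts[player][y][0] for y in counts[player])
    at_bats = sum(counts[player][y][1] for y in counts[player])
    return hits / at_bats
```

The reversal comes from the very unequal numbers of at-bats per season, which act like the subgroup sizes in the other Simpson examples.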
Problem 1.2 — Aggregated vs segregated data¶
In each of the following scenarios, you are asked to decide whether the aggregate data (pooled across all subgroups) or the segregated data (analyzed separately by relevant subgroups) should be used to infer the true causal effect.
Provide a short justification for each answer, explaining which type of data better represents the causal relationship of interest.
(a) Kidney stone treatments I.
In an observational study published in 1996, open surgery to remove kidney stones had a better success rate than
endoscopic surgery for small kidney stones. It also had a better success rate for large kidney stones. However, it had
a lower success rate overall. Dissolve the paradox.
(b) Kidney stone treatments II.
There are two treatments used on kidney stones: Treatment A and Treatment B. Doctors are more likely to use Treatment A
on large (and therefore more severe) stones and more likely to use Treatment B on small stones. Should a patient who
doesn’t know the size of his or her stone examine the general population data or the stone size-specific data
when deciding which treatment they would like to request?
(c) Smoking and thyroid disease survival.
A 1995 study on thyroid disease reported that smokers had a higher twenty-year survival rate (76%) than nonsmokers (69%).
However, when survival rates were analyzed within seven age groups, nonsmokers had higher survival in six of the seven
groups, with only a minimal difference in the remaining one. To assess the causal effect of smoking on survival, should
the analysis be based on the overall (aggregate) survival rates or on the age-specific (segregated) data?
(d) Surgical performance of two doctors.
In a small town, two doctors have each performed 100 surgeries, divided into two categories: one very difficult surgery
and one very easy surgery. Doctor 1 performs mostly easy surgeries, while Doctor 2 performs mostly difficult ones.
You need surgery but do not know whether your case will be easy or difficult. To maximize your chances of a successful
operation, should you compare the doctors’ overall (aggregate) success rates, or their success rates within each type
of surgery (segregated data)?
Problem 1.3 — Simpson’s reversal with a lollipop¶
In an attempt to estimate the effectiveness of a new drug, a randomized experiment is conducted. Half of the patients (50%) are assigned to receive the new drug, and the remaining half (50%) are assigned to receive a placebo.
A day before the actual experiment, a nurse distributes lollipops to some patients who show signs of depression. By coincidence, most of these patients happen to be in the treatment-bound ward—that is, among those who will receive the new drug the next day.
When the experiment is analyzed, an unexpected pattern emerges: a Simpson’s reversal. Although the drug appears beneficial to the population as a whole, within both subgroups (lollipop receivers and nonreceivers) drug takers are less likely to recover than nontakers.
Assume that receiving and sucking a lollipop has no direct effect whatsoever on recovery.
Using this setup, answer the following questions:
(a) Is the drug beneficial to the population as a whole or harmful?
(b) Does your answer contradict the gender example discussed in class, where sex-specific data were deemed more appropriate for determining the causal effect?
(c) Draw (informally) a causal graph that captures the essential structure of the story.
(d) Explain how Simpson’s reversal arises in this scenario. What roles do the variables lollipop, treatment assignment, and recovery play?
(e) Would your explanation change if the lollipops were handed out (according to the same criterion) after the study rather than before?
Hint. Receiving a lollipop is an indicator of two things:
a higher likelihood of being assigned to the drug treatment group, and
a higher likelihood of depression, which in turn is associated with a lower probability of recovery.
Use this information to reason about the direction of confounding and the emergence of the Simpson’s reversal.
Problem 1.4 — The Monty Hall problem¶
In the late 1980s, a writer named Marilyn vos Savant started a regular column in Parade magazine, a weekly supplement to the Sunday newspaper in many U.S. cities. Her column, Ask Marilyn, continues to this day and features her answers to various puzzles, brainteasers, and scientific questions submitted by readers. The magazine billed her as “the world’s smartest woman,” which undoubtedly motivated readers to come up with a question that would stump her.
Of all the questions she ever answered, none created a greater furor than this one, which appeared in a column in September 1990:
“Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, ‘Do you want to pick door #2?’ Is it to your advantage to switch your choice of doors?”
For American readers, the question was obviously based on a popular televised game show called Let’s Make a Deal, whose host, Monty Hall, used to play precisely this sort of mind game with the contestants. In her answer, vos Savant argued that contestants should switch doors. By not switching, they would have only a one-in-three probability of winning; by switching, they would double their chances to two in three.
Even the smartest woman in the world could never have anticipated what happened next. Over the next few months, she received more than 10,000 letters from readers, most of them disagreeing with her, and many of them from people who claimed to have PhDs in mathematics or statistics. A small sample of the comments from academics includes:
“You blew it, and you blew it big!” (Scott Smith, PhD)
“May I suggest that you obtain and refer to a standard textbook on probability before you try to answer a question of this type again?” (Charles Reid, PhD)
“You blew it!” (Robert Sachs, PhD)
“You are utterly incorrect.” (Ray Bobo, PhD)
In general, the critics argued that it shouldn’t matter whether you switch doors or not—there are only two doors left in the game, and you have chosen your door completely at random, so the probability that the car is behind your door must be one-half either way.
(a) Who was right? Who was wrong? And why does the problem incite such passion? Provide a qualitative explanation by considering the following table:
| Door 1 | Door 2 | Door 3 | Outcome if you switch | Outcome if you stay |
|---|---|---|---|---|
| Car | Goat | Goat | Lose | Win |
| Goat | Car | Goat | Win | Lose |
| Goat | Goat | Car | Win | Lose |
The table shows that switching doors is twice as attractive as not switching.
(b) Prove, using Bayes’ theorem, that switching doors improves your chances of winning the car in the Monty Hall problem.
(c) Define the structural model that corresponds to the Monty Hall problem, and use it to describe the joint distribution of all variables.
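As a sanity check on part (a), the game is easy to simulate (a Monte Carlo sketch; the door-numbering and host behavior follow the problem statement, with the host always opening a goat door other than the contestant's pick):

```python
import random

def play(switch, rng):
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a goat door that is neither the pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

rng = random.Random(42)
n = 100_000
wins_switch = sum(play(True, rng) for _ in range(n)) / n
wins_stay = sum(play(False, rng) for _ in range(n)) / n
```

The switching strategy wins close to 2/3 of the time and staying close to 1/3, matching vos Savant's answer.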
Problem 1.5 — Small Bayesian network calculations¶
Consider the following Bayesian network containing four Boolean random variables :
and have no parents,
has parent ,
has parents and .
The conditional probability tables are:
and for :
(a) Compute .
Choose one:
A. 0.216
B. 0.054
C. 0.024
D. 0.006
E. None of the above
(b) Compute .
Choose one:
A. 0.0315
B. 0.0855
C. 0.368
D. 0.583
E. None of the above
(c) True or False: The Bayesian network associated with the computation
has edges , , , , and no other edges.
(d) True or False: The product
corresponds to a valid Bayesian network over .
Problem 1.6 — Constructing and reading a Bayesian network¶
(a) Given the tables below, draw a minimal representative Bayesian network for this model. Be sure to label all nodes and the directionality of the edges.
Marginal for $A$:
| $A$ | $P(A)$ |
|---|---|
| $a$ | 0.1 |
| $\neg a$ | 0.9 |
Conditional for $B$ given $A$:
| $A$ | $B$ | $P(B \mid A)$ |
|---|---|---|
| $a$ | $b$ | 0.7 |
| $a$ | $\neg b$ | 0.3 |
| $\neg a$ | $b$ | 0.5 |
| $\neg a$ | $\neg b$ | 0.5 |
Conditional for $C$ given $A$:
| $A$ | $C$ | $P(C \mid A)$ |
|---|---|---|
| $a$ | $c$ | 0.7 |
| $a$ | $\neg c$ | 0.3 |
| $\neg a$ | $c$ | 0.8 |
| $\neg a$ | $\neg c$ | 0.2 |
Conditional for $D$ given $B$ and $C$:
| $B$ | $C$ | $D$ | $P(D \mid B, C)$ |
|---|---|---|---|
| $b$ | $c$ | $d$ | 0.9 |
| $b$ | $c$ | $\neg d$ | 0.1 |
| $b$ | $\neg c$ | $d$ | 0.8 |
| $b$ | $\neg c$ | $\neg d$ | 0.2 |
| $\neg b$ | $c$ | $d$ | 0.6 |
| $\neg b$ | $c$ | $\neg d$ | 0.4 |
| $\neg b$ | $\neg c$ | $d$ | 0.1 |
| $\neg b$ | $\neg c$ | $\neg d$ | 0.9 |
(b) Compute the following probabilities:
Here “$a$” means $A = \text{true}$, “$\neg a$” means $A = \text{false}$, etc.
(c) Which of the following conditional independencies are guaranteed by the above network?
Here $X \perp Y \mid Z$ means that $X$ and $Y$ are conditionally independent given $Z$.
Problem 1.7 — Simpson’s reversal with a fatal syndrome¶
Assume that a population of patients contains a fraction $r$ of individuals who suffer from a certain fatal syndrome $S$, which simultaneously makes it uncomfortable for them to take a life-prolonging drug (see Figure 1 in the exercise sheet).
Let
$s$ and $\bar{s}$ represent, respectively, the presence and absence of the syndrome,
$d$ and $\bar{d}$ represent death and survival, respectively,
$t$ and $\bar{t}$ represent taking and not taking the drug.
Assume that patients not carrying the syndrome ($\bar{s}$) die with probability $p_1$ if they take the drug and with probability $p_2$ if they do not. Patients carrying the syndrome ($s$), on the other hand, die with probability $p_3$ if they do not take the drug and with probability $p_4$ if they do take the drug.
Further, patients having the syndrome are more likely to avoid the drug, with probabilities
$$P(t \mid s) = q_1, \qquad P(t \mid \bar{s}) = q_2, \qquad q_1 < q_2.$$
(a) Based on this model, compute the joint distribution $P(s, t, d)$ and the marginals $P(t, d)$, $P(t)$, and $P(d)$ for all values of $s$, $t$, $d$, in terms of the parameters $r, p_1, \dots, p_4, q_1, q_2$.
Hint. Decompose the product using the graph structure.
(b) Calculate the difference
$$P(d \mid t) - P(d \mid \bar{t})$$
for three populations:
those carrying the syndrome,
those not carrying the syndrome,
the population as a whole.
(c) Using your results from part (b), find a combination of parameters that exhibits Simpson’s reversal (i.e. the drug appears beneficial in each subgroup but harmful, or vice versa, in the population as a whole).