
Bayesian Networks - Causal Graphs

Causal AI and the limits of pure prediction

Many impressive machine learning systems (including large language models) excel at prediction:

  • Given $X$, they can accurately predict $Y$.

However, prediction alone is not sufficient when we care about:

  • Interventions: What happens if we change $X$?

  • Policy decisions: Should we give a drug? Launch a campaign? Change a workflow?

  • Fairness and bias: Is a model discriminating against a particular group?

  • Explanation and trust: Why did this happen? What are the causes?

A central message of the slides:

Causality is the missing link between pattern recognition and reasoning.

Causal questions require knowledge about how variables influence each other, not just how they correlate in data.

Causality vs. correlation and Simpson’s paradox

The slides use examples like a drug study and exercise–cholesterol relationships to show:

  • Purely observational statistics can lead to paradoxes.

  • Aggregated data can suggest one conclusion, while stratified data suggest the opposite.

A classic example is Simpson’s paradox:

  • A treatment seems worse overall, but better for each subgroup (e.g. men and women separately).

  • The paradox is resolved when we realize that a confounder (e.g. gender) influences both:

    • treatment assignment (who gets the drug), and

    • outcome (recovery).

Key idea:

  • A confounder is a variable that is a common cause of both treatment and outcome.

  • To correctly assess the causal effect of a treatment, we must control for confounders.

Informally:

Causality is about understanding which variables produce changes in others, not just which ones move together.

Informal definition of causation

The slides give an intuitive definition:

  • A variable $Y$ is a direct cause of a variable $X$ if $Y$ appears in the function that assigns $X$’s value.

  • A variable $Y$ is a cause of $X$ if it is a direct cause of $X$, or a cause of some other variable that causes $X$.

In other words:

  • If changing $Y$ (while holding all other relevant inputs fixed) would change $X$, then $Y$ is a cause of $X$.

  • Causal relations are fundamentally about what would happen under interventions, not just observed co-occurrence.

This motivates a more formal framework: structural causal models.

Structural Causal Models (SCMs)

A structural causal model (SCM) describes how variables in a system are generated from exogenous noise and other variables via structural equations.

Formal definition:

A structural causal model

$$M = \langle V, U, F, P(u) \rangle$$

consists of:

  • $V = \{V_1,\dots,V_n\}$: endogenous variables

    • Determined within the model.

    • Each $V_i$ has at least one cause (either exogenous or endogenous).

  • $U = \{U_1,\dots,U_m\}$: exogenous variables

    • External to the model; we do not model their causes.

    • Represent background factors and noise.

  • $F = \{f_1,\dots,f_n\}$: structural functions

    • Each $V_i$ is determined by a function

      $$v_i = f_i(\text{pa}_i, u_i),$$

      where:

      • $\text{pa}_i \subseteq V$ are the endogenous parents (causes) of $V_i$,

      • $u_i \subseteq U$ are the exogenous parents of $V_i$.

  • $P(u)$: joint probability distribution over the exogenous variables $U$.

Assumption in the slides:

  • The model is recursive (or acyclic): there are no feedback loops or cycles such that a variable is ultimately a cause of itself.

Example of structural equations

Example from the slides: exercise, diet, and fitness level.

Variables:

  • $X$: amount of exercise,

  • $Y$: diet quality,

  • $Z$: fitness level.

Exogenous variables:

  • $U_X$: external motivation affecting exercise,

  • $U_Y$: education / culture affecting diet,

  • $U_Z$: genetic predisposition and other unobserved factors affecting fitness.

Structural equations:

$$X = f_X(U_X), \qquad Y = f_Y(U_Y), \qquad Z = f_Z(X, Y, U_Z).$$

Interpretation:

  • $X$ and $Y$ are causes of $Z$ because they appear in the structural function $f_Z$.

  • $U_X$, $U_Y$, and $U_Z$ are exogenous and are not caused by any variable in $V$.
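To make the definition concrete, here is a minimal Python sketch of this SCM; the specific functional forms and noise distributions are illustrative assumptions, not given in the slides.

```python
# A minimal sketch of the exercise-diet-fitness SCM. The concrete
# functional forms and noise distributions are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def sample_scm(n):
    # Exogenous variables, drawn from an assumed P(u).
    u_x = rng.normal(size=n)      # external motivation
    u_y = rng.normal(size=n)      # education / culture
    u_z = rng.normal(size=n)      # genetics and other unobserved factors
    # Structural equations: each endogenous variable is a function of
    # its parents and its exogenous input.
    x = u_x                       # X = f_X(U_X)
    y = u_y                       # Y = f_Y(U_Y)
    z = 0.8 * x + 0.5 * y + u_z   # Z = f_Z(X, Y, U_Z)
    return x, y, z

x, y, z = sample_scm(10_000)
# X is a cause of Z (it appears in f_Z), so the two are dependent:
print(np.corrcoef(x, z)[0, 1])
```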

Structural Equation Models (SEMs)

A structural equation model (SEM) is a special case of an SCM with:

  • A fixed causal ordering of variables.

  • Linear structural functions.

  • Typically Gaussian exogenous noise.

For example, with variables $X, Y, Z$ and exogenous errors $\varepsilon_X, \varepsilon_Y, \varepsilon_Z$:

$$Z = \beta_{Z0} + \varepsilon_Z, \qquad X = \beta_{X0} + \beta_{XZ} Z + \varepsilon_X, \qquad Y = \beta_{Y0} + \beta_{YZ} Z + \beta_{YX} X + \varepsilon_Y.$$

Assumptions:

  • Exogenous errors are often taken to be independent Gaussians:

    $$\varepsilon_X, \varepsilon_Y, \varepsilon_Z \sim \mathcal{N}(0,\Sigma),$$

    where $\Sigma$ is typically diagonal.

SEMs are widely used in fields like psychology and economics, where linear relationships and Gaussian noise are reasonable approximations.
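As a small illustration, the following sketch simulates a linear-Gaussian SEM of this shape; the coefficient values are arbitrary assumptions. The last line shows that the naive regression slope of $Y$ on $X$ differs from the direct coefficient $\beta_{YX}$, because $Z$ influences both variables.

```python
# Simulating a linear-Gaussian SEM with causal order Z -> X -> Y.
# All coefficients are arbitrary assumed values for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

eps_z = rng.normal(0.0, 1.0, n)   # independent Gaussian errors
eps_x = rng.normal(0.0, 1.0, n)   # (diagonal Sigma)
eps_y = rng.normal(0.0, 1.0, n)

beta_xz, beta_yz, beta_yx = 1.5, -0.5, 2.0   # assumed coefficients

z = eps_z                          # Z = beta_Z0 + eps_Z (intercepts set to 0)
x = beta_xz * z + eps_x            # X = beta_X0 + beta_XZ Z + eps_X
y = beta_yz * z + beta_yx * x + eps_y

# The slope of Y on X alone differs from beta_yx because Z also
# influences both X and Y (confounding through Z).
print(np.polyfit(x, y, 1)[0], "vs direct coefficient", beta_yx)
```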

Graphical causal models

Every SCM can be associated with a graphical causal model (a directed graph):

  • Nodes: variables in $U$ and $V$.

  • Directed edges: from each parent to its child, according to the structural equations.

Graph construction:

  • For each structural equation

    $$V_i = f_i(\text{pa}_i, u_i),$$

    draw directed edges from each variable in $\text{pa}_i$ and $u_i$ to $V_i$.

Properties:

  • Exogenous variables ($U$) appear as root nodes (no parents).

  • Endogenous variables ($V$) are descendants of at least one exogenous variable.

  • In models considered here, the graph is a directed acyclic graph (DAG):

    • No cycles or self-loops.

    • You cannot follow directed edges and return to the same node.

Graphical definition of causation:

  • If there is a directed edge $Y \to X$, then $Y$ is a direct cause of $X$.

  • If there is a directed path from $Y$ to $X$, then $Y$ is a potential cause of $X$.

Basics of graph terminology

Some basic concepts from graph theory used in the slides:

  • A graph consists of nodes (vertices) and edges (links).

  • Edges can be:

    • Undirected: no arrowheads, just a line between nodes.

    • Directed: arrows indicating direction.

In a directed graph:

  • If there is a directed edge $Y \to X$:

    • $Y$ is a parent of $X$,

    • $X$ is a child of $Y$.

  • If there is a directed path $Y \to \dots \to X$:

    • $Y$ is an ancestor of $X$,

    • $X$ is a descendant of $Y$.

A Directed Acyclic Graph (DAG) is a directed graph with no cycles:

  • You cannot start at a node, follow directed edges, and come back to the same node.

  • DAGs are the graph type used in Bayesian networks and many causal models.

Why use graphical models instead of raw SCM equations?

SCMs specify exact functional relationships $V_i = f_i(\text{pa}_i, u_i)$, which can be difficult to elicit and manipulate directly.

Graphical models offer:

  • A qualitative description of causal structure (who causes whom),

  • A compact way to encode conditional independencies between variables,

  • A way to factorize joint probability distributions into simpler pieces,

  • An intuitive visual language for communicating assumptions.

Typically:

  • We first think in terms of graphs / DAGs,

  • Then attach local probability distributions (Bayesian networks),

  • And sometimes further refine to structural equations when functional forms are needed.

Bayesian networks: definition

Let $X = (X_1,\dots,X_n)$ be random variables.

A Bayesian network is:

  • A DAG whose nodes are the variables $X_1,\dots,X_n$.

  • For each node $X_i$, a local conditional distribution

    $$p(x_i \mid x_{\text{Parents}(i)}),$$

    where $\text{Parents}(i)$ are the parents of $X_i$ in the DAG.

The Bayesian network defines a joint distribution over all variables by the product

$$P(X_1 = x_1,\dots,X_n = x_n) = \prod_{i=1}^n p(x_i \mid x_{\text{Parents}(i)}).$$

Key points:

  • For nodes with no parents, $p(x_i \mid x_{\text{Parents}(i)})$ is just a marginal $p(x_i)$.

  • For nodes with parents, we specify conditional probabilities given the parent configuration.

  • The arrows in a probabilistic Bayesian network encode conditional dependence structure (not automatically causality).

The slides distinguish later between probabilistic Bayesian networks and causal Bayesian networks.

Explaining away with Bayesian networks

A classic structure from the slides:

  • $B$: burglary (0/1),

  • $E$: earthquake (0/1),

  • $A$: alarm (0/1).

Graph:

$$B \to A \leftarrow E.$$

Local distributions:

  • $p(b)$ and $p(e)$ for burglary and earthquake.

  • $p(a \mid b,e)$ for the alarm, which depends on both $B$ and $E$.

Joint factorization:

$$P(B=b, E=e, A=a) = p(b)\, p(e)\, p(a \mid b,e).$$

Explaining away:

  • Suppose we know $A=1$ (alarm goes off).

  • Burglary and earthquake are both possible causes.

  • If we now learn that $E=1$ (there is an earthquake), the probability of burglary decreases:

    $$P(B=1 \mid A=1, E=1) < P(B=1 \mid A=1).$$

Intuition:

  • Before knowing $E$, a burglary was a plausible explanation for the alarm.

  • Once we know there was an earthquake, the alarm is largely “explained” by $E$, so a burglary becomes less likely.

This explaining-away effect is typical of converging connections (called colliders) $B \to A \leftarrow E$.
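The effect can be verified numerically by brute-force enumeration of the joint factorization; the prior and alarm probabilities below are assumed toy values (the slides give no concrete numbers).

```python
# Checking explaining away in B -> A <- E by enumerating the joint
# P(b, e, a) = p(b) p(e) p(a | b, e). All probabilities are assumed toy values.
from itertools import product

p_b, p_e = 0.01, 0.02                               # priors P(B=1), P(E=1)
p_a = {(0, 0): 0.001, (0, 1): 0.30,                 # P(A=1 | B=b, E=e)
       (1, 0): 0.95,  (1, 1): 0.98}

def joint(b, e, a):
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    return pb * pe * pa

def prob_b_given(**evidence):
    """P(B=1 | evidence) by summing the joint over unobserved variables."""
    num = den = 0.0
    for b, e, a in product([0, 1], repeat=3):
        assign = {"b": b, "e": e, "a": a}
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        den += joint(b, e, a)
        if b == 1:
            num += joint(b, e, a)
    return num / den

print(prob_b_given(a=1))        # ~0.58: burglary is a plausible explanation
print(prob_b_given(a=1, e=1))   # ~0.03: the earthquake explains the alarm away
```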

Remarks on explaining away

Important details emphasized in the slides:

  • Without conditioning on the effect $A$, the causes $B$ and $E$ can be independent:

    $$P(B \mid E) = P(B).$$

  • Once we condition on the effect $A=1$, $B$ and $E$ become dependent:

    $$P(B \mid A=1, E=1) \ne P(B \mid A=1).$$

  • This induced dependence through a common effect is central to understanding many probabilistic puzzles.

Takeaway:

In Bayesian networks, conditioning on a collider can create dependencies between its parents (explaining away).

Medical diagnosis example: cold or allergies?

The slides present a small Bayesian network for a toy medical diagnosis problem.

Variables:

  • $C$: cold,

  • $A$: allergies,

  • $H$: cough,

  • $I$: itchy eyes.

Graph structure:

  • $C \to H$,

  • $A \to H$,

  • $A \to I$.

Factorization of the joint distribution:

$$P(C=c, A=a, H=h, I=i) = p(c)\, p(a)\, p(h \mid c,a)\, p(i \mid a).$$

Given this model, we can ask questions like:

  • $P(C=1 \mid H=1)$: probability of a cold given a cough.

  • $P(C=1 \mid H=1, I=1)$: probability of a cold given both cough and itchy eyes.

Typically, we observe an explaining away effect:

  • Observing itchy eyes makes allergies more likely,

  • Which in turn reduces the probability that the cough is due to a cold.

Flu virus example and conditional independence

Another example in the slides:

Variables:

  • $T \in \{\text{cold},\text{hot}\}$: ambient temperature,

  • $V \in \{\text{yes},\text{no}\}$: presence of flu virus,

  • $F \in \{\text{sick},\neg\text{sick}\}$: having flu.

Graph:

$$T \to V \to F.$$

Interpretation:

  • Cold temperature $T$ influences the prevalence of the flu virus $V$,

  • The presence of the virus $V$ influences whether a person gets sick $F$.

Key conditional independence:

  • When the virus status $V$ is known, temperature and sickness are conditionally independent:

    $$T \perp F \mid V.$$

The chain rule factorization (using the graph) is:

$$P(F=f, V=v, T=t) = p(f \mid v)\, p(v \mid t)\, p(t).$$

From this factorization we can compute:

  • Joint distributions over $(F,V,T)$,

  • Marginals like $P(F=f, T=t)$,

  • Conditionals like $p(t \mid f)$, using marginalization and Bayes’ rule.

For example,

$$p(t \mid f) = \frac{P(f,t)}{P(f)} = \frac{\sum_v P(f,v,t)}{\sum_{t'} \sum_v P(f,v,t')}.$$
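As a worked illustration of this formula, the sketch below evaluates $p(t \mid f)$ by enumerating the factorized joint; the table entries are assumed toy values, since the slides do not fix them.

```python
# Evaluating p(t | f) in the chain T -> V -> F by marginalization and
# Bayes' rule, following the formula above. All numbers are assumed toy values.
p_t = {"cold": 0.4, "hot": 0.6}     # p(t)
p_v = {"cold": 0.6, "hot": 0.1}     # p(V=yes | t)
p_f = {"yes": 0.8, "no": 0.05}      # p(F=sick | v)

def joint(f_sick, v, t):
    """P(f, v, t) = p(f | v) p(v | t) p(t)."""
    pv = p_v[t] if v == "yes" else 1 - p_v[t]
    pf = p_f[v] if f_sick else 1 - p_f[v]
    return p_t[t] * pv * pf

# p(T=cold | F=sick) = sum_v P(sick, v, cold) / sum_{t', v} P(sick, v, t')
num = sum(joint(True, v, "cold") for v in ["yes", "no"])
den = sum(joint(True, v, t) for v in ["yes", "no"] for t in ["cold", "hot"])
print(num / den)
```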

Compactness of Bayesian networks

A full joint distribution over nn discrete variables typically requires specifying probabilities for all combinations of values.

  • If each variable has $k$ states, we need about $k^n$ numbers.

Bayesian networks exploit the graph structure to express the joint as

$$P(X_1,\dots,X_n) = \prod_{i=1}^n p(X_i \mid \text{Parents}(i)).$$

Each local distribution involves only a small subset of variables, greatly reducing the number of parameters needed.

Flu example:

$$P(F,V,T) = p(F \mid V)\, p(V \mid T)\, p(T),$$

so we only need:

  • A table for $p(T)$,

  • A table for $p(V \mid T)$,

  • A table for $p(F \mid V)$,

instead of a full table for $P(F,V,T)$.

Benefits:

  • More compact representation,

  • Easier elicitation from domain experts,

  • More efficient computation and inference.
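As a rough count for the flu network (a small sketch, assuming every variable is binary, i.e. $k = 2$):

```python
# Rough parameter count for the flu network, assuming binary variables.
# A full joint over n variables with k states needs k**n - 1 free numbers;
# each CPT p(X | parents) needs (k - 1) * k**num_parents free numbers.
k, n = 2, 3
full_joint = k**n - 1                    # full table for P(F, V, T)
bn = (k - 1) * (k**0 + k**1 + k**1)      # tables for p(T), p(V|T), p(F|V)
print(full_joint, "vs", bn)              # 7 vs 5; the gap explodes as n grows
```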

Probabilistic inference in Bayesian networks

General inference problem:

  • Given:

    • A Bayesian network defining $P(X_1,\dots,X_n)$,

    • Evidence $E=e$ (observed values for a subset of variables $E \subset X$),

    • A query set $Q \subset X$,

  • Compute:

    $$P(Q \mid E=e),$$

    often represented as a probability table over all values of $Q$.

Example query:

  • In the medical diagnosis network:

    $$P(C \mid H=1, I=1),$$

    the probability of a cold given a cough and itchy eyes.

In principle, inference can be done by:

  1. Forming the joint distribution via the product rule.

  2. Summing out (marginalizing) non-query, non-evidence variables.

  3. Normalizing to get conditional probabilities.

However:

  • Exact inference can be computationally expensive in large networks.

  • Many specialized algorithms (variable elimination, message passing, approximate methods) have been developed.

The slides focus more on the conceptual structure rather than algorithms, but the key idea is:

Once the graph and local conditional distributions are specified, any probability over variables can, in principle, be computed.
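To tie the three steps together, here is a minimal inference-by-enumeration sketch applied to the example query $P(C \mid H=1, I=1)$ from the diagnosis network; only the graph structure comes from the slides, while all CPT numbers are assumed toy values.

```python
# Inference by enumeration for the diagnosis network, following steps 1-3:
# form the joint, sum out hidden variables, normalize.
# The CPT numbers are assumed toy values; only the structure is from above.
from itertools import product

def make_cpt(p_true):
    """Expand p(X=1 | parent values) into a full table over X in {0, 1}."""
    table = {}
    for pa, p in p_true.items():
        table[(1, pa)] = p
        table[(0, pa)] = 1 - p
    return table

# Node -> (parents, CPT); all variables are binary (0/1).
cpts = {
    "C": ((), make_cpt({(): 0.1})),
    "A": ((), make_cpt({(): 0.2})),
    "H": (("C", "A"), make_cpt({(0, 0): 0.05, (0, 1): 0.7,
                                (1, 0): 0.8,  (1, 1): 0.9})),
    "I": (("A",), make_cpt({(0,): 0.02, (1,): 0.6})),
}

def query(target, evidence):
    """Return P(target | evidence) as a dict over {0, 1}."""
    names = list(cpts)
    scores = {0: 0.0, 1: 0.0}
    for values in product([0, 1], repeat=len(names)):
        assign = dict(zip(names, values))
        if any(assign[k] != v for k, v in evidence.items()):
            continue
        p = 1.0  # step 1: product of local conditionals
        for name, (parents, table) in cpts.items():
            p *= table[(assign[name], tuple(assign[q] for q in parents))]
        scores[assign[target]] += p  # step 2: sum out hidden variables
    z = sum(scores.values())
    return {v: s / z for v, s in scores.items()}  # step 3: normalize

print(query("C", {"H": 1, "I": 1}))  # P(C | H=1, I=1)
```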

Probabilistic vs causal Bayesian networks vs SCMs

The slides conclude by contrasting three related but distinct notions:

  1. Probabilistic Bayesian network

    • Arrows into a node $Y$ indicate that the probability of $Y$ is described by

      $$p(y \mid \text{Parents}(Y)).$$

    • The network defines a joint distribution purely as a factorizable probability model.

    • Arrows encode conditional dependencies, but not necessarily causal relations.

  2. Causal Bayesian network

    • Same graphical structure, but arrows are interpreted causally.

    • Conditional probabilities represent the distribution of $Y$ under interventions on its parents:

      $$p(y \mid \text{do}(\text{Parents}(Y))).$$

    • Still specifies probabilities, but with an explicit causal interpretation.

  3. Structural causal model (SCM)

    • Consists of structural functions $V_i = f_i(\text{pa}_i, u_i)$ and a distribution over $U$.

    • No explicit conditional probability tables at the level of the structural functions.

    • Causal semantics arise directly from the structural equations, and interventions correspond to modifying them.

In short:

  • SCMs provide the most detailed causal mechanism.

  • Causal Bayesian networks provide a graphical abstraction of an SCM.

  • Probabilistic Bayesian networks may have the same graphical structure but do not automatically encode causality.

Interventions (using Pearl’s do-calculus) and causal effects are treated in more detail in later lectures (Bayesian Networks II/III).

Exercises

This notebook contains the text of Series 1 – Bayesian Networks exercises, converted from the PDF into Markdown with LaTeX formulas using the $...$ / $$...$$ notation.

You can use this as a starting point to write your own solutions in additional cells.

Problem 1.1 — Simpson’s paradox in batting averages

For baseball fans: the table below gives hits and at-bats for two players (David Justice and Derek Jeter) in three seasons.

| Player | 1995 | 1996 | 1997 | All three years |
|---|---|---|---|---|
| David Justice | $104/411 \approx 0.253$ | $45/140 \approx 0.321$ | $163/495 \approx 0.329$ | $312/1046 \approx 0.298$ |
| Derek Jeter | $12/48 \approx 0.250$ | $183/582 \approx 0.314$ | $190/654 \approx 0.291$ | $385/1284 \approx 0.300$ |

Note the paradoxical pattern: Justice has the higher batting average in each of the three seasons (1995, 1996, 1997), yet Jeter has the higher batting average when the three seasons are combined. In this exercise you will explain qualitatively how this Simpson phenomenon can arise and (optionally) analyze the data quantitatively with Bayesian models.
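As a quick numeric check of this pattern (a Python sketch, separate from the exercise itself):

```python
# Verifying the Simpson pattern in the batting table with exact fractions.
from fractions import Fraction

justice = {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)}  # (hits, at-bats)
jeter   = {1995: (12, 48),   1996: (183, 582), 1997: (190, 654)}

for year in justice:
    print(year, Fraction(*justice[year]) > Fraction(*jeter[year]))  # True each season

def pooled(d):
    """Combined batting average over all seasons."""
    return Fraction(sum(h for h, _ in d.values()), sum(n for _, n in d.values()))

print(float(pooled(justice)), float(pooled(jeter)))  # ~0.298 < ~0.300 pooled
```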

(a) How can one player be a worse hitter than the other in 1995, 1996, and 1997 but better over the three-year period? Dissolve the paradox qualitatively.

(b) (Optional) Propose a hierarchical (exchangeable) Beta–Binomial model that

  1. models season-level batting probabilities $\theta_{p,s}$,

  2. allows pooling across seasons (and possibly across players), and

  3. has weakly informative hyperpriors.

Compare the conclusions (probability statements, intervals) from the hierarchical model to the independent per-season and aggregated analyses. Which analysis do you find most persuasive for estimating future batting performance and why?

Problem 1.2 — Aggregated vs segregated data

In each of the following scenarios, you are asked to decide whether the aggregate data (pooled across all subgroups) or the segregated data (analyzed separately by relevant subgroups) should be used to infer the true causal effect.

Provide a short justification for each answer, explaining which type of data better represents the causal relationship of interest.

(a) Kidney stone treatments I.
In an observational study published in 1996, open surgery to remove kidney stones had a better success rate than endoscopic surgery for small kidney stones. It also had a better success rate for large kidney stones. However, it had a lower success rate overall. Dissolve the paradox.

(b) Kidney stone treatments II.
There are two treatments used on kidney stones: Treatment A and Treatment B. Doctors are more likely to use Treatment A on large (and therefore more severe) stones and more likely to use Treatment B on small stones. Should a patient who doesn’t know the size of his or her stone examine the general population data or the stone size-specific data when deciding which treatment they would like to request?

(c) Smoking and thyroid disease survival.
A 1995 study on thyroid disease reported that smokers had a higher twenty-year survival rate (76%) than nonsmokers (69%). However, when survival rates were analyzed within seven age groups, nonsmokers had higher survival in six of the seven groups, with only a minimal difference in the remaining one. To assess the causal effect of smoking on survival, should the analysis be based on the overall (aggregate) survival rates or on the age-specific (segregated) data?

(d) Surgical performance of two doctors.
In a small town, two doctors have each performed 100 surgeries, divided into two categories: one very difficult surgery and one very easy surgery. Doctor 1 performs mostly easy surgeries, while Doctor 2 performs mostly difficult ones. You need surgery but do not know whether your case will be easy or difficult. To maximize your chances of a successful operation, should you compare the doctors’ overall (aggregate) success rates, or their success rates within each type of surgery (segregated data)?

Problem 1.3 — Simpson’s reversal with a lollipop

In an attempt to estimate the effectiveness of a new drug, a randomized experiment is conducted. Half of the patients (50%) are assigned to receive the new drug, and the remaining half (50%) are assigned to receive a placebo.

A day before the actual experiment, a nurse distributes lollipops to some patients who show signs of depression. By coincidence, most of these patients happen to be in the treatment-bound ward—that is, among those who will receive the new drug the next day.

When the experiment is analyzed, an unexpected pattern emerges: a Simpson’s reversal. Although the drug appears beneficial to the population as a whole, within both subgroups (lollipop receivers and nonreceivers) drug takers are less likely to recover than nontakers.

Assume that receiving and sucking a lollipop has no direct effect whatsoever on recovery.

Using this setup, answer the following questions:

(a) Is the drug beneficial to the population as a whole or harmful?

(b) Does your answer contradict the gender example discussed in class, where sex-specific data were deemed more appropriate for determining the causal effect?

(c) Draw (informally) a causal graph that captures the essential structure of the story.

(d) Explain how Simpson’s reversal arises in this scenario. What roles do the variables lollipop, treatment assignment, and recovery play?

(e) Would your explanation change if the lollipops were handed out (according to the same criterion) after the study rather than before?

Hint. Receiving a lollipop is an indicator of two things:

  • a higher likelihood of being assigned to the drug treatment group, and

  • a higher likelihood of depression, which in turn is associated with a lower probability of recovery.

Use this information to reason about the direction of confounding and the emergence of the Simpson’s reversal.

Problem 1.4 — The Monty Hall problem

In the late 1980s, a writer named Marilyn vos Savant started a regular column in Parade magazine, a weekly supplement to the Sunday newspaper in many U.S. cities. Her column, Ask Marilyn, continues to this day and features her answers to various puzzles, brainteasers, and scientific questions submitted by readers. The magazine billed her as “the world’s smartest woman,” which undoubtedly motivated readers to come up with a question that would stump her.

Of all the questions she ever answered, none created a greater furor than this one, which appeared in a column in September 1990:

“Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, ‘Do you want to pick door #2?’ Is it to your advantage to switch your choice of doors?”

For American readers, the question was obviously based on a popular televised game show called Let’s Make a Deal, whose host, Monty Hall, used to play precisely this sort of mind game with the contestants. In her answer, vos Savant argued that contestants should switch doors. By not switching, they would have only a one-in-three probability of winning; by switching, they would double their chances to two in three.

Even the smartest woman in the world could never have anticipated what happened next. Over the next few months, she received more than 10,000 letters from readers, most of them disagreeing with her, and many of them from people who claimed to have PhDs in mathematics or statistics. A small sample of the comments from academics includes:

  • “You blew it, and you blew it big!” (Scott Smith, PhD)

  • “May I suggest that you obtain and refer to a standard textbook on probability before you try to answer a question of this type again?” (Charles Reid, PhD)

  • “You blew it!” (Robert Sachs, PhD)

  • “You are utterly incorrect.” (Ray Bobo, PhD)

In general, the critics argued that it shouldn’t matter whether you switch doors or not—there are only two doors left in the game, and you have chosen your door completely at random, so the probability that the car is behind your door must be one-half either way.

(a) Who was right? Who was wrong? And why does the problem incite such passion? Provide a qualitative explanation by considering the following table:

| Door 1 | Door 2 | Door 3 | Outcome if you switch | Outcome if you stay |
|---|---|---|---|---|
| Auto | Goat | Goat | Lose | Win |
| Goat | Auto | Goat | Win | Lose |
| Goat | Goat | Auto | Win | Lose |

The table shows that switching doors is twice as attractive as not switching.
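Before turning to the formal argument in (b), a Monte Carlo sanity check of the one-third vs. two-thirds claim can be useful; this simulation sketch is not the Bayes' theorem proof the exercise asks for.

```python
# Monte Carlo sanity check of the Monty Hall switch/stay probabilities.
import random

def play(switch, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(3)
        pick = random.randrange(3)
        # Host opens a goat door that is neither the pick nor the car.
        # (When pick == car, the first such door is chosen; this does not
        # affect the overall win rates for switching or staying.)
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials

print(play(switch=True))    # ~2/3
print(play(switch=False))   # ~1/3
```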

(b) Prove, using Bayes’ theorem, that switching doors improves your chances of winning the car in the Monty Hall problem.

(c) Define the structural model that corresponds to the Monty Hall problem, and use it to describe the joint distribution of all variables.

Problem 1.5 — Small Bayesian network calculations

Consider the following Bayesian network containing four Boolean random variables $A, B, C, D$:

  • $A$ and $B$ have no parents,

  • $C$ has parent $A$,

  • $D$ has parents $A$ and $B$.

The conditional probability tables are:

  • $P(A) = 0.1$

  • $P(B) = 0.5$

  • $P(C \mid A) = 0.7$

  • $P(C \mid \neg A) = 0.2$

and for $D$:

  • $P(D \mid A,B) = 0.9$

  • $P(D \mid A,\neg B) = 0.7$

  • $P(D \mid \neg A,B) = 0.6$

  • $P(D \mid \neg A,\neg B) = 0.3$

(a) Compute $P(\neg A, B, \neg C, D)$.

Choose one:

A. 0.216
B. 0.054
C. 0.024
D. 0.006
E. None of the above


(b) Compute $P(A \mid B, C, D)$.

Choose one:

A. 0.0315
B. 0.0855
C. 0.368
D. 0.583
E. None of the above


(c) True or False: The Bayesian network associated with the computation

$$P(A)\,P(B)\,P(C \mid A,B)\,P(D \mid C)\,P(E \mid B,C)$$

has edges $A \to C$, $B \to C$, $B \to E$, $C \to D$, $C \to E$ and no other edges.


(d) True or False: The product

$$P(A \mid B)\,P(B \mid C)\,P(C \mid D)\,P(D \mid A)$$

corresponds to a valid Bayesian network over $A, B, C, D$.

Problem 1.6 — Constructing and reading a Bayesian network

(a) Given the tables below, draw a minimal representative Bayesian network for this model. Be sure to label all nodes and the directionality of the edges.

Marginal for $D$:

| $D$ | $P(D)$ |
|---|---|
| $+$ | 0.1 |
| $-$ | 0.9 |

Conditional for $B$ given $D$:

| $D$ | $B$ | $P(B \mid D)$ |
|---|---|---|
| $+$ | $+$ | 0.7 |
| $+$ | $-$ | 0.3 |
| $-$ | $+$ | 0.5 |
| $-$ | $-$ | 0.5 |

Conditional for $X$ given $D$:

| $D$ | $X$ | $P(X \mid D)$ |
|---|---|---|
| $+$ | $+$ | 0.7 |
| $+$ | $-$ | 0.3 |
| $-$ | $+$ | 0.8 |
| $-$ | $-$ | 0.2 |

Conditional for $A$ given $D$ and $X$:

| $D$ | $X$ | $A$ | $P(A \mid D,X)$ |
|---|---|---|---|
| $+$ | $+$ | $+$ | 0.9 |
| $+$ | $+$ | $-$ | 0.1 |
| $+$ | $-$ | $+$ | 0.8 |
| $+$ | $-$ | $-$ | 0.2 |
| $-$ | $+$ | $+$ | 0.6 |
| $-$ | $+$ | $-$ | 0.4 |
| $-$ | $-$ | $+$ | 0.1 |
| $-$ | $-$ | $-$ | 0.9 |

(b) Compute the following probabilities:

  1. $P(+d \mid +b)$

  2. $P(+d, +a)$

  3. $P(+d \mid +a)$

Here “$+d$” means $D = +$, “$+b$” means $B = +$, etc.


(c) Which of the following conditional independencies are guaranteed by the above network?

  • $X \perp\!\!\!\perp B \mid D$ (i.e. $X$ and $B$ are conditionally independent given $D$)

  • $D \perp\!\!\!\perp A \mid X$

  • $D \perp\!\!\!\perp A \mid B$

  • $D \perp\!\!\!\perp X \mid A$

Problem 1.7 — Simpson’s reversal with a fatal syndrome

Assume that a population of patients contains a fraction $r$ of individuals who suffer from a certain fatal syndrome $Z$, which simultaneously makes it uncomfortable for them to take a life-prolonging drug $X$ (see Figure 1 in the exercise sheet).

Let

  • $Z = z_1$ and $Z = z_0$ represent, respectively, the presence and absence of the syndrome,

  • $Y = y_1$ and $Y = y_0$ represent death and survival, respectively,

  • $X = x_1$ and $X = x_0$ represent taking and not taking the drug.

Assume that patients not carrying the syndrome ($Z = z_0$) die with probability $p_2$ if they take the drug and with probability $p_1$ if they do not. Patients carrying the syndrome ($Z = z_1$), on the other hand, die with probability $p_3$ if they do not take the drug and with probability $p_4$ if they do take the drug.

Further, patients having the syndrome are more likely to avoid the drug, with probabilities

$$q_1 = P(x_1 \mid z_0), \qquad q_2 = P(x_1 \mid z_1).$$

(a) Based on this model, compute the joint distributions $P(x,y,z)$, $P(x,y)$, $P(x,z)$, and $P(y,z)$ for all values of $x, y, z$, in terms of the parameters $(r, p_1, p_2, p_3, p_4, q_1, q_2)$.

Hint. Decompose the product using the graph structure.


(b) Calculate the difference

$$P(y_1 \mid x_1) - P(y_1 \mid x_0)$$

for three populations:

  1. those carrying the syndrome,

  2. those not carrying the syndrome,

  3. the population as a whole.


(c) Using your results from part (b), find a combination of parameters that exhibits Simpson’s reversal (i.e. the drug appears beneficial in each subgroup but harmful, or vice versa, in the population as a whole).