Bayesian Networks I

Problem 1.1

For baseball fans: Table 1 gives hits and at-bats for two players (David Justice and Derek Jeter) in three seasons.

| Player | 1995 | 1996 | 1997 | All three years |
| --- | --- | --- | --- | --- |
| David Justice | 104/411 = 0.253 | 45/140 = 0.321 | 163/495 = 0.329 | 312/1046 = 0.298 |
| Derek Jeter | 12/48 = 0.250 | 183/582 = 0.314 | 190/654 = 0.291 | 385/1284 = 0.300 |

Table 1: Hits / At-bats (seasonal and aggregated).

Note the paradoxical pattern: Justice has the higher batting average in each of the three seasons (1995, 1996, 1997), yet Jeter has the higher batting average when the three seasons are combined. In this exercise you will explain qualitatively how this instance of Simpson’s paradox can arise and (optionally) analyze the data quantitatively with Bayesian models.

a) How can one player be a worse hitter than the other in 1995, 1996, and 1997 but better over the three-year period? Resolve the paradox (qualitatively).

b) (Optional) Propose a hierarchical (exchangeable) Beta-Binomial model that (i) models season-level batting probabilities $\theta_{p,s}$, (ii) allows pooling across seasons (and possibly across players), and (iii) has weakly informative hyperpriors. Compare the conclusions (probability statements, intervals) from the hierarchical model to the independent per-season and aggregated analyses. Which analysis do you find most persuasive for estimating future batting performance, and why?

Solution 1.1

a) This apparent reversal is an instance of Simpson’s paradox. The key is that the number of at-bats (the denominators) is not distributed evenly across years. Derek Jeter had very few at-bats in 1995, so his relatively low batting average that year had little influence on his overall average. David Justice, by contrast, had many more at-bats in his least productive year (1995), which substantially pulled down his combined average. A player’s combined average is not a head-to-head comparison of seasons but a weighted average of his seasonal averages, with weights proportional to his at-bats in each season. Because the two players’ weights differ so much, the combined ordering can differ from the seasonal ordering, and the paradox dissolves.
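The weighting argument can be checked directly on the numbers in Table 1. The following is a minimal sketch using exact fractions; the helper names `combined_average` and `weighted_average` are illustrative and not part of the exercise.

```python
from fractions import Fraction

# (hits, at-bats) per season, copied from Table 1.
records = {
    "David Justice": {1995: (104, 411), 1996: (45, 140), 1997: (163, 495)},
    "Derek Jeter": {1995: (12, 48), 1996: (183, 582), 1997: (190, 654)},
}


def combined_average(seasons):
    """Pooled batting average: total hits divided by total at-bats."""
    hits = sum(h for h, _ in seasons.values())
    at_bats = sum(n for _, n in seasons.values())
    return Fraction(hits, at_bats)


def weighted_average(seasons):
    """Seasonal averages weighted by each season's share of at-bats."""
    total = sum(n for _, n in seasons.values())
    return sum(Fraction(n, total) * Fraction(h, n) for h, n in seasons.values())


for player, seasons in records.items():
    total = sum(n for _, n in seasons.values())
    for year, (h, n) in sorted(seasons.items()):
        # Seasonal average and the weight it receives in the combined average.
        print(f"{player} {year}: avg={h / n:.3f} weight={n / total:.3f}")
    print(f"{player} combined: {float(combined_average(seasons)):.3f}")
```

The printed weights show that 1995 accounts for a large share of Justice's at-bats but only a small share of Jeter's, which is what drives the reversal.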

Problem 1.2

In each of the following scenarios, you are asked to decide whether the aggregate data (pooled across all subgroups) or the segregated data (analyzed separately by relevant subgroups) should be used to infer the true causal effect. Provide a short justification for each answer, explaining which type of data better represents the causal relationship of interest.

(a) Kidney stone treatments I: In an observational study published in 1996, open surgery to remove kidney stones had a better success rate than endoscopic surgery for small kidney stones. It also had a better success rate for large kidney stones. However, it had a lower success rate overall. Resolve the paradox.

(b) Kidney stone treatments II: There are two treatments used on kidney stones: Treatment A and Treatment B. Doctors are more likely to use Treatment A on large (and therefore, more severe) stones and more likely to use Treatment B on small stones. Should a patient who doesn’t know the size of his or her stone examine the general population data, or the stone size-specific data when deciding which treatment they would like to request?

(c) Smoking and thyroid disease survival: A 1995 study on thyroid disease reported that smokers had a higher twenty-year survival rate (76%) than nonsmokers (69%). However, when survival rates were analyzed within seven age groups, nonsmokers had higher survival in six of the seven groups, with only a minimal difference in the remaining one. To assess the causal effect of smoking on survival, should the analysis be based on the overall (aggregate) survival rates or on the age-specific (segregated) data?

(d) Surgical performance of two doctors: In a small town, two doctors have each performed 100 surgeries, divided into two categories: one very difficult surgery and one very easy surgery. Doctor 1 performs mostly easy surgeries, while Doctor 2 performs mostly difficult ones. You need surgery but do not know whether your case will be easy or difficult. To maximize your chances of a successful operation, should you compare the doctors’ overall (aggregate) success rates, or their success rates within each type of surgery (segregated data)?

Solution 1.2

a) Kidney stone treatments I: Stone size is a confounder. Larger stones were more likely to be treated with open surgery and also had a worse prognosis, so the overall success rate of open surgery is pulled down by its harder caseload. The size-specific (segregated) data reflect the causal effect.

b) Kidney stone treatments II: Stone size is a common cause of treatment choice and recovery, and treatment does not change stone size. We should consult the segregated data, conditioned on stone size.

c) Smoking and thyroid disease: Age is a confounder. Stratifying by age, we conclude that smoking has a negative impact on survival.

d) Surgical performance: Difficulty is a common cause of the choice of doctor and of the outcome. Consult the segregated data, conditioned on difficulty.

Problem 1.3

In an attempt to estimate the effectiveness of a new drug, a randomized experiment is conducted. Half of the patients (50%) are assigned to receive the new drug, and the remaining half (50%) are assigned to receive a placebo. A day before the actual experiment, a nurse distributes lollipops to some patients who show signs of depression. By coincidence, most of these patients happen to be in the treatment-bound ward—that is, among those who will receive the new drug the next day. When the experiment is analyzed, an unexpected pattern emerges: a Simpson’s reversal. Although the drug appears beneficial to the population as a whole, within both subgroups (lollipop receivers and nonreceivers) drug takers are less likely to recover than nontakers.

Assume that receiving and sucking a lollipop has no direct effect whatsoever on recovery. Using this setup, answer the following questions:

(a) Is the drug beneficial to the population as a whole or harmful?

(b) Does your answer contradict the gender example discussed in class, where sex-specific data were deemed more appropriate for determining the causal effect?

(c) Draw (informally) a causal graph that captures the essential structure of the story.

(d) Explain how Simpson’s reversal arises in this scenario. What roles do the variables lollipop, treatment assignment, and recovery play?

(e) Would your explanation change if the lollipops were handed out (according to the same criterion) after the study rather than before?

Hint: Receiving a lollipop is an indicator of two things: a higher likelihood of being assigned to the drug treatment group, and a higher likelihood of depression, which in turn is associated with a lower probability of recovery. Use this information to reason about the direction of confounding and the emergence of the Simpson’s reversal.

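Solution 1.3

a) The aggregated data give the correct answer: the drug is beneficial. The disaggregated data are biased because stratifying on lollipop receipt introduces a spurious association between treatment and recovery.

b) This does not contradict the gender example. In the gender example, gender was a cause of both treatment and recovery (a confounder). Here, lollipop receipt is correlated with both treatment and recovery but causes neither.

c) Causal graph: $U_1 \rightarrow X$, $U_1 \rightarrow Z$, $U_2 \rightarrow Z$, $U_2 \rightarrow Y$, $X \rightarrow Y$, where $X$ is treatment, $Y$ is recovery, and $Z$ is lollipop receipt. $U_1$ and $U_2$ are unobserved factors; following the hint, $U_2$ can be read as depression.

d) Conditioning on lollipop receipt $Z$ can produce the reversal because $Z$ is a collider on the path $X \leftarrow U_1 \rightarrow Z \leftarrow U_2 \rightarrow Y$. It is a proxy for both treatment assignment and depression, so stratifying on it opens this otherwise blocked path.

e) Even if the lollipops were handed out after the study, the conclusion would not change: lollipop receipt would still be only spuriously associated with treatment and recovery.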

Problem 1.4

In the late 1980s, a writer named Marilyn vos Savant started a regular column in Parade magazine. Her column, Ask Marilyn, features her answers to various puzzles. One question created a great furor:

“Suppose you’re on a game show, and you’re given the choice of three doors. Behind one door is a car, behind the others, goats. You pick a door, say #1, and the host, who knows what’s behind the doors, opens another door, say #3, which has a goat. He says to you, ‘Do you want to pick door #2?’ Is it to your advantage to switch your choice of doors?”

Vos Savant argued that contestants should switch doors. By not switching, they would have only a one-in-three probability of winning; by switching, they would double their chances to two in three. Many academics disagreed, arguing that it shouldn’t matter whether you switch or not.

a) Who was right? Who was wrong? And why does the problem incite such passion? Provide a qualitative explanation by considering the following table:

| Door 1 | Door 2 | Door 3 | Outcome If You Switch | Outcome If You Stay |
| --- | --- | --- | --- | --- |
| Auto | Goat | Goat | Lose | Win |
| Goat | Auto | Goat | Win | Lose |
| Goat | Goat | Auto | Win | Lose |

Table 2: The three possible arrangements of doors and goats in Let’s Make a Deal, showing that switching doors is twice as attractive as not.

b) Prove, using Bayes’ theorem, that switching doors improves your chances of winning the car in the Monty Hall problem.

c) Define the structural model that corresponds to the Monty Hall problem, and use it to describe the joint distribution of all variables.

Solution 1.4

a) Vos Savant was right: switching wins with probability 2/3, and staying wins with probability 1/3.

b) Proof using Bayes’ theorem, with $X$ the door you pick, $Y$ the door hiding the car, and $Z$ the door the host opens: $P(Y=A \mid X=A, Z=C) = 1/3$ (stay) and $P(Y=B \mid X=A, Z=C) = 2/3$ (switch). (See detailed steps in original text.)

c) Structural model: $V=\{X,Y,Z\}$, $U=\{U_X, U_Y, U_Z\}$, $X=U_X$, $Y=U_Y$, $Z=f(X,Y)+U_Z$. The host’s mechanism is $P(z \mid x, y) = 1$ if $z \neq x$, $z \neq y$, $x \neq y$; $P(z \mid x, y) = 0.5$ if $z \neq x$, $z \neq y$, $x = y$; and $P(z \mid x, y) = 0$ otherwise.
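The Bayes computation in part b) can be reproduced by enumerating the three possible car locations. This is a minimal sketch that assumes a uniform prior over the car's door and the host mechanism from part c); the function names `host_prob` and `posterior_car` are illustrative.

```python
from fractions import Fraction

DOORS = ("A", "B", "C")


def host_prob(z, x, y):
    """P(Z=z | X=x, Y=y): the host never opens the chosen door or the car door."""
    allowed = [d for d in DOORS if d != x and d != y]
    return Fraction(1, len(allowed)) if z in allowed else Fraction(0)


def posterior_car(x, z):
    """P(Y=y | X=x, Z=z) for every door y, by Bayes' theorem."""
    prior = Fraction(1, 3)  # assumption: car placed uniformly, independent of X
    joint = {y: prior * host_prob(z, x, y) for y in DOORS}
    evidence = sum(joint.values())  # P(Z=z | X=x)
    return {y: p / evidence for y, p in joint.items()}


posterior = posterior_car("A", "C")
print("stay  :", posterior["A"])
print("switch:", posterior["B"])
```

The asymmetry comes entirely from `host_prob`: when the car is behind the chosen door the host has two doors to choose from, otherwise his hand is forced.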

Problem 1.5

Consider the following Bayesian network containing four Boolean random variables, with edges $A \rightarrow C$, $A \rightarrow D$, and $B \rightarrow D$.

The conditional probability tables are:

  • $P(A) = 0.1$

  • $P(B) = 0.5$

  • $P(C \mid A) = 0.7$, $P(C \mid \neg A) = 0.2$

  • $P(D \mid A, B) = 0.9$

  • $P(D \mid \neg A, B) = 0.6$

  • $P(D \mid A, \neg B) = 0.7$

  • $P(D \mid \neg A, \neg B) = 0.3$

a) Compute $P(\neg A, B, \neg C, D)$. A. 0.216 B. 0.054 C. 0.024 D. 0.006 E. None of the above

b) Compute $P(A \mid B, C, D)$. A. 0.0315 B. 0.0855 C. 0.368 D. 0.583 E. None of the above

c) True or False: The Bayesian network associated with the computation $P(A)P(B)P(C \mid A,B)P(D \mid C)P(E \mid B,C)$ has edges $A \rightarrow C$, $B \rightarrow C$, $B \rightarrow E$, $C \rightarrow D$, $C \rightarrow E$ and no other edges.

d) True or False: The product $P(A \mid B)P(B \mid C)P(C \mid D)P(D \mid A)$ corresponds to a valid Bayesian network over $A$, $B$, $C$, $D$.

Solution 1.5

a) A (0.216). $P(\neg A, B, \neg C, D) = P(\neg C \mid \neg A)\,P(D \mid \neg A, B)\,P(\neg A)\,P(B) = (0.8)(0.6)(0.9)(0.5) = 0.216$.

b) C (0.368). $P(A \mid B, C, D) = P(A, B, C, D) / P(B, C, D)$. Numerator: $P(A, B, C, D) = 0.7 \cdot 0.9 \cdot 0.1 \cdot 0.5 = 0.0315$. The denominator adds the second term $P(\neg A, B, C, D) = 0.2 \cdot 0.6 \cdot 0.9 \cdot 0.5 = 0.054$, so $P(B, C, D) = 0.0315 + 0.054 = 0.0855$. Ratio: $0.0315 / 0.0855 \approx 0.368$.

c) True. The factors map directly to the listed edges: each conditioning variable is a parent of the variable it conditions.

d) False. The factors imply the edges $A \rightarrow D \rightarrow C \rightarrow B \rightarrow A$, which form a cycle, so the graph is not a DAG.
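Parts a) and b) can be verified with a few lines of code that multiply the factors of the network. This is a sketch, not a general inference routine: it only covers the case $B = \text{true}$ that both parts need, and it takes $P(D \mid \neg A, B) = 0.6$ from the arithmetic in Solution 1.5 as an assumption. The names `bern` and `joint_b_true` are illustrative.

```python
# CPT entries from Problem 1.5 (True means the event occurs).
P_A = 0.1
P_B = 0.5
P_C_GIVEN_A = {True: 0.7, False: 0.2}
# P(D | A, B) restricted to B = True, keyed by the value of A.
# Assumption: 0.6 for P(D | not A, B), as used in the arithmetic of Solution 1.5.
P_D_GIVEN_A_WITH_B = {True: 0.9, False: 0.6}


def bern(p_true, value):
    """Probability that a Boolean variable with P(True) = p_true equals value."""
    return p_true if value else 1.0 - p_true


def joint_b_true(a, c, d):
    """P(A=a, B=True, C=c, D=d) via the factorization P(A) P(B) P(C|A) P(D|A,B)."""
    return (
        bern(P_A, a)
        * P_B
        * bern(P_C_GIVEN_A[a], c)
        * bern(P_D_GIVEN_A_WITH_B[a], d)
    )


# Part a): P(not A, B, not C, D)
part_a = joint_b_true(a=False, c=False, d=True)

# Part b): P(A | B, C, D), summing A out of the denominator
numerator = joint_b_true(a=True, c=True, d=True)
denominator = numerator + joint_b_true(a=False, c=True, d=True)
part_b = numerator / denominator

print(f"a) {part_a:.3f}")
print(f"b) {part_b:.3f}")
```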


Problem 1.6

a) Given the tables below, draw a minimal representative Bayesian network of this model.

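| D | $P(D)$ |
| --- | --- |
| + | 0.1 |
| - | 0.9 |

| D | B | $P(B \mid D)$ |
| --- | --- | --- |
| + | + | 0.7 |
| + | - | 0.3 |
| - | + | 0.5 |
| - | - | 0.5 |

| D | X | $P(X \mid D)$ |
| --- | --- | --- |
| + | + | 0.7 |
| + | - | 0.3 |
| - | + | 0.8 |
| - | - | 0.2 |

| D | X | A | $P(A \mid D, X)$ |
| --- | --- | --- | --- |
| + | + | + | 0.9 |
| + | + | - | 0.1 |
| + | - | + | 0.8 |
| + | - | - | 0.2 |
| - | + | + | 0.6 |
| - | + | - | 0.4 |
| - | - | + | 0.1 |
| - | - | - | 0.9 |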

Be sure to label all nodes and the directionality of the edges.

b) Compute the following probabilities: $P(+d \mid +b)$, $P(+d, +a)$, $P(+d \mid +a)$.

c) Which of the following conditional independences are guaranteed by the above network?

  • $X \perp\!\!\!\perp B \mid D$

  • $D \perp\!\!\!\perp A \mid X$

  • $D \perp\!\!\!\perp A \mid B$

  • $D \perp\!\!\!\perp X \mid A$

Solution 1.6

a) Network: $D \rightarrow X$, $D \rightarrow B$, $D \rightarrow A$, $X \rightarrow A$.

b) $P(+d \mid +b) = 0.135$, $P(+d, +a) = 0.087$, $P(+d \mid +a) = 0.162$.

c) Only $X \perp\!\!\!\perp B \mid D$ is guaranteed.
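The three probabilities in part b) can be checked by brute-force enumeration of the joint distribution $P(D)\,P(B \mid D)\,P(X \mid D)\,P(A \mid D, X)$. The sketch below copies the tables from Problem 1.6; the helper `prob` is an illustrative name and simply sums the joint over every assignment consistent with the given values.

```python
from itertools import product

# Tables from Problem 1.6; keys follow the column order of each table.
P_D = {"+": 0.1, "-": 0.9}
P_B_GIVEN_D = {("+", "+"): 0.7, ("+", "-"): 0.3, ("-", "+"): 0.5, ("-", "-"): 0.5}
P_X_GIVEN_D = {("+", "+"): 0.7, ("+", "-"): 0.3, ("-", "+"): 0.8, ("-", "-"): 0.2}
P_A_GIVEN_DX = {
    ("+", "+", "+"): 0.9, ("+", "+", "-"): 0.1,
    ("+", "-", "+"): 0.8, ("+", "-", "-"): 0.2,
    ("-", "+", "+"): 0.6, ("-", "+", "-"): 0.4,
    ("-", "-", "+"): 0.1, ("-", "-", "-"): 0.9,
}


def joint(d, b, x, a):
    """Joint probability from the network D -> B, D -> X, (D, X) -> A."""
    return P_D[d] * P_B_GIVEN_D[d, b] * P_X_GIVEN_D[d, x] * P_A_GIVEN_DX[d, x, a]


def prob(**fixed):
    """Marginal probability of the fixed values, summing out all other variables."""
    total = 0.0
    for d, b, x, a in product("+-", repeat=4):
        world = {"d": d, "b": b, "x": x, "a": a}
        if all(world[name] == value for name, value in fixed.items()):
            total += joint(d, b, x, a)
    return total


p_d_given_b = prob(d="+", b="+") / prob(b="+")
p_d_and_a = prob(d="+", a="+")
p_d_given_a = p_d_and_a / prob(a="+")

print(f"P(+d | +b) = {p_d_given_b:.3f}")
print(f"P(+d, +a)  = {p_d_and_a:.3f}")
print(f"P(+d | +a) = {p_d_given_a:.3f}")
```

Enumeration is exponential in the number of variables, which is fine for four Boolean nodes but is not how one would handle a large network.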

Problem 1.7

Assume that a population of patients contains a fraction $r$ of individuals who suffer from a certain fatal syndrome $Z$, which simultaneously makes it uncomfortable for them to take a life-prolonging drug $X$. Let $Z=z_1$ and $Z=z_0$ represent, respectively, the presence and absence of the syndrome, $Y=y_1$ and $Y=y_0$ represent death and survival, respectively, and $X=x_1$ and $X=x_0$ represent taking and not taking the drug.

Assume that patients not carrying the syndrome ($Z=z_0$) die with probability $p_2$ if they take the drug and with probability $p_1$ if they do not. Patients carrying the syndrome ($Z=z_1$), on the other hand, die with probability $p_3$ if they do not take the drug and with probability $p_4$ if they do take the drug. Further, patients having the syndrome are more likely to avoid the drug, with probabilities $q_1=P(x_1 \mid z_0)$ and $q_2=P(x_1 \mid z_1)$.

(a) Based on this model, compute the joint distributions $P(x,y,z)$, $P(x,y)$, $P(x,z)$, and $P(y,z)$ for all values of $x$, $y$, and $z$, in terms of the parameters $(r, p_1, p_2, p_3, p_4, q_1, q_2)$. [Hint: Decompose the product.]

(b) Calculate the difference $P(y_1 \mid x_1) - P(y_1 \mid x_0)$ for three populations: (1) those carrying the syndrome, (2) those not carrying the syndrome, and (3) the population as a whole.

(c) Using your results from part (b), find a combination of parameters that exhibits Simpson’s reversal.

Solution 1.7

a) The joint distributions are derived using the chain rule $P(x,y,z) = P(y \mid x,z)\,P(x \mid z)\,P(z)$, for example $P(x_0, y_0, z_0) = (1-p_1)(1-q_1)(1-r)$ ... (see full list in text).

b) (1) Syndrome: $p_4 - p_3$. (2) No syndrome: $p_2 - p_1$. (3) Whole population: formula provided in text.

c) Example parameters: $p_1=0.1$, $p_2=0$, $p_3=0.3$, $p_4=0.2$, $q_1=0$, $q_2=1$, $r=0.1$. Result: both subgroups have difference $-0.1$, while the whole population has difference $+0.1$.
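The example in part c) can be checked numerically. The sketch below reads the two subgroup differences directly off the model parameters and obtains the whole-population difference by marginalizing the chain-rule joint over $Z$; the function name `risk_differences` and the string labels are illustrative, not notation from the exercise.

```python
def risk_differences(r, p1, p2, p3, p4, q1, q2):
    """Return P(y1|x1) - P(y1|x0) for (syndrome, no syndrome, whole population)."""
    # P(y1 | x, z): probability of death for each drug/syndrome combination.
    death = {("x0", "z0"): p1, ("x1", "z0"): p2, ("x0", "z1"): p3, ("x1", "z1"): p4}
    take = {"z0": q1, "z1": q2}      # P(x1 | z)
    prior = {"z0": 1 - r, "z1": r}   # P(z)

    def p_x_z(x, z):
        """P(x, z) = P(x | z) P(z)."""
        return prior[z] * (take[z] if x == "x1" else 1 - take[z])

    def p_death_given_x(x):
        """P(y1 | x) = sum_z P(y1 | x, z) P(x, z) / sum_z P(x, z)."""
        numerator = sum(death[x, z] * p_x_z(x, z) for z in prior)
        denominator = sum(p_x_z(x, z) for z in prior)
        return numerator / denominator

    whole = p_death_given_x("x1") - p_death_given_x("x0")
    return p4 - p3, p2 - p1, whole


# Example parameters from part c).
syndrome, no_syndrome, whole = risk_differences(
    r=0.1, p1=0.1, p2=0.0, p3=0.3, p4=0.2, q1=0.0, q2=1.0
)
print(f"syndrome: {syndrome:+.2f}, no syndrome: {no_syndrome:+.2f}, whole: {whole:+.2f}")
```

With these parameters the drug lowers the death probability in each subgroup, yet drug takers die more often overall, because in this example everyone who takes the drug carries the syndrome.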