
Bayesian Networks III

Problem 3.1

Even after the smoking and cancer debate was resolved, a major paradox remained. In the mid-1960s, Jacob Yerushalmy observed that newborns of smoking mothers seemed to have better survival rates when restricted to the subgroup of low-birth-weight babies. This finding, known as the birthweight paradox, appeared to contradict medical consensus that smoking during pregnancy increases neonatal mortality.

Yerushalmy’s data, based on more than 15,000 births in the San Francisco Bay Area, showed:

  • Babies of smoking mothers were, on average, lighter at birth.

  • Low-birth-weight babies (below 5.5 pounds) had a mortality rate more than twenty times higher than normal-weight babies.

  • Surprisingly, within the subgroup of low-birth-weight babies, those born to smoking mothers had lower mortality than those of non-smoking mothers.

This paradox persisted for decades until causal inference concepts clarified that the explanation lies in collider bias.

a) Draw a causal diagram (DAG) with the following variables:

  • Smoking status of the mother

  • Birth weight of the child

  • Neonatal mortality

  • Other unobserved factors (e.g., genetic abnormalities) that also affect birth weight and mortality

b) Clearly indicate where collider bias arises by explaining why restricting to low-birth-weight infants can lead to the spurious conclusion that smoking appears protective, even if smoking increases mortality overall.


Problem 3.2

Consider a clinical study investigating the effect of a drug (X) on patient recovery (Y), mediated by blood pressure (Z). The causal relationships are:

  • X (drug) affects Z (post-treatment blood pressure)

  • Z affects Y (recovery)

  • X may also have a direct toxic effect on Y

At the end of the experiment, the results are summarized in Table 1.

             No drug          Drug
Low BP       81/87 (93%)      234/270 (87%)
High BP      192/263 (73%)    55/80 (69%)
Combined     273/350 (78%)    289/350 (83%)

Table 1: Recovery rates stratified by post-treatment blood pressure.

a) Explain why the results summarized in Table 1 seem paradoxical.

b) Draw the causal graph including the nodes X (drug), Z (blood pressure), and Y (recovery).

c) Explain why the backdoor criterion does not apply for estimating the causal effect of X on Y.

d) Compute the causal effect numerically using the adjustment formula. In particular, use the front-door adjustment formula to estimate the effect of X on Y, accounting for the mediator Z.

e) Compare the effect obtained from the aggregated data with the effect suggested by the stratified data (Low BP vs. High BP). Explain why the aggregate effect gives the correct recommendation.

f) Estimate the causal effect using DoWhy:

  • Build a small synthetic dataset reflecting the table above.

  • Define the causal graph in DoWhy.

  • Identify the effect using the front-door criterion.

  • Estimate the causal effect numerically and verify that it matches the manual computation.

  • Discuss why the stratified data alone can be misleading.


Problem 3.3

For each of the following games, determine whether the backdoor criterion applies to estimate the causal effect of X on Y. Indicate which variables (if any) must be adjusted for.

a) Game 1 (graph with X → A, A → Y, A → B)

b) Game 2 (graph with X, Y, A, B, C, D, E)

c) Game 3 (graph with X, Y, A, B)

d) Game 4 (graph with X, Y, A, B, C)

e) Game 5 (M-graph structure)

f) Game 6 (graph with A, B, C, D, E, F, X, Y)

g) Game 7 (graph with X, Y, A, B, where B is unobserved)

How do you proceed if B is unobserved?


Problem 3.4

Consider the following causal DAG:

  • U (Genotype) → X (Smoking)

  • U (Genotype) → Y (Lung Cancer)

  • X (Smoking) → Z (Tar Deposits)

  • Z (Tar Deposits) → Y (Lung Cancer)

where:

  • X = Smoking {yes, no}

  • Z = Tar {yes, no}

  • Y = Lung Cancer {yes, no}

  • U = (Unobserved) Genotype

We want to compute the Average Causal Effect (ACE):

ACE = P(Y=yes | do(X=yes)) − P(Y=yes | do(X=no))

using the front-door adjustment.
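Written out for this DAG, the front-door adjustment formula to be evaluated in parts a)–c) is the standard one, with the sums running over the two values of Z and of X:

P(Y=y \mid do(X=x)) = \sum_{z} P(Z=z \mid X=x) \sum_{x'} P(Y=y \mid X=x', Z=z)\, P(X=x')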

The observed data is summarized in the following tables:

Table 1:

             Tar    No Tar    All Subjects
Smoker       380    20        400
Nonsmoker    20     380       400

Table 2:

             Tar (Smoker)   Tar (Nonsmoker)   No Tar (Smoker)   No Tar (Nonsmoker)
No Cancer    323 (85%)      1 (5%)            2 (10%)           342 (90%)
Cancer       57 (15%)       19 (95%)          18 (90%)          38 (10%)

a) Compute the conditional probabilities P(Z|X) and P(Y|X,Z) from the data tables.

b) Using the front-door adjustment formula, compute P(Y=yes | do(X=yes)) and P(Y=yes | do(X=no)) step by step. Assume P(X=yes) = P(X=no) = 0.5.

c) Compute the ACE based on your results from the previous step and interpret the causal effect of smoking on lung cancer in this dataset.


Problem 3.5

The Lalonde dataset is a well-known benchmark in causal inference. It contains 445 observations with 12 variables (age, educ, black, hisp, married, nodegr, re74, re75, re78, u74, u75, treat). Our goal is to use the DoWhy library to estimate the causal effect of treatment (treat) on the outcome (re78), while controlling for confounding variables.

a) Import dependencies. Load the required libraries: dowhy, pandas, and numpy.

b) Load the Lalonde dataset.

c) Specify the causal model. Define a CausalModel with:

  • Treatment: treat

  • Outcome: re78

  • Common causes: ["nodegr", "black", "hisp", "age", "educ", "married"]

d) Identify the causal effect. Use DoWhy’s identify_effect function.

e) Estimate the causal effect. Estimate the Average Treatment Effect (ATE) using propensity score weighting. Compare this causal estimate with a naive difference in means.

f) Use the do method to estimate the Average Treatment Effect.

g) Refute the estimate. Perform robustness checks using DoWhy refuters:

  • Add a random common cause.

  • Replace treatment with a placebo (permute).

  • Use a random subset of the data.


Problem 3.6

In this exercise you will gain hands-on experience with causal inference using the DoWhy Python package. We will use the Infant Health and Development Program (IHDP) dataset. You will use DoWhy to estimate the causal effect of the treatment variable (treatment) on the outcome variable (y_factual). Follow the standard causal inference pipeline: model → identify → estimate → refute.

a) Import dependencies.

b) Load the IHDP dataset.

c) Specify the causal model. Create a CausalModel in DoWhy with treatment as treatment, y_factual as outcome, and all covariates x1...x25 as common causes.

d) Identify the estimand. Use the identify_effect function.

e) Estimate the effect. Estimate the Average Treatment Effect (ATE) using the backdoor linear regression method. Compare your estimated causal effect with the raw difference in means.

f) Refute the estimate. Perform robustness checks using at least two refutation methods.


Solutions

Solution 3.1

a) Variables: S (Smoking), B (Birth Weight), D (Mortality), U (unobserved factor, e.g., a birth defect). Edges: S → B, S → D, U → B, U → D.

b) Birth weight B is a collider on the path S → B ← U. Restricting the analysis to low-birth-weight infants conditions on B, which opens the non-causal path S → B ← U → D and induces a spurious association between S and D: among low-weight babies, those of non-smoking mothers are more likely to owe their low weight to another serious cause U (e.g., a birth defect), so smoking appears protective within the stratum even though it increases mortality overall.
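The collider mechanism can be checked with a quick simulation. All numeric parameters below are illustrative assumptions, not Yerushalmy's data:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# S = mother smokes, U = unobserved birth defect (assumed rates)
S = rng.random(n) < 0.4
U = rng.random(n) < 0.05

# Birth weight in pounds: both smoking and defects lower it
B = 7.5 - 0.8 * S - 3.0 * U + rng.normal(0.0, 1.0, n)

# Mortality: smoking is mildly harmful, defects are very harmful
D = rng.random(n) < 0.01 + 0.01 * S + 0.30 * U

# Overall, smokers' babies die more often (smoking is harmful) ...
print(D[S].mean(), D[~S].mean())

# ... but restricting to the low-birth-weight stratum (conditioning on
# the collider B) reverses the comparison: among low-weight babies,
# non-smokers' low weight is more often explained by a defect U
low = B < 5.5
print(D[S & low].mean(), D[~S & low].mean())
```

In the full sample the smokers' mortality rate is higher, while inside the low-birth-weight stratum it is lower, reproducing the paradox.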

Solution 3.2

a) Paradox: the stratified data show the drug to be harmful in both subgroups, while the aggregated data show it to be beneficial (Simpson’s paradox).

b) Graph: X → Z → Y and X → Y.

c) The backdoor criterion does not help here because there is no backdoor path from X to Y (no arrow points into X), so there is no confounding to adjust for. Z is a mediator, not a confounder, and must not be adjusted for.

d) Since X is unconfounded, P(Y|do(x)) = P(Y|x), and the aggregated data give the causal effect directly: P(Y=1|do(X=0)) = 273/350 ≈ 0.780, P(Y=1|do(X=1)) = 289/350 ≈ 0.826, so ACE ≈ 0.046. Taking the drug increases the recovery probability by about 4.6 percentage points.

e) The stratified comparisons condition on the mediator Z and thereby remove the part of the drug’s effect that operates through blood pressure; because X is unconfounded, the aggregate comparison estimates the total causal effect and yields the correct recommendation.

f) DoWhy code:

# Assumes `data` is a DataFrame with binary columns X, Z, Y (see Table 1)
from dowhy import CausalModel

# Define causal graph: X -> Z -> Y plus the direct edge X -> Y
causal_graph = """digraph { X -> Z; Z -> Y; X -> Y; }"""
model = CausalModel(data=data, treatment='X', outcome='Y', graph=causal_graph)

# Identify and estimate
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(identified_estimand,
                                 method_name="backdoor.linear_regression")
print("Causal Estimate:", estimate.value)
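Part f) also asks for a synthetic dataset reflecting Table 1. A minimal sketch that rebuilds it at the unit level and checks the paradox numerically (pandas only; X = drug taken, Z = high blood pressure, Y = recovered):

```python
import pandas as pd

# One row per patient, replicated so the counts match Table 1 exactly
rows = []
for x, z, recovered, total in [(0, 0, 81, 87), (1, 0, 234, 270),
                               (0, 1, 192, 263), (1, 1, 55, 80)]:
    rows += [(x, z, 1)] * recovered + [(x, z, 0)] * (total - recovered)
data = pd.DataFrame(rows, columns=["X", "Z", "Y"])

# Aggregate recovery rates: the drug looks beneficial ...
agg = data.groupby("X")["Y"].mean()
print(agg)

# ... yet within each blood-pressure stratum it looks harmful
strat = data.groupby(["Z", "X"])["Y"].mean()
print(strat)
```

The resulting frame can be passed as data to a DoWhy CausalModel.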

Solution 3.3

a) Game 1: no backdoor paths. Adjustment set: ∅.

b) Game 2: all backdoor paths contain a collider and are already blocked. Adjustment set: ∅.

c) Game 3: one backdoor path X ← B → Y. Adjustment set: {B}.

d) Game 4: the backdoor path is blocked by the collider B. Adjustment set: ∅.

e) Game 5: two backdoor paths. Adjust for {A, B}, or simply for {C}.

f) Game 6: D plus any node in {A, B, C} will deconfound.

g) Game 7: if B is unobserved, the backdoor path X ← B → Y cannot be closed by adjustment; one must look for another identification strategy (e.g., the front-door criterion, if a suitable mediator exists).
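The role of an adjustment set can be sanity-checked numerically. A sketch for Game 3, assuming a linear SCM with the confounding structure X ← B → Y plus a direct effect X → Y (all coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta = 2.0                # true causal effect of X on Y

B = rng.normal(size=n)                        # confounder
X = 1.5 * B + rng.normal(size=n)              # X <- B
Y = beta * X + 3.0 * B + rng.normal(size=n)   # Y <- X, Y <- B

# Naive slope of Y on X is biased by the open backdoor path X <- B -> Y
naive = np.polyfit(X, Y, 1)[0]

# Including B as a regressor blocks the backdoor path and recovers beta
design = np.column_stack([X, B, np.ones(n)])
adjusted = np.linalg.lstsq(design, Y, rcond=None)[0][0]

print(naive, adjusted)   # naive is inflated; adjusted is close to 2.0
```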

Solution 3.4

a) From the tables: P(Z=yes|X=yes) = 380/400 = 0.95, P(Z=yes|X=no) = 20/400 = 0.05; P(Y=yes|X=yes, Z=yes) = 0.15, P(Y=yes|X=no, Z=yes) = 0.95, P(Y=yes|X=yes, Z=no) = 0.90, P(Y=yes|X=no, Z=no) = 0.10.

b) Front-door adjustment:
P(Y=yes | do(X=yes)) = 0.95·(0.5·0.15 + 0.5·0.95) + 0.05·(0.5·0.90 + 0.5·0.10) = 0.95·0.55 + 0.05·0.50 = 0.5475
P(Y=yes | do(X=no)) = 0.05·0.55 + 0.95·0.50 = 0.5025

c) ACE = 0.5475 − 0.5025 = 0.045. Smoking increases the probability of lung cancer by 4.5 percentage points in this dataset.
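The same computation as a short script, using the conditional probabilities stated in the solution above (in particular P(Y=yes|X=yes, Z=no) = 0.90 follows the solution's numbers):

```python
# Front-door adjustment with the probabilities from the solution
P_x = {1: 0.5, 0: 0.5}                 # assumed P(X)
P_z_x = {1: {1: 0.95, 0: 0.05},        # P(Z=z | X=x)
         0: {1: 0.05, 0: 0.95}}
P_y_xz = {(1, 1): 0.15, (0, 1): 0.95,  # P(Y=yes | X=x, Z=z)
          (1, 0): 0.90, (0, 0): 0.10}

def p_y_do_x(x):
    # P(Y=yes | do(X=x)) = sum_z P(z|x) * sum_x' P(Y=yes|x',z) P(x')
    return sum(P_z_x[x][z] *
               sum(P_y_xz[(xp, z)] * P_x[xp] for xp in (0, 1))
               for z in (0, 1))

ace = p_y_do_x(1) - p_y_do_x(0)
print(p_y_do_x(1), p_y_do_x(0), ace)   # ~0.5475, ~0.5025, ~0.045
```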

Solution 3.5 (Lalonde Code)

import dowhy
from dowhy import CausalModel
import dowhy.datasets

# Load data
lalonde = dowhy.datasets.lalonde_dataset()

# Model
common_causes = ["nodegr", "black", "hisp", "age", "educ", "married"]
lalonde_model = CausalModel(
    data=lalonde,
    treatment='treat',
    outcome='re78',
    common_causes=common_causes
)

# Identify
lalonde_identified_estimand = lalonde_model.identify_effect()

# Estimate
lalonde_estimate = lalonde_model.estimate_effect(
    lalonde_identified_estimand,
    method_name="backdoor.propensity_score_weighting"
)
print("Causal Estimate:", lalonde_estimate.value)

# Refute: robustness checks (part g)
refute_random = lalonde_model.refute_estimate(
    lalonde_identified_estimand, lalonde_estimate,
    method_name="random_common_cause"
)
print(refute_random)

refute_placebo = lalonde_model.refute_estimate(
    lalonde_identified_estimand, lalonde_estimate,
    method_name="placebo_treatment_refuter", placebo_type="permute"
)
print(refute_placebo)

refute_subset = lalonde_model.refute_estimate(
    lalonde_identified_estimand, lalonde_estimate,
    method_name="data_subset_refuter", subset_fraction=0.8
)
print(refute_subset)

Solution 3.6 (IHDP Code)

import pandas as pd
from dowhy import CausalModel

# Load and process data (`url` holds the location of the IHDP csv)
cols = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + ["x" + str(i) for i in range(1, 26)]
data = pd.read_csv(url, header=None)
data.columns = cols
data = data.astype({"treatment": 'bool'}, copy=False)

# Model
model = CausalModel(
    data=data,
    treatment='treatment',
    outcome='y_factual',
    common_causes=["x"+str(i) for i in range(1,26)]
)

# Identify & Estimate
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True
)
print("Causal Estimate:", estimate.value)

# Refute with two robustness checks (part f)
refute_random = model.refute_estimate(
    identified_estimand, estimate, method_name="random_common_cause"
)
refute_placebo = model.refute_estimate(
    identified_estimand, estimate,
    method_name="placebo_treatment_refuter", placebo_type="permute"
)
print(refute_random)
print(refute_placebo)