Bayesian Networks III

Problem 3.1

Even after the smoking and cancer debate was resolved, a major paradox remained. In the mid-1960s, Jacob Yerushalmy observed that newborns of smoking mothers seemed to have better survival rates when restricted to the subgroup of low-birth-weight babies. This finding, known as the birthweight paradox, appeared to contradict medical consensus that smoking during pregnancy increases neonatal mortality.

Yerushalmy’s data, based on more than 15,000 births in the San Francisco Bay Area, showed:

Babies of smoking mothers were, on average, lighter at birth.
Low-birth-weight babies (below 5.5 pounds) had a mortality rate more than twenty times higher than normal-weight babies.
Surprisingly, within the subgroup of low-birth-weight babies, those born to smoking mothers had lower mortality than those of non-smoking mothers.

This paradox persisted for decades until causal inference concepts clarified that the explanation lies in collider bias.

a) Draw a causal diagram (DAG) with the following variables:
* Smoking status of the mother
* Birth weight of the child
* Neonatal mortality
* Other unobserved factors (e.g., genetic abnormalities) that also affect birth weight and mortality

b) Clearly indicate where collider bias arises by explaining why restricting to low-birth-weight infants can lead to the spurious conclusion that smoking appears protective, even if smoking increases mortality overall.

Problem 3.2

Consider a clinical study investigating the effect of a drug ( $X$ ) on patient recovery ( $Y$ ), mediated by blood pressure ( $Z$ ). The causal relationships are:

$X$ (drug) affects $Z$ (post-treatment blood pressure)
$Z$ affects $Y$ (recovery)
$X$ may also have a direct toxic effect on $Y$

At the end of the experiment, the results are summarized in Table 1.

	No drug	Drug
Low BP	81/87 (93%)	234/270 (87%)
High BP	192/263 (73%)	55/80 (69%)
Combined	273/350 (78%)	289/350 (83%)

Table 1: Recovery rates stratified by post-treatment blood pressure.

a) Explain why the results summarized in Table 1 seem paradoxical.

b) Draw the causal graph with the following nodes: $X$ (drug), $Z$ (blood pressure), and $Y$ (recovery).

c) Explain why the backdoor criterion does not apply for estimating the causal effect of $X$ on $Y$.

d) Compute the causal effect numerically using the adjustment formula. In particular, use the front-door adjustment formula to estimate the effect of $X$ on $Y$, accounting for the mediator $Z$.

e) Compare the effect obtained from the aggregated data with the effects in the stratified data (Low BP vs. High BP). Explain why the aggregate effect gives the correct recommendation.

f) Estimate the causal effect using DoWhy:
* Build a small synthetic dataset reflecting the table above.
* Define the causal graph in DoWhy.
* Identify the effect using the front-door criterion.
* Estimate the causal effect numerically and verify that it matches the manual computation.
* Discuss why the stratified data alone can be misleading.

Problem 3.3

For each of the following games, determine whether the backdoor criterion applies to estimate the causal effect of $X$ on $Y$ . Indicate which variables (if any) must be adjusted for.

a) Game 1 (Graph with $X \rightarrow A$ , $A \rightarrow Y$ , $A \rightarrow B$ )
b) Game 2 (Graph with $X$ , $Y$ , $A$ , $B$ , $C$ , $D$ , $E$ )
c) Game 3 (Graph with $X$ , $Y$ , $A$ , $B$ )
d) Game 4 (Graph with $X$ , $Y$ , $A$ , $B$ , $C$ )
e) Game 5 (M-graph structure)
f) Game 6 (Graph with $A, B, C, D, E, F, X, Y$ )
g) Game 7 (Graph with $X, Y, A, B$ where $B$ is unobserved)

How do you proceed if $B$ is unobserved?

Problem 3.4

Consider the following causal DAG:

$U$ (Genotype) $\rightarrow X$ (Smoking)
$U$ (Genotype) $\rightarrow Y$ (Lung Cancer)
$X$ (Smoking) $\rightarrow Z$ (Tar Deposits)
$Z$ (Tar Deposits) $\rightarrow Y$ (Lung Cancer)

where:

$X =$ Smoking {yes, no}
$Z =$ Tar {yes, no}
$Y =$ Lung Cancer {yes, no}
$U =$ (Unobserved) Genotype

We want to compute the Average Causal Effect (ACE):

$$ACE = P(Y=yes|do(X=yes)) - P(Y=yes|do(X=no)) \qquad (1)$$

using the front-door adjustment.

The observed data is summarized in the following tables:

Table 1:

	Tar	No Tar	All Subjects
Smoker	380	20	400
Nonsmoker	20	380	400

Table 2:

	Tar (Smoker)	Tar (Nonsmoker)	No Tar (Smoker)	No Tar (Nonsmoker)
No Cancer	323 (85%)	1 (5%)	18 (90%)	38 (10%)
Cancer	57 (15%)	19 (95%)	2 (10%)	342 (90%)
(Note: Table reconstructed from text context)

a) Compute the conditional probabilities $P(Z|X)$ and $P(Y|X,Z)$ from the data tables.

b) Using the front-door adjustment formula, compute $P(Y=yes|do(X=yes))$ and $P(Y=yes|do(X=no))$ step by step. Assume $P(X=yes)=P(X=no)=0.5$ .

c) Compute the ACE based on your results from the previous step and interpret the causal effect of smoking on lung cancer in this dataset.

Problem 3.5

The Lalonde dataset is a well-known benchmark in causal inference. It contains 445 observations with 12 variables (age, educ, black, hisp, married, nodegr, re74, re75, re78, u74, u75, treat). Our goal is to use the DoWhy library to estimate the causal effect of treatment (treat) on the outcome (re78), while controlling for confounding variables.

a) Import dependencies. Load the required libraries: dowhy, pandas, and numpy.

b) Load the Lalonde dataset.

c) Specify the causal model. Define a CausalModel with:
* Treatment: treat
* Outcome: re78
* Common causes: ["nodegr", "black", "hisp", "age", "educ", "married"]

d) Identify the causal effect. Use DoWhy’s identify_effect function.

e) Estimate the causal effect. Estimate the Average Treatment Effect (ATE) using propensity score weighting. Compare this causal estimate with a naive difference in means.

f) Use the do method to estimate the Average Treatment Effect.

g) Refute the estimate. Perform robustness checks using DoWhy refuters:
* Add a random common cause
* Replace treatment with a placebo (permute)
* Use a random subset of the data

Problem 3.6

In this exercise you will gain hands-on experience with causal inference using the DoWhy Python package. We will use the Infant Health and Development Program (IHDP) dataset. You will use DoWhy to estimate the causal effect of the treatment variable (treatment) on the outcome variable (y_factual). Follow the standard causal inference pipeline: model $\rightarrow$ identify $\rightarrow$ estimate $\rightarrow$ refute.

a) Import dependencies.

b) Load the IHDP dataset.

c) Specify the causal model. Create a CausalModel in DoWhy with treatment as treatment, y_factual as outcome, and all covariates x1...x25 as common causes.

d) Identify the estimand. Use the identify_effect function.

e) Estimate the effect. Estimate the Average Treatment Effect (ATE) using the backdoor linear regression method. Compare your estimated causal effect with the raw difference in means.

f) Refute the estimate. Perform robustness checks using at least two refutation methods.

Solutions

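Solution 3.1

a) Variables: $S$ (Smoking), $B$ (Birth Weight), $D$ (Mortality), $U$ (Birth Defect/Unobserved). Graph: $S \rightarrow B$ , $S \rightarrow D$ , $U \rightarrow B$ , $U \rightarrow D$ .

b) $B$ is a collider on the path $S \rightarrow B \leftarrow U$ . Restricting the analysis to low-birth-weight infants conditions on $B$ and opens the non-causal path $S \rightarrow B \leftarrow U \rightarrow D$ , which creates a spurious association between $S$ and $D$ . Intuitively, a low-birth-weight baby of a smoking mother already has a likely explanation for its low weight (smoking), whereas a low-birth-weight baby of a non-smoking mother is more likely to have a more dangerous cause such as a birth defect. Within this subgroup smoking therefore appears protective, even though it increases mortality overall.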
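The mechanism in part b) can be made concrete with a small exact calculation. The sketch below uses the graph from part a) and entirely hypothetical probabilities (they are not estimates from Yerushalmy's data); the names `P_LOW`, `P_DEATH`, `mortality` and `mortality_low_weight` are illustrative. Smoking raises mortality in every stratum of $U$, yet conditioning on low birth weight reverses the comparison.

```python
# Hypothetical parameters, chosen only to illustrate collider bias.
# They are NOT estimates from Yerushalmy's data.
P_U = 0.05  # P(U=1): probability of a serious birth defect

# P(B=low | S=s, U=u), keyed by (s, u)
P_LOW = {(0, 0): 0.04, (1, 0): 0.12, (0, 1): 0.80, (1, 1): 0.85}

# P(D=1 | S=s, U=u): smoking raises mortality in every stratum of U.
# In the graph of part a) there is no arrow B -> D, so D depends on S and U only.
P_DEATH = {(0, 0): 0.005, (1, 0): 0.010, (0, 1): 0.30, (1, 1): 0.35}


def p_u(u):
    """Marginal P(U=u); S and U are independent in the DAG."""
    return P_U if u == 1 else 1.0 - P_U


def mortality(s):
    """P(D=1 | S=s), averaging over the unobserved U."""
    return sum(p_u(u) * P_DEATH[(s, u)] for u in (0, 1))


def mortality_low_weight(s):
    """P(D=1 | S=s, B=low), i.e. after conditioning on the collider B."""
    joint_low = {u: p_u(u) * P_LOW[(s, u)] for u in (0, 1)}  # P(U=u, B=low | S=s)
    total = sum(joint_low.values())                          # P(B=low | S=s)
    return sum(joint_low[u] / total * P_DEATH[(s, u)] for u in (0, 1))


for s, label in ((0, "non-smoker"), (1, "smoker")):
    print(f"{label:10s} overall={mortality(s):.4f} "
          f"low-weight={mortality_low_weight(s):.4f}")
```

With these assumed numbers, low-birth-weight babies of non-smokers are much more likely to carry the defect than those of smokers, which is what drives the reversal.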

Solution 3.2

a) Paradox: within each blood-pressure stratum the drug group recovers less often than the no-drug group (87% vs. 93% for low BP, 69% vs. 73% for high BP), yet in the aggregated data the drug group recovers more often (83% vs. 78%). This is Simpson’s paradox.

b) Graph: $X \rightarrow Z \rightarrow Y$ and $X \rightarrow Y$ .

c) No backdoor adjustment is needed: there is no backdoor path from $X$ to $Y$ because no arrow points into $X$ . $Z$ is a mediator, not a confounder, so stratifying on it removes the part of the effect that flows through blood pressure.

d) Using front-door adjustment (or direct ACE calculation assuming no unobserved confounding):
$P(Y=1|X=0) = 273/350 \approx 0.780$
$P(Y=1|X=1) = 289/350 \approx 0.826$
$ACE \approx 0.046$ .
Taking the drug increases the recovery probability by about 4.6 percentage points.

f) DoWhy Code:

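from dowhy import CausalModel

# `data` is a pandas DataFrame with binary columns X, Z, Y built from Table 1.
# Define causal graph (node names must match the DataFrame column names)
causal_graph = """digraph { X -> Z; Z -> Y; X -> Y; }"""
model = CausalModel(data=data, treatment='X', outcome='Y', graph=causal_graph)
# Identify and Estimate
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(identified_estimand, method_name="backdoor.linear_regression")
print("Causal Estimate:", estimate.value)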

Solution 3.3

a) Game 1: No backdoor paths. Adjustment set: $\emptyset$ .
b) Game 2: All backdoor paths contain a collider. No adjustment necessary. Adjustment set: $\emptyset$ .
c) Game 3: One backdoor path $X \leftarrow B \rightarrow Y$ . Adjustment set: $\{B\}$ .
d) Game 4: Backdoor path blocked by collider $B$ . Adjustment set: $\emptyset$ .
e) Game 5: Two backdoor paths. Need to control for $\{A, B\}$ or just $\{C\}$ .
f) Game 6: $D$ plus any node in $\{A, B, C\}$ will deconfound.
g) Game 7: If $B$ is unobserved, we cannot close the backdoor path $X \leftarrow B \rightarrow Y$ .
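Answers of this kind can be checked mechanically. The following is a minimal sketch of a backdoor-criterion checker based on path enumeration, suitable only for small graphs. The edges of Game 1 are those given in the problem statement; `FORK` and `M_GRAPH` are assumed example graphs (a minimal graph containing the path $X \leftarrow B \rightarrow Y$ , and an M-shaped graph with a collider $B$ ), not the actual game diagrams, and all function names are illustrative.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def simple_paths(edges, start, goal):
    """Yield every simple path from start to goal, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == goal:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    yield from walk([start])


def is_blocked(edges, path, adj):
    """True if `path` is blocked given the adjustment set `adj`."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if collider:
            # A collider blocks unless it or one of its descendants is adjusted for.
            if mid not in adj and not (descendants(edges, mid) & adj):
                return True
        elif mid in adj:
            # A chain or fork node blocks when it is adjusted for.
            return True
    return False


def satisfies_backdoor(edges, x, y, adj):
    """Check the backdoor criterion for adjustment set `adj` relative to (x, y)."""
    adj = set(adj)
    if adj & descendants(edges, x):
        return False  # no descendant of x may be adjusted for
    edge_set = set(edges)
    for path in simple_paths(edges, x, y):
        starts_with_arrow_into_x = (path[1], x) in edge_set
        if starts_with_arrow_into_x and not is_blocked(edges, path, adj):
            return False
    return True


GAME_1 = [("X", "A"), ("A", "Y"), ("A", "B")]                    # from the problem
FORK = [("B", "X"), ("B", "Y"), ("X", "Y")]                      # assumed example
M_GRAPH = [("A", "X"), ("A", "B"), ("C", "B"), ("C", "Y"), ("X", "Y")]  # assumed

print(satisfies_backdoor(GAME_1, "X", "Y", set()))
print(satisfies_backdoor(FORK, "X", "Y", {"B"}))
print(satisfies_backdoor(M_GRAPH, "X", "Y", {"B"}))
```

In the assumed M-graph, adjusting for the collider $B$ alone opens the path through $A$ and $C$ , which is why the empty set is preferable there.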

Solution 3.4

a) Conditional probabilities calculated from the tables:
$P(Z=yes|X=yes) = 0.95$
$P(Y=yes|X=yes, Z=yes) = 0.15$
(etc.)

b) Front-door adjustment calculation:
$P(Y=yes|do(X=yes)) = 0.5475$
$P(Y=yes|do(X=no)) = 0.5025$

c) $ACE = 0.5475 - 0.5025 = 0.045$ . Smoking increases lung cancer probability by 4.5 percentage points.
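The front-door formula, $P(y|do(x)) = \sum_z P(z|x) \sum_{x'} P(y|x',z) P(x')$ , can be evaluated in a few lines. The sketch below is one way to do it; the dictionary layout and the function name `front_door` are illustrative. The conditional probabilities are entered by hand as assumptions: in particular, $P(Y=yes|X=no, Z=no) = 0.90$ is the value under which the results stated in part b) are reproduced.

```python
LEVELS = ("yes", "no")

# P(X=x), as assumed in the problem statement
P_X = {"yes": 0.5, "no": 0.5}

# P(Z=z | X=x), stored as P_Z_GIVEN_X[x][z] (380/400 and 20/400 from Table 1)
P_Z_GIVEN_X = {
    "yes": {"yes": 0.95, "no": 0.05},
    "no": {"yes": 0.05, "no": 0.95},
}

# P(Y=yes | X=x, Z=z), keyed by (x, z).
# Assumption: the ("no", "no") entry is 0.90, the value that reproduces
# the results 0.5475 and 0.5025 stated in the solution.
P_CANCER = {
    ("yes", "yes"): 0.15,
    ("no", "yes"): 0.95,
    ("yes", "no"): 0.10,
    ("no", "no"): 0.90,
}


def front_door(x):
    """P(Y=yes | do(X=x)) via the front-door adjustment formula."""
    total = 0.0
    for z in LEVELS:
        # Inner sum: effect of Z on Y, adjusting for X as a backdoor variable.
        inner = sum(P_CANCER[(x_prime, z)] * P_X[x_prime] for x_prime in LEVELS)
        # Outer sum: effect of X on Z, which is unconfounded.
        total += P_Z_GIVEN_X[x][z] * inner
    return total


p_do_yes = front_door("yes")
p_do_no = front_door("no")
print(f"P(Y=yes | do(X=yes)) = {p_do_yes:.4f}")
print(f"P(Y=yes | do(X=no))  = {p_do_no:.4f}")
print(f"ACE = {p_do_yes - p_do_no:.4f}")
```

The inner sums are 0.55 for tar and 0.50 for no tar, so the whole effect of smoking comes from shifting the tar probability from 0.05 to 0.95.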

Solution 3.5 (Lalonde Code)

import dowhy
from dowhy import CausalModel
import dowhy.datasets

# Load data
lalonde = dowhy.datasets.lalonde_dataset()

# Model
common_causes = ["nodegr", "black", "hisp", "age", "educ", "married"]
lalonde_model = CausalModel(
    data=lalonde,
    treatment='treat',
    outcome='re78',
    common_causes=common_causes
)

# Identify
lalonde_identified_estimand = lalonde_model.identify_effect()

# Estimate
lalonde_estimate = lalonde_model.estimate_effect(
    lalonde_identified_estimand,
    method_name="backdoor.propensity_score_weighting"
)
print("Causal Estimate:", lalonde_estimate.value)

# Refute
refute = lalonde_model.refute_estimate(
    lalonde_identified_estimand,
    lalonde_estimate,
    method_name="random_common_cause"
)
print(refute)
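To see why a propensity-weighted estimate can differ from the naive difference in means asked for in part e), the following toy sketch may help. It does not use the Lalonde data: the records, the single binary confounder `c`, and all function names are hypothetical, and the basic inverse-propensity estimator shown here may differ in detail from what DoWhy computes (for example in how the propensity model is fitted or how weights are normalized).

```python
def make_toy_data():
    """Hypothetical records (c, t, y): c confounds t and y; the true effect of t is 2."""
    rows = []
    # (confounder value, number treated, number untreated)
    for c, n_treated, n_control in ((0, 2, 6), (1, 6, 2)):
        rows += [(c, 1, 10 * c + 2) for _ in range(n_treated)]
        rows += [(c, 0, 10 * c) for _ in range(n_control)]
    return rows


def naive_difference(rows):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)


def propensity_by_stratum(rows):
    """Estimate e(c) = P(t=1 | c) as the treated fraction within each stratum."""
    scores = {}
    for c in {c for c, _, _ in rows}:
        treatments = [t for ci, t, _ in rows if ci == c]
        scores[c] = sum(treatments) / len(treatments)
    return scores


def ipw_ate(rows):
    """Inverse-propensity-weighted estimate of the average treatment effect."""
    e = propensity_by_stratum(rows)
    n = len(rows)
    treated_term = sum(y / e[c] for c, t, y in rows if t == 1) / n
    control_term = sum(y / (1 - e[c]) for c, t, y in rows if t == 0) / n
    return treated_term - control_term


toy = make_toy_data()
print("naive difference in means:", naive_difference(toy))
print("IPW estimate of the ATE:  ", ipw_ate(toy))
```

In this toy data the treated units are concentrated in the stratum with higher baseline outcomes, so the naive difference overstates the effect, while the weighting recovers the value built into the data.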

Solution 3.6 (IHDP Code)

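import pandas as pd
from dowhy import CausalModel

# Load and process data
# Assumption: `url` points to the IHDP CSV used in the DoWhy example notebook.
url = "https://raw.githubusercontent.com/AMLab-Amsterdam/CEVAE/master/datasets/IHDP/csv/ihdp_npci_1.csv"
cols = ["treatment", "y_factual", "y_cfactual", "mu0", "mu1"] + ["x"+str(i) for i in range(1,26)]
data = pd.read_csv(url, header=None)
data.columns = cols
# DoWhy expects a boolean treatment column for binary treatments
data = data.astype({"treatment": bool})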

# Model
model = CausalModel(
    data=data,
    treatment='treatment',
    outcome='y_factual',
    common_causes=["x"+str(i) for i in range(1,26)]
)

# Identify & Estimate
identified_estimand = model.identify_effect(proceed_when_unidentifiable=True)
estimate = model.estimate_effect(
    identified_estimand,
    method_name="backdoor.linear_regression",
    test_significance=True
)
print("Causal Estimate:", estimate.value)