
Bayesian Networks II

Problem 2.1

Consider the following joint distribution for three binary random variables $a, b, c \in \{0,1\}$:

| $b$ | $a$ | $c$ | $p(a,b,c)$ |
|---|---|---|---|
| 0 | 0 | 0 | 0.192 |
| 0 | 0 | 1 | 0.144 |
| 0 | 1 | 0 | 0.048 |
| 0 | 1 | 1 | 0.216 |
| 1 | 0 | 0 | 0.192 |
| 1 | 0 | 1 | 0.064 |
| 1 | 1 | 0 | 0.048 |
| 1 | 1 | 1 | 0.096 |

(a) Show that $a$ and $b$ are dependent, i.e. $p(a,b) \ne p(a)p(b)$.

(b) Show that $a$ and $b$ are conditionally independent given $c$, i.e.

$$p(a,b|c) = p(a|c)\,p(b|c)$$

Problem 2.2

Conditional independence properties entailed by a Bayesian Network can be read directly from the graph, using the notion of d-separation. Given the following Bayesian Network (Graph nodes: A, B, C, D, E, F, G, H, I, J), which of the following conditional independence statements hold?

a) $A \perp F$
b) $A \perp G$
c) $B \perp I \,|\, F$
d) $D \perp J \,|\, G,H$
e) $I \perp B \,|\, H$
f) $J \perp D$
g) $I \perp C \,|\, H,F$


Problem 2.3

In the Cyanobacteria example, we have the following variables: $\mathcal{U}=\{T,F,C,M,W\}$, where

  • Temperature: $T \in \{cold, hot\}$

  • Presence of fertilizer in water: $F \in \{yes, no\}$

  • Presence of cyanobacteria in water: $C \in \{yes, no\}$

  • Fish mortality: $M \in \{yes, no\}$

  • Water color: $W \in \{clear, green\}$

The joint factorization is

$$\mathbb{P}(T,F,C,M,W) = p(t)\,p(f)\,p(c|t,f)\,p(w|c)\,p(m|c)$$

Assume the probability distributions are given by:

$$p(t=cold)=0.4, \quad p(t=hot)=0.6$$
$$p(f=yes)=0.2, \quad p(f=no)=0.8$$

The conditional probability table (CPT) for $p(c|t,f)$ is:

|  | $t=cold$ | $t=hot$ |
|---|---|---|
| $f=yes$ | $p(c=yes)=0.5$ | $p(c=yes)=0.95$ |
| $f=no$ | $p(c=yes)=0.05$ | $p(c=yes)=0.8$ |

The conditional probability table (CPT) for $p(m|c)$ is:

|  | $c=yes$ | $c=no$ |
|---|---|---|
| $m=yes$ | 0.6 | 0.1 |
| $m=no$ | 0.4 | 0.9 |

The conditional probability table (CPT) for $p(w|c)$ is:

|  | $c=yes$ | $c=no$ |
|---|---|---|
| $w=clear$ | 0.7 | 0.2 |
| $w=green$ | 0.3 | 0.8 |

We want to compute the posterior probability of fish mortality given colored water:

$$p(m|w=green) = \frac{p(m, w=green)}{p(w=green)}$$

a) Eliminate variables sequentially by working with the chain-rule factorization of the joint distribution. The joint probability is

$$\mathbb{P}(F,T,W,M,C) = p(t)\,p(f)\,p(c|t,f)\,p(w|c)\,p(m|c)$$

where computing $p(m|w=green)$ amounts to evaluating the unnormalized marginal

$$p(m, w=green) = \sum_{t,f,c} p(t)\,p(f)\,p(c|t,f)\,p(w=green|c)\,p(m|c)$$

and then normalizing by $p(w=green)$.

Describe the procedure for computing $p(m|w=green)$ by indicating the dimensions of the intermediate conditional probability tables.

b) Compute the conditional probability table (CPT) for $p(c,f|t)$.
c) Compute the conditional probability table (CPT) for $p(c|t)$.
d) Compute the probability table for $p(c,t)$.
e) Compute the probability $p(c)$.
f) Compute the probability table for $p(m,c)$.
g) Compute the probability table for $p(m, w=green, c)$.
h) Compute the probability table for $p(m, w=green)$.
i) Compute $p(w=green)$.
j) Compute the conditional probability $p(m|w=green)$.
k) Compute the conditional probability $p(m|w=green)$ by means of pgmpy.

l) MLE and MAP (Laplace smoothing): For the conditional probability $p(c=yes|t=hot,f=yes)$: (i) give the maximum-likelihood estimator (MLE) in terms of counts, and (ii) give the MAP estimator using Laplace smoothing (add-one).

m) Handling missing entries (expected counts for MAP): Given the dataset with missing entries (observations $\mathcal{D}_i$), write the symbolic expression for the expected count $\mathbb{E}[\#\{c=yes, t=hot, f=yes\}]$ used in the expectation-maximization / MAP step. Explain briefly how this expectation is computed for (i) an observation with no missing entries, and (ii) an observation with some missing variables (show one short example).
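
For reference, a generic count-based sketch of these estimators; the notation $N(\cdot)$ for dataset counts is an assumption of this sketch and not fixed by the problem:

$$\hat{p}_{\mathrm{MLE}}(c=yes|t=hot,f=yes) = \frac{N(c=yes,t=hot,f=yes)}{N(t=hot,f=yes)}, \qquad \hat{p}_{\mathrm{MAP}}(c=yes|t=hot,f=yes) = \frac{N(c=yes,t=hot,f=yes)+1}{N(t=hot,f=yes)+2}$$

$$\mathbb{E}[\#\{c=yes,t=hot,f=yes\}] = \sum_{i} p(c=yes,t=hot,f=yes \,|\, \mathcal{D}_i)$$

For a fully observed $\mathcal{D}_i$ the summand is the indicator of whether the observation matches $(c=yes,t=hot,f=yes)$; for a partially observed $\mathcal{D}_i$ it is the posterior probability of the missing values given the observed ones. The $+2$ in the MAP denominator reflects that $c$ is binary.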


Problem 2.4

Suppose you wish to perform variable elimination on the Bayesian Network shown in the graph (Nodes: A, B, C, D, E, F, G, H, I, J). Consider the following variable elimination ordering: A, B, C, D, E, F, G, H, I, J.

For each iteration of the algorithm (i.e., for each variable in the ordering), determine which factors are removed and which new factors are introduced. As an example, in the first iteration (eliminating variable A), the factors $P(A)$ and $P(B|A)$ are removed, and a new factor $g_1(B)$ is introduced.


Problem 2.5

In this exercise, you will use the variable elimination algorithm to perform inference on a Bayesian Network. Consider the network with nodes A, B, C, D, E, F and the corresponding CPTs:

$P(A=t)=0.3, \quad P(C=t)=0.6$

(a) $P(B|A,C)$:

| $A$ | $C$ | $P(B=t)$ |
|---|---|---|
| f | f | 0.2 |
| f | t | 0.8 |
| t | f | 0.3 |
| t | t | 0.5 |

(b) $P(D|C)$:

| $C$ | $P(D=t)$ |
|---|---|
| f | 0.9 |
| t | 0.75 |

(c) $P(E|B)$:

| $B$ | $P(E=t)$ |
|---|---|
| f | 0.2 |
| t | 0.4 |

(d) $P(F|D,E)$:

| $D$ | $E$ | $P(F=t)$ |
|---|---|---|
| f | f | 0.95 |
| f | t | 1.00 |
| t | f | 0.00 |
| t | t | 0.25 |

Assuming a query on $A$ with evidence for $B$ and $D$, i.e. computing $P(A|B,D)$, use the variable elimination algorithm to answer the following queries:

a) $P(A=t|B=t,D=f)$
b) $P(A=f|B=f,D=f)$
c) $P(A=t|B=t,D=t)$

Consider now the variable elimination ordering: C, E, F, D, B, A. Use again the variable elimination algorithm and write down the intermediate factors, this time without computing their probability tables. Discuss whether this ordering is better or worse than the one used previously, and explain why.
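
For checking the hand computations, a minimal pgmpy sketch of this network is given below; the state labels 'f'/'t', the edge list read off from the CPTs, and the example query in the last step are assumptions of this sketch rather than part of the problem statement.

# Sketch: Problem 2.5 network in pgmpy (assumed state labels 'f'/'t')
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Structure implied by the CPTs: A, C are roots; B depends on A, C; D on C; E on B; F on D, E
model = DiscreteBayesianNetwork([('A', 'B'), ('C', 'B'), ('C', 'D'),
                                 ('B', 'E'), ('D', 'F'), ('E', 'F')])

# Rows are the child states 'f' and 't'; columns enumerate the evidence states
# in the order given by evidence / evidence_card (first evidence variable varies slowest)
cpd_A = TabularCPD('A', 2, [[0.7], [0.3]], state_names={'A': ['f', 't']})
cpd_C = TabularCPD('C', 2, [[0.4], [0.6]], state_names={'C': ['f', 't']})
cpd_B = TabularCPD('B', 2,
                   [[0.8, 0.2, 0.7, 0.5],   # P(B=f | A, C) for (A,C) = ff, ft, tf, tt
                    [0.2, 0.8, 0.3, 0.5]],  # P(B=t | A, C)
                   evidence=['A', 'C'], evidence_card=[2, 2],
                   state_names={'B': ['f', 't'], 'A': ['f', 't'], 'C': ['f', 't']})
cpd_D = TabularCPD('D', 2, [[0.1, 0.25], [0.9, 0.75]],
                   evidence=['C'], evidence_card=[2],
                   state_names={'D': ['f', 't'], 'C': ['f', 't']})
cpd_E = TabularCPD('E', 2, [[0.8, 0.6], [0.2, 0.4]],
                   evidence=['B'], evidence_card=[2],
                   state_names={'E': ['f', 't'], 'B': ['f', 't']})
cpd_F = TabularCPD('F', 2,
                   [[0.05, 0.00, 1.00, 0.75],   # P(F=f | D, E) for (D,E) = ff, ft, tf, tt
                    [0.95, 1.00, 0.00, 0.25]],  # P(F=t | D, E)
                   evidence=['D', 'E'], evidence_card=[2, 2],
                   state_names={'F': ['f', 't'], 'D': ['f', 't'], 'E': ['f', 't']})

model.add_cpds(cpd_A, cpd_C, cpd_B, cpd_D, cpd_E, cpd_F)
assert model.check_model()

# Example query, part a): P(A | B=t, D=f)
infer = VariableElimination(model)
print(infer.query(variables=['A'], evidence={'B': 't', 'D': 'f'}))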


Problem 2.6

In the Cyanobacteria example (variables $\mathcal{U}=\{T,F,C,M,W\}$), the joint factorization is:

$$\mathbb{P}(T,F,C,M,W) = p(t)\,p(f)\,p(c|t,f)\,p(w|c)\,p(m|c)$$

a) For the conditional probability $p(c=yes|t=hot,f=yes)$: (i) give the maximum-likelihood estimator (MLE) in terms of counts; (ii) give the MAP estimator using Laplace smoothing (add-one).

b) We have a partially observed Bayesian Network and want to estimate $p(c=yes|t=hot,f=yes)$. Estimate the count by the expected count $\mathbb{E}(\#\{c=yes,t=hot,f=yes\})$.
(i) For $\mathcal{D}_1 = (cold, ?, yes, ?, clear)$, compute $p(c=yes,t=hot,f=yes|\mathcal{D}_1)$.
(ii) For $\mathcal{D}_2 = (?, ?, yes, yes, clear)$, compute $p(c=yes,t=hot,f=yes|\mathcal{D}_2)$.
(iii) For $\mathcal{D}_4 = (hot, yes, yes, yes, green)$, compute $p(c=yes,t=hot,f=yes|\mathcal{D}_4)$.
(iv) Indicate in symbolic notation the MAP estimate using the expected counts obtained.

c) Read in cyanobacteria_data.csv and learn the parameters using pgmpy. (i) Learn the parameters without smoothing. (ii) Learn the parameters with smoothing. Predict $p(M|W=green)$.

d) Assume a dataset with unobserved fertilizer ($F$): cyanobacteria_unobserved_fertilizer.csv. Using Expectation Maximization (EM), estimate the values of the unobserved fertilizer variable using pgmpy. Then predict $p(M|W=green)$.


Problem 2.7

Consider the following causal DAG with edges: $U$ (Genotype) $\rightarrow$ $X$ (Smoking), $U$ (Genotype) $\rightarrow$ $Y$ (Lung Cancer), $X$ (Smoking) $\rightarrow$ $Z$ (Tar Deposits), and $Z$ (Tar Deposits) $\rightarrow$ $Y$ (Lung Cancer).

We define the joint distribution $\mathbb{P}(X,Z,Y) = P(X)\,P(Z|X)\,P(Y|Z)$ (marginalizing over $U$ implicitly in the CPTs provided for this exercise):

$P(X=1)=0.5, \quad P(X=0)=0.5$

$P(Z|X)$:

|  | $Z=0$ | $Z=1$ |
|---|---|---|
| $X=0$ | 0.95 | 0.05 |
| $X=1$ | 0.05 | 0.95 |

$P(Y|Z)$:

|  | $Y=0$ | $Y=1$ |
|---|---|---|
| $Z=0$ | 0.14 | 0.86 |
| $Z=1$ | 0.24 | 0.76 |

a) Construct the Bayesian network in pgmpy.
b) Learn the parameters from the given CPTs.
c) Compute the probabilistic (associational) query $P(Y=1|X=1)$.
d) (Optional) Explain briefly why this differs from the causal effect of smoking on lung cancer.


Solutions

Solution 2.1 a) & b) Calculations provided showing $p(a,b) \neq p(a)p(b)$ but $p(a,b|c) = p(a|c)p(b|c)$.
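
A short numpy check of both statements; the array layout below, indexed as p[a, b, c], is a choice made for this sketch and not part of the original solution:

import numpy as np

# Joint distribution from Problem 2.1, indexed as p[a, b, c]
p = np.array([
    [[0.192, 0.144],    # a=0, b=0, c=0/1
     [0.192, 0.064]],   # a=0, b=1, c=0/1
    [[0.048, 0.216],    # a=1, b=0, c=0/1
     [0.048, 0.096]],   # a=1, b=1, c=0/1
])

# (a) marginal dependence: p(a,b) != p(a) p(b)
p_ab = p.sum(axis=2)
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)
print(np.allclose(p_ab, np.outer(p_a, p_b)))   # False -> a and b are dependent

# (b) conditional independence given c: p(a,b|c) = p(a|c) p(b|c)
p_c = p.sum(axis=(0, 1))
p_ab_given_c = p / p_c                  # shape (2, 2, 2), broadcast over c
p_a_given_c = p.sum(axis=1) / p_c       # shape (2, 2), indexed [a, c]
p_b_given_c = p.sum(axis=0) / p_c       # shape (2, 2), indexed [b, c]
print(np.allclose(p_ab_given_c,
                  p_a_given_c[:, None, :] * p_b_given_c[None, :, :]))   # True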

Solution 2.2 a) No, b) Yes, c) No, d) Yes, e) Yes, f) No, g) No.

Solution 2.3 (a - k) Calculations for the variable elimination steps provided. Result for j): $p(m=yes|w=green) = 0.26$, $p(m=no|w=green) = 0.74$.
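
As a cross-check of this result, a brute-force numpy sketch that builds the full joint and marginalizes (rather than eliminating variables); the index convention 0 = cold/yes/clear and 1 = hot/no/green is an assumption of this sketch:

import numpy as np

# CPTs from Problem 2.3; index 0 = first listed state (cold / yes / clear)
p_t = np.array([0.4, 0.6])                  # p(t)
p_f = np.array([0.2, 0.8])                  # p(f)
p_c_tf = np.array([[[0.5, 0.5],             # p(c | t, f), indexed [t, f, c]
                    [0.05, 0.95]],
                   [[0.95, 0.05],
                    [0.8, 0.2]]])
p_m_c = np.array([[0.6, 0.4],               # p(m | c), indexed [c, m]
                  [0.1, 0.9]])
p_w_c = np.array([[0.7, 0.3],               # p(w | c), indexed [c, w]
                  [0.2, 0.8]])

# Full joint p(t, f, c, m, w) via broadcasting, then marginalize
joint = (p_t[:, None, None, None, None]
         * p_f[None, :, None, None, None]
         * p_c_tf[:, :, :, None, None]
         * p_m_c[None, None, :, :, None]
         * p_w_c[None, None, :, None, :])
p_m_wgreen = joint[..., 1].sum(axis=(0, 1, 2))    # p(m, w=green)
print(p_m_wgreen / p_m_wgreen.sum())              # approx. [0.26, 0.74]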

Solution 2.6 (Python Code Snippet)

# Compute p(m | w = "green") using pgmpy
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# 1. Define the Bayesian Network structure
model = DiscreteBayesianNetwork([
    ('T', 'C'),
    ('F', 'C'),
    ('C', 'M'),
    ('C', 'W')
])

# 2. Define the Conditional Probability Tables (CPTs)
cpd_T = TabularCPD(variable='T', variable_card=2, values=[[0.4], [0.6]], state_names={'T': ['cold', 'hot']})
cpd_F = TabularCPD(variable='F', variable_card=2, values=[[0.2], [0.8]], state_names={'F': ['yes', 'no']})
cpd_C = TabularCPD(variable='C', variable_card=2,
                   values=[[0.5, 0.05, 0.95, 0.8], [0.5, 0.95, 0.05, 0.2]],
                   evidence=['T', 'F'], evidence_card=[2, 2],
                   state_names={'C': ['yes', 'no'], 'T': ['cold', 'hot'], 'F': ['yes', 'no']})
cpd_M = TabularCPD(variable='M', variable_card=2,
                   values=[[0.6, 0.1], [0.4, 0.9]],
                   evidence=['C'], evidence_card=[2],
                   state_names={'M': ['yes', 'no'], 'C': ['yes', 'no']})
cpd_W = TabularCPD(variable='W', variable_card=2,
                   values=[[0.7, 0.2], [0.3, 0.8]],
                   evidence=['C'], evidence_card=[2],
                   state_names={'W': ['clear', 'green'], 'C': ['yes', 'no']})

# 3. Add CPTs to the model
model.add_cpds(cpd_T, cpd_F, cpd_C, cpd_M, cpd_W)
assert model.check_model()

# 4. Perform inference
infer = VariableElimination(model)
posterior = infer.query(variables=['M'], evidence={'W': 'green'})
print(posterior)
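
For parts c) and d) of Problem 2.6, a possible parameter-learning workflow is sketched below. The CSV file names come from the problem statement, but the column names (T, F, C, M, W), the state labels in the CSV, and the treatment of the fertilizer variable as fully latent for EM are assumptions of this sketch; the K2 prior in BayesianEstimator corresponds to add-one (Laplace) smoothing.

# Sketch: parameter learning for Problem 2.6 c) and d), assuming CSV columns T, F, C, M, W
import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator, ExpectationMaximization
from pgmpy.inference import VariableElimination

edges = [('T', 'C'), ('F', 'C'), ('C', 'M'), ('C', 'W')]
data = pd.read_csv("cyanobacteria_data.csv")

# c) (i) maximum-likelihood parameters (no smoothing)
model_mle = DiscreteBayesianNetwork(edges)
model_mle.fit(data, estimator=MaximumLikelihoodEstimator)

# c) (ii) add-one (Laplace) smoothing via the K2 prior (all pseudo-counts set to 1)
model_map = DiscreteBayesianNetwork(edges)
model_map.fit(data, estimator=BayesianEstimator, prior_type="K2")

# Predict p(M | W = green) with the smoothed model
print(VariableElimination(model_map).query(variables=['M'], evidence={'W': 'green'}))

# d) EM with the fertilizer variable F treated as latent (binary)
data_missing = pd.read_csv("cyanobacteria_unobserved_fertilizer.csv")
model_em = DiscreteBayesianNetwork(edges, latents={'F'})
em = ExpectationMaximization(model_em, data_missing)
model_em.add_cpds(*em.get_parameters(latent_card={'F': 2}))
print(VariableElimination(model_em).query(variables=['M'], evidence={'W': 'green'}))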

Solution 2.7 (Python Code Snippet)

from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.inference import VariableElimination

# Define structure
model = DiscreteBayesianNetwork([("X", "Z"), ("Z", "Y")])

# Define CPTS
cpd_x = TabularCPD("X", 2, [[0.5], [0.5]])
cpd_z = TabularCPD("Z", 2, [[0.95, 0.05], [0.05, 0.95]], evidence=["X"], evidence_card=[2])
cpd_y = TabularCPD("Y", 2, [[0.14, 0.24], [0.86, 0.76]], evidence=["Z"], evidence_card=[2])

# Add to model
model.add_cpds(cpd_x, cpd_z, cpd_y)
model.check_model()

# Inference
infer = VariableElimination(model)
q = infer.query(variables=["Y"], evidence={"X": 1})
print(q)  # P(Y=1|X=1) = 0.05*0.86 + 0.95*0.76 = 0.765