Bayesian Networks - Learning from Data

Learning a Bayesian network can be split into two problems:

  1. Parameter learning: Given a set of data samples and a DAG that captures the dependencies between the variables, estimate the (conditional) probability distributions of the individual variables.

  2. Structure learning: Given a set of data samples, estimate a DAG that captures the dependencies between the variables.

Conditional Independence

In a Bayesian network, we represent observed variables by shaded nodes; all other variables are understood as unobserved (hidden) and are drawn as unshaded nodes.

Three elementary patterns appear repeatedly in Bayesian networks:

Chain (serial connection): $A \to B \to C$. Dependence flows along the chain unless the middle node $B$ is observed.

Fork (diverging connection): $A \leftarrow B \to C$. The common cause $B$ induces dependence between $A$ and $C$ unless $B$ is observed.

Collider (converging connection): $A \to B \leftarrow C$. $A$ and $C$ are independent, but become dependent once $B$ (or one of its descendants) is observed.

d-Separation (directional separation)

Think of paths as pipes and statistical dependence as water:

  • If at least one path is open, dependence can flow.

  • If all paths are blocked, then $A$ and $C$ are conditionally independent given $Z$.

Testing Conditional Independence in Data

Given a candidate causal DAG $G$ that could have generated a dataset $D$:

  1. Use d-Separation to list all implied conditional independencies (e.g. $X \perp Z \mid Y$).

  2. For each implied independence, test on data:

    • Use regression or other statistical tools,

    • Check whether the estimated dependence is approximately zero.

For example, consider a candidate graph $G$ over variables including $W$, $X$, and $Z_1$ (figure omitted):

  • Suppose $G$ implies $W \perp Z_1 \mid X$.

  • Regress $W$ on $X$ and $Z_1$ using $w = \beta_X x + \beta_{Z_1} z_1$,

  • If $\beta_{Z_1} \approx 0$ (statistically insignificant), this is consistent with the model.

  • If $\beta_{Z_1}$ is clearly non-zero, the independence fails in the data, so $G$ must be rejected.
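
As a minimal sketch of such a check, the following uses simulated data from a linear-Gaussian model in which $W \perp Z_1 \mid X$ holds by construction; all variable names and coefficient values here are made-up assumptions, not part of the original example:

import numpy as np
import statsmodels.api as sm

# Simulated data with Z1 -> X -> W, so W depends on Z1 only through X.
rng = np.random.default_rng(0)
z1 = rng.normal(size=1000)
x = 0.8 * z1 + rng.normal(size=1000)
w = 1.5 * x + rng.normal(size=1000)

# Regress W on X and Z1 and inspect the coefficient of Z1.
design = sm.add_constant(np.column_stack([x, z1]))
fit = sm.OLS(w, design).fit()
print(fit.params)   # [const, beta_X, beta_Z1]; beta_Z1 should be close to 0
print(fit.pvalues)  # ... and statistically insignificant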

Markov equivalence and equivalence classes

Many different DAGs can induce the same set of conditional independencies. These DAGs are Markov equivalent.

Factorization examples

Consider three variables $X, Y, Z$ where we know that

$$X \perp Y \mid Z.$$

Several factorizations of $P(X,Y,Z)$ are consistent with this:

  • One possible factorization:

    $$P(x,y,z) = P(y \mid x,z)\,P(z \mid x)\,P(x)$$

    If $P(y \mid x,z)$ does not depend on $x$, we can write this as

    $$P(x,y,z) = P(y \mid z)\,P(z \mid x)\,P(x).$$
  • Another factorization:

    $$P(x,y,z) = P(x \mid y,z)\,P(y \mid z)\,P(z) = P(x \mid z)\,P(y \mid z)\,P(z).$$

Different DAGs correspond to different factorization orders, but if they encode the same conditional independencies ($X \perp Y \mid Z$), they belong to the same Markov equivalence class.

Parameter sharing

In many models, multiple nodes share the same local conditional distribution. This is called parameter sharing.

Consider a chain of four binary variables:

$$X \to Y \to Z \to W$$

Estimating the full joint distribution

  • Full joint distribution $P(X,Y,Z,W)$ has $2^4 = 16$ entries (15 degrees of freedom)

  • With limited data, many combinations $(x,y,z,w)$ are rare or absent

  • Frequency estimates for these cells are unreliable

Using factorization

$$P(X,Y,Z,W) = p(X)\,p(Y \mid X)\,p(Z \mid Y)\,p(W \mid Z)$$
  • Each factor involves at most 2 variables

  • We estimate each conditional with more data per parameter

  • This yields more stable estimates from the same dataset

    $\implies$ Exploiting the graph structure dramatically reduces the number of parameters and improves estimation in high dimensions.
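
This parameter count can be sanity-checked with a few lines of code; the helper below is purely illustrative (not part of any library):

# Free parameters of the chain X -> Y -> Z -> W with binary variables:
# each node contributes (|states| - 1) * (number of parent configurations).
parents = {'X': [], 'Y': ['X'], 'Z': ['Y'], 'W': ['Z']}
card = {v: 2 for v in parents}

def num_free_params(parents, card):
    total = 0
    for v, ps in parents.items():
        configs = 1
        for p in ps:
            configs *= card[p]
        total += (card[v] - 1) * configs
    return total

print(num_free_params(parents, card))  # 7 (= 1 + 2 + 2 + 2)
print(2 ** len(parents) - 1)           # 15 for the unrestricted joint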

Hidden Markov Model (HMM)

  • Hidden states $H_1,\dots,H_n$

  • Observations $E_1,\dots,E_n$

$$P(H,E) = p_{\text{start}}(h_1) \prod_{i=2}^n p_{\text{trans}}(h_i \mid h_{i-1}) \prod_{i=1}^n p_{\text{emit}}(e_i \mid h_i).$$
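
As a minimal sketch, this factorization can be evaluated directly for one hidden/observed sequence; the toy tables below are made-up assumptions for a two-state HMM with binary observations:

import numpy as np

p_start = np.array([0.6, 0.4])             # p_start(h_1)
p_trans = np.array([[0.7, 0.3],            # p_trans(h_i | h_{i-1}), rows: h_{i-1}
                    [0.2, 0.8]])
p_emit = np.array([[0.9, 0.1],             # p_emit(e_i | h_i), rows: h_i
                   [0.3, 0.7]])

h = [0, 0, 1]  # hidden state sequence h_1..h_3
e = [0, 1, 1]  # observation sequence e_1..e_3

# P(H, E) = p_start(h_1) * prod p_trans(h_i | h_{i-1}) * prod p_emit(e_i | h_i)
prob = p_start[h[0]] * p_emit[h[0], e[0]]
for i in range(1, len(h)):
    prob *= p_trans[h[i - 1], h[i]] * p_emit[h[i], e[i]]
print(prob)  # joint probability of this particular (h, e) pair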

Probabilistic Inference

Input:

  • A Bayesian network $P(X_1,\dots,X_n)$,

  • Evidence $E=e$ for some variables $E \subseteq X$,

  • Query variables $Q \subseteq X$.

Output:

$$P(Q \mid E=e) \quad \leftrightarrow \quad P(Q = q \mid E = e) \quad \text{for all values } q$$

The output is a table that specifies a probability for each assignment of values to $Q$.
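
In pgmpy, such queries can be answered with variable elimination. A small sketch using the bundled "asia" example network (the variable and state names below belong to that example, not to this text):

from pgmpy.inference import VariableElimination
from pgmpy.utils import get_example_model

# Query P(dysp | smoke = 'yes') in the 'asia' network.
model = get_example_model("asia")
infer = VariableElimination(model)

result = infer.query(variables=["dysp"], evidence={"smoke": "yes"})
print(result)  # a table with one probability per value of 'dysp'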

Parameter learning with fully observed data

Assume:

  • The BN structure (graph) is known,

  • Training data $D_{\text{train}}$ consists of full assignments to all variables $X_1,\dots,X_n$,

  • We want to estimate the parameters $\vartheta$ (local conditional probabilities).

The maximum likelihood objective is

$$\max_{\vartheta} \prod_{x \in D_{\text{train}}} P(x; \vartheta) = \max_{\vartheta} \prod_{x \in D_{\text{train}}} \prod_{i=1}^n p_\vartheta\big(x_i \mid x_{\text{Parents}(i)}\big).$$

Taking logs converts the product into a sum, and the problem decouples across variables and parent configurations.

For each node $X_i$ and parent assignment $x_{\text{Parents}(i)}$, we solve:

$$\max_{p(\cdot \mid x_{\text{Parents}(i)})} \sum_{x_i} N(x_i, x_{\text{Parents}(i)}) \log p(x_i \mid x_{\text{Parents}(i)})$$

subject to $\sum_{x_i} p(x_i \mid x_{\text{Parents}(i)}) = 1$.

Solution: maximum likelihood via count-and-normalize,

$$\hat{p}(x_i \mid x_{\text{Parents}(i)}) = \frac{N(x_i, x_{\text{Parents}(i)})}{\sum_{x_i'} N(x_i', x_{\text{Parents}(i)})},$$

where $N(\cdot)$ counts how often each configuration appears in $D_{\text{train}}$.

Using pgmpy for MLE

Suppose we have the following data:

Source
import pandas as pd

data = pd.DataFrame(data={'fruit': ["banana", "apple", "banana", "apple", "banana", "apple", "banana",
                                    "apple", "apple", "apple", "banana", "banana", "apple", "banana"],
                          'tasty': ["yes", "no", "yes", "yes", "yes", "yes", "yes",
                                    "yes", "yes", "yes", "yes", "no", "no", "no"],
                          'size': ["large", "large", "large", "small", "large", "large", "large",
                                   "small", "large", "large", "large", "large", "small", "small"]})
data.head()
    fruit tasty   size
0  banana   yes  large
1   apple    no  large
2  banana   yes  large
3   apple   yes  small
4  banana   yes  large

We know that the variables relate as follows:

Source
import graphviz
from pgmpy.models import DiscreteBayesianNetwork

model = DiscreteBayesianNetwork([('fruit', 'tasty'), ('size', 'tasty')])  # fruit -> tasty <- size
graphviz.Source(model.to_graphviz().string())

Parameter learning is the task of estimating the values of the conditional probability distributions (CPDs) for the variables fruit, size, and tasty.

State counts

To make sense of the given data, we can start by counting how often each state of the variable occurs. If the variable is dependent on parents, the counts are done conditionally on the parents' states, i.e. separately for each parent configuration:

Source
from pgmpy.estimators import ParameterEstimator
pe = ParameterEstimator(model, data)
print("\n", pe.state_counts('fruit'))  # unconditional
print("\n", pe.state_counts('tasty'))  # conditional on fruit and size

         count
fruit        
apple       7
banana      7

 fruit apple       banana      
size  large small  large small
tasty                         
no      1.0   1.0    1.0   1.0
yes     3.0   2.0    5.0   0.0

We can see, for example, that as many apples as bananas were observed, and that 5 of the 6 observed large bananas were tasty, while 1 was not.

Maximum Likelihood Estimation

A natural estimate for the CPDs is to simply use the relative frequencies with which the variable states have occurred. We observed 7 apples among a total of 14 fruits, so we might guess that about 50% of fruits are apples.

This approach is Maximum Likelihood Estimation (MLE). According to MLE, we should fill the CPDs in such a way that $P(\text{data} \mid \text{model})$ is maximal. This is achieved when using the relative frequencies. pgmpy supports MLE as follows:

Source
from pgmpy.estimators import MaximumLikelihoodEstimator
mle = MaximumLikelihoodEstimator(model, data)
print(mle.estimate_cpd('fruit'))  # unconditional
print(mle.estimate_cpd('tasty'))  # conditional
+---------------+-----+
| fruit(apple)  | 0.5 |
+---------------+-----+
| fruit(banana) | 0.5 |
+---------------+-----+
+------------+--------------+-----+---------------+
| fruit      | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size       | size(large)  | ... | size(small)   |
+------------+--------------+-----+---------------+
| tasty(no)  | 0.25         | ... | 1.0           |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75         | ... | 0.0           |
+------------+--------------+-----+---------------+
  • mle.estimate_cpd(variable) computes the state counts and divides each cell by the (conditional) sample size.

  • The mle.get_parameters()-method returns a list of CPDs for all variables of the model.

  • The built-in fit()-method of BayesianNetwork provides more convenient access to parameter estimators:

# Calibrate all CPDs of `model` using MLE:
model.fit(data, estimator=MaximumLikelihoodEstimator)

print(model.get_cpds('fruit'))
print(model.get_cpds('tasty'))

# Or alternatively, if you want to get all CPDs at once:
# for cpd in model.get_cpds():
#     print(cpd)
+---------------+-----+
| fruit(apple)  | 0.5 |
+---------------+-----+
| fruit(banana) | 0.5 |
+---------------+-----+
+------------+--------------+-----+---------------+
| fruit      | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size       | size(large)  | ... | size(small)   |
+------------+--------------+-----+---------------+
| tasty(no)  | 0.25         | ... | 1.0           |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75         | ... | 0.0           |
+------------+--------------+-----+---------------+

While very straightforward, the ML estimator has the problem of overfitting to the data. In the above CPD, the probability of a large banana being tasty is estimated at 0.833, because 5 out of 6 observed large bananas were tasty. Fine. But note that the probability of a small banana being tasty is estimated at 0.0, because we observed only one small banana and it happened to be not tasty. That should hardly make us certain that small bananas aren't tasty!

We simply do not have enough observations to rely on the observed frequencies. If the observed data is not representative of the underlying distribution, ML estimates can be extremely far off.

Laplace smoothing and Dirichlet priors

Pure maximum likelihood can assign zero probability to unseen events:

  • If some $(x_i, x_{\text{Parents}(i)})$ combination never appears in the training data, then

    $$\hat{p}(x_i \mid x_{\text{Parents}(i)}) = 0.$$

To avoid this, we use Laplace smoothing:

  • Add $\lambda > 0$ to each count before normalization.

For discrete $X_i$:

$$\hat{p}(x_i \mid x_{\text{Parents}(i)}) = \frac{\lambda + N(x_i, x_{\text{Parents}(i)})}{\sum_{x_i'} \big(\lambda + N(x_i', x_{\text{Parents}(i)})\big)}.$$

Interpretation

  • Laplace smoothing is equivalent to placing a Dirichlet prior on each conditional distribution and performing MAP estimation.

  • Larger $\lambda$ produces stronger smoothing (probabilities closer to uniform).

  • As we collect more data, the posterior concentrates around the empirical frequencies.
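
A tiny numeric sketch of the formula, using the small-banana counts from the fruit example above (one not-tasty, zero tasty observations):

import numpy as np

# Counts N(tasty = no), N(tasty = yes) given (banana, small) from the example.
counts = np.array([1.0, 0.0])
lam = 1.0  # Laplace smoothing parameter λ

smoothed = (counts + lam) / (counts + lam).sum()
print(smoothed)  # [0.667 0.333] instead of the raw MLE [1.0, 0.0]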

Smoothing in Python

The Bayesian parameter estimator starts with prior CPDs that express our beliefs about the variables before the data was observed. These priors are then updated using the state counts from the observed data.

One can think of the priors as pseudo state counts that are added to the actual counts before normalization. Unless one wants to encode specific beliefs about the distributions of the variables, one commonly chooses uniform priors, i.e. ones that deem all states equiprobable.

A very simple prior is the so-called K2 prior (Laplace smoothing), which simply adds 1 to the count of every single state. A somewhat more sensible choice of prior is BDeu (Bayesian Dirichlet equivalent uniform prior). For BDeu we need to specify an equivalent sample size N and then the pseudo-counts are the equivalent of having observed N uniform samples of each variable (and each parent configuration):

from pgmpy.estimators import BayesianEstimator
est = BayesianEstimator(model, data)

print(est.estimate_cpd('tasty', prior_type='BDeu', equivalent_sample_size=10))
+------------+---------------------+-----+---------------------+
| fruit      | fruit(apple)        | ... | fruit(banana)       |
+------------+---------------------+-----+---------------------+
| size       | size(large)         | ... | size(small)         |
+------------+---------------------+-----+---------------------+
| tasty(no)  | 0.34615384615384615 | ... | 0.6428571428571429  |
+------------+---------------------+-----+---------------------+
| tasty(yes) | 0.6538461538461539  | ... | 0.35714285714285715 |
+------------+---------------------+-----+---------------------+

The estimated values in the CPDs are now more conservative. In particular, the estimate for a small banana being not tasty is now around 0.64 rather than 1.0. Setting equivalent_sample_size to 10 means the prior acts like 10 uniform pseudo-samples spread over all combinations of states and parent configurations: with 2 tasty states and 4 parent configurations, each cell receives 10/(2·4) = 1.25 pseudo-counts, so the small-banana estimate becomes (1 + 1.25)/(1 + 0 + 2·1.25) ≈ 0.64, matching the output above.

BayesianEstimator, too, can be used via the fit()-method:

import numpy as np
import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.estimators import BayesianEstimator

# generate data
data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5000, 4)), columns=['A', 'B', 'C', 'D'])
model = DiscreteBayesianNetwork([('A', 'B'), ('A', 'C'), ('D', 'C'), ('B', 'D')])

model.fit(data, estimator=BayesianEstimator, prior_type="BDeu") # default equivalent_sample_size=5
for cpd in model.get_cpds():
    print(cpd)
    break
+------+----------+
| A(0) | 0.497003 |
+------+----------+
| A(1) | 0.502997 |
+------+----------+

Learning with hidden or missing variables

In many applications, some variables are unobserved in the training data.

Example:

  • In a movie-rating model with variables $(G, R_1, R_2)$, the genre $G$ might be hidden:

    • Observed: ratings $(R_1, R_2)$,

    • Hidden: $G$.

We then aim to maximize the marginal likelihood over observed variables:

$$\max_{\vartheta} \prod_{e \in D_{\text{train}}} p(E=e; \vartheta),$$

where

$$p(E=e; \vartheta) = \sum_h P(H=h, E=e; \vartheta)$$

sums over the hidden variables $H$.

Direct optimization is hard because:

  • The sum over $h$ is inside the product over examples,

  • There is no simple closed-form solution like in the fully observed case.

This motivates the EM algorithm.

Expectation–Maximization (EM) algorithm

EM solves maximum marginal likelihood by iteratively constructing pseudo-complete datasets: the E-step computes the posterior over the hidden variables given the current parameters, and the M-step re-estimates the parameters by count-and-normalize on the resulting weighted, pseudo-complete data. It comes with the following guarantees:

  • The marginal log-likelihood does not decrease at each iteration,

  • It converges to a local optimum (not necessarily the global one).

EM in Python

For this example, we simulate some data from the alarm model and use it to learn back the model parameters. We simply reuse the structure of the alarm model.

from pgmpy.utils import get_example_model
from pgmpy.models import DiscreteBayesianNetwork

# Load the alarm model and simulate data from it.
alarm_model = get_example_model(model="alarm")
samples = alarm_model.simulate(n_samples=int(1e3))

print(samples.head())

# Define a new model with the same structure as the alarm model.
new_model = DiscreteBayesianNetwork(ebunch=alarm_model.edges())
Output
  ERRCAUTER  ARTCO2 VENTALV VENTLUNG  SAO2 PULMEMBOLUS      BP HYPOVOLEMIA  \
0      TRUE     LOW    HIGH      LOW  HIGH       FALSE    HIGH       FALSE   
1     FALSE  NORMAL  NORMAL     ZERO   LOW       FALSE     LOW       FALSE   
2     FALSE    HIGH    ZERO     ZERO   LOW       FALSE     LOW        TRUE   
3     FALSE    HIGH    ZERO     ZERO   LOW       FALSE  NORMAL        TRUE   
4     FALSE    HIGH    ZERO     ZERO   LOW       FALSE     LOW       FALSE   

      CVP  HRBP  ...     TPR  MINVOL INTUBATION PVSAT MINVOLSET      CO HREKG  \
0  NORMAL  HIGH  ...  NORMAL    HIGH     NORMAL  HIGH    NORMAL    HIGH   LOW   
1  NORMAL  HIGH  ...     LOW  NORMAL   ONESIDED   LOW    NORMAL    HIGH  HIGH   
2    HIGH  HIGH  ...     LOW    ZERO     NORMAL   LOW    NORMAL  NORMAL  HIGH   
3    HIGH  HIGH  ...  NORMAL    ZERO     NORMAL   LOW    NORMAL  NORMAL  HIGH   
4     LOW  HIGH  ...     LOW    ZERO     NORMAL   LOW    NORMAL    HIGH  HIGH   

  LVEDVOLUME    FIO2 STROKEVOLUME  
0     NORMAL  NORMAL       NORMAL  
1     NORMAL  NORMAL       NORMAL  
2       HIGH  NORMAL          LOW  
3       HIGH  NORMAL          LOW  
4     NORMAL     LOW       NORMAL  

[5 rows x 37 columns]

The Expectation Maximization (EM) estimator can handle latent variables in the model. To simulate this scenario, we specify some of the variables in our new model as latent and drop those variables from samples to mimic missing data.

from pgmpy.estimators import ExpectationMaximization as EM

model_latent = DiscreteBayesianNetwork(alarm_model.edges(), latents={'HISTORY', 'CVP'})
samples_latent = samples.drop(['HISTORY', 'CVP'], axis=1)

# Initialize the EM estimator
em_est = EM(model=model_latent, data=samples_latent)

# Iterate through CPDs generated by the model
for cpd in em_est.get_parameters():
    print(f"Conditional Probability Distribution for {cpd.variable}:")
    print(cpd)
    break
Conditional Probability Distribution for ERRCAUTER:
+------------------+-------+
| ERRCAUTER(FALSE) | 0.887 |
+------------------+-------+
| ERRCAUTER(TRUE)  | 0.113 |
+------------------+-------+

Markov networks and factor graphs

A Markov network (also called a Markov random field) is an undirected graphical model that represents a joint distribution over $X = (X_1,\dots,X_n)$ via factors.

We define:

  • A set of factors $f_1,\dots,f_m$, each depending on a subset of variables.

  • For any assignment $x$, a weight

    $$\text{Weight}(x) = \prod_{j=1}^m f_j(x).$$

The normalized joint distribution is

$$P(X=x) = \frac{\text{Weight}(x)}{Z}, \qquad Z = \sum_{x'} \text{Weight}(x')$$

where $Z$ is the partition function.

A factor graph is a bipartite graph with:

  • Variable nodes $X_i$,

  • Factor nodes $f_j$,

  • Edges connecting factors to the variables they depend on.

Factor graphs generalize Bayesian networks and are convenient for describing models like object tracking, image models, and general constraint systems.
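
A minimal sketch of a factor graph in pgmpy, with two binary variables, two unary factors, and one pairwise factor (the factor values are arbitrary assumptions):

from pgmpy.models import FactorGraph
from pgmpy.factors.discrete import DiscreteFactor

G = FactorGraph()
G.add_nodes_from(["X1", "X2"])

f1 = DiscreteFactor(["X1"], [2], [2.0, 1.0])                       # unary factor on X1
f2 = DiscreteFactor(["X2"], [2], [1.0, 3.0])                       # unary factor on X2
f12 = DiscreteFactor(["X1", "X2"], [2, 2], [10.0, 1.0, 1.0, 10.0]) # pairwise factor

G.add_factors(f1, f2, f12)
G.add_edges_from([("X1", f1), ("X2", f2), ("X1", f12), ("X2", f12)])

# Weight(x) is the product of all factors; Z normalizes it into P(X = x).
joint = f1 * f2 * f12
Z = joint.values.sum()
print(joint.values / Z)  # 2x2 table of P(X1, X2)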

Object tracking: MAP trajectory vs marginal probabilities

In the object tracking example:

  • $X_1, X_2, X_3$ are discrete positions,

  • Observation factors $o_i(x_i)$ reflect sensor likelihoods,

  • Transition factors $t_i(x_i, x_{i+1})$ enforce smooth motion.

The factor graph defines the weight

$$\text{Weight}(x_1,x_2,x_3) = o_1(x_1)\,o_2(x_2)\,o_3(x_3)\,t_1(x_1,x_2)\,t_2(x_2,x_3).$$

Two different tasks:

  1. MAP assignment (most likely trajectory):

    $$(x_1^{*}, x_2^{*}, x_3^{*}) = \arg\max_{x_1,x_2,x_3} \text{Weight}(x_1,x_2,x_3).$$
  2. Marginal probabilities:

    $$P(X_2 = v) = \sum_{x_1,x_3} P(X_1=x_1, X_2=v, X_3=x_3) = \frac{1}{Z} \sum_{x_1,x_3} \text{Weight}(x_1, v, x_3).$$

Lesson:

  • The most probable trajectory may place the object at one position at time 2,

  • While the marginal $P(X_2=v)$ may be higher for another position when summing over all trajectories.

Thus, returning only the MAP trajectory does not convey uncertainty about individual positions.
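
The contrast can be made concrete by brute-force enumeration over a tiny state space; the factor tables below are made-up assumptions (3 time steps, 2 positions each):

import itertools
import numpy as np

o = [np.array([1.0, 2.0]), np.array([2.0, 1.0]), np.array([1.0, 2.0])]  # o_i(x_i)
t = np.array([[3.0, 1.0],                                               # t(x_i, x_{i+1})
              [1.0, 3.0]])

def weight(x1, x2, x3):
    return o[0][x1] * o[1][x2] * o[2][x3] * t[x1, x2] * t[x2, x3]

states = list(itertools.product([0, 1], repeat=3))
Z = sum(weight(*s) for s in states)

# MAP assignment: the single trajectory with the largest weight.
print(max(states, key=lambda s: weight(*s)))

# Marginal P(X2 = v): sum the weights of all trajectories with x2 = v.
for v in (0, 1):
    print(v, sum(weight(*s) for s in states if s[1] == v) / Z)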

Gibbs sampling for approximate marginals

Computing exact marginals

$$P(X_i = v) = \sum_{x : x_i = v} P(X = x)$$

is often infeasible in large models. Gibbs sampling approximates these marginals via a Markov chain.

Algorithm (high level):

  1. Initialize a complete assignment $x^{(0)}$.

  2. For $t = 0, 1, 2, \dots$:

    • Choose a variable index $i$,

    • Sample a new value for $X_i$ from the conditional

      $$X_i^{(t+1)} \sim P\big(X_i \mid X_{-i} = x_{-i}^{(t)}\big),$$

      where $X_{-i}$ denotes all variables except $X_i$.

  3. After a burn-in period, collect samples $x^{(t)}$ and estimate marginals by frequencies:

    $$\hat{P}(X_i = v) = \frac{\#\{t : X_i^{(t)} = v\}}{\text{number of collected samples}}.$$
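
pgmpy ships a GibbsSampling class that implements this scheme. A minimal sketch on a two-node network, where the CPD values are assumptions chosen so that the exact marginal is $P(B=1) = 0.7 \cdot 0.1 + 0.3 \cdot 0.8 = 0.31$:

from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.factors.discrete import TabularCPD
from pgmpy.sampling import GibbsSampling

# Tiny network A -> B with assumed CPDs.
model = DiscreteBayesianNetwork([("A", "B")])
model.add_cpds(
    TabularCPD("A", 2, [[0.7], [0.3]]),
    TabularCPD("B", 2, [[0.9, 0.2], [0.1, 0.8]], evidence=["A"], evidence_card=[2]),
)

gibbs = GibbsSampling(model)
samples = gibbs.sample(size=5000)  # for simplicity, no explicit burn-in here

# Estimate P(B) by sample frequencies; should be close to [0.69, 0.31].
print(samples["B"].value_counts(normalize=True))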

Key properties:

  • Gibbs sampling is a form of Markov Chain Monte Carlo (MCMC).

  • Under mild conditions (irreducibility, aperiodicity), the chain converges to the true joint distribution.

  • It only needs local computations (looking at the factors involving $X_i$), making it scalable to large models.

The slides illustrate Gibbs sampling on the object tracking / Markov network example.