Bayesian Networks - Learning from Data
Learning a Bayesian network can be split into two problems:
Parameter learning: Given a set of data samples and a DAG that captures the dependencies between the variables, estimate the (conditional) probability distributions of the individual variables.
Structure learning: Given a set of data samples, estimate a DAG that captures the dependencies between the variables.
Conditional Independence¶
In a Bayesian Network, we represent observed variables by shaded nodes (i.e. , all other variables are understood as unobserved (hidden, i.e. ).
Three elementary patterns appear repeatedly in Bayesian networks:
Chain (serial connection)¶
Fork (diverging connection)¶
Collider (converging connection)¶
d-Separation (directional separation)¶
Think of paths as pipes and statistical dependence as water:
If at least one path is open, dependence can flow.
If all paths are blocked, then and are conditionally independent given .
Testing Conditional Independence in Data¶
Given a candidate causal DAG that could have generated a dataset :
Use d-Separation to list all implied conditional independencies (e.g. ).
For each implied independence, test on data:
Use regression or other statistical tools,
Check whether the conditional dependence is approximately zero.
For example for the following graph :
Suppose implies .
Regress on and using
If (statistically insignificant), this is consistent with the model.
If is clearly non-zero, the independence fails in data, so must be rejected.
Markov equivalence and equivalence classes¶
Many different DAGs can induce the same set of conditional independencies. These DAGs are Markov equivalent.
Factorization examples¶
Consider three variables where we know that
Several factorizations of are consistent with this:
One possible factorization:
Loading...If does not depend on , we can write this as
Loading...Another factorization:
Loading...
Different DAGs correspond to different factorization orders, but if they encode the same conditional independencies (), they belong to the same Markov equivalence class.
Parameter sharing¶
In many models, multiple nodes share the same local conditional distribution. This is called parameter sharing.
Consider a chain of four binary variables:
Estimating the full joint distribution
Full joint distribution has entries (15 degrees of freedom)
With limited data, many combinations are rare or absent
Frequency estimates for these cells are unreliable
Using factorization
Each factor involves at most 2 variables
We estimate each conditional with more data per parameter
This yields more stable estimates from the same dataset
Exploiting the graph structure dramatically reduces the number of parameters and improves estimation in high dimensions.
Probabilistic Inference¶
Input:
A Bayesian network ,
Evidence for some variables ,
Query variables ,
Output:
The output is a table that specifies a probability for each assignment of values to Q.
Parameter learning with fully observed data¶
Assume:
The BN structure (graph) is known,
Training data consists of full assignments to all variables ,
We want to estimate parameters (local conditional probabilities).
The maximum likelihood objective is
Taking logs converts the product into a sum, and the problem decouples across variables and parent configurations.
For each node and parent assignment , we solve:
subject to .
Solution:
Maximum likelihood via count-and-normalize¶
Using pgmpy for MLE¶
Suppose we have the following data:
Source
data = pd.DataFrame(data={'fruit': ["banana", "apple", "banana", "apple", "banana","apple", "banana",
"apple", "apple", "apple", "banana", "banana", "apple", "banana",],
'tasty': ["yes", "no", "yes", "yes", "yes", "yes", "yes",
"yes", "yes", "yes", "yes", "no", "no", "no"],
'size': ["large", "large", "large", "small", "large", "large", "large",
"small", "large", "large", "large", "large", "small", "small"]})
data.head()We know that the variables relate as follows:
Source
from pgmpy.models import DiscreteBayesianNetwork
model = DiscreteBayesianNetwork([('fruit', 'tasty'), ('size', 'tasty')]) # fruit -> tasty <- size
graphviz.Source(model.to_graphviz().string())Parameter learning is the task to estimate the values of the conditional probability distributions (CPDs), for the variables fruit, size, and tasty.
State counts¶
To make sense of the given data, we can start by counting how often each state of the variable occurs. If the variable is dependent on parents, the counts are done conditionally on the parents states, i.e. separately for each parent configuration:
Source
from pgmpy.estimators import ParameterEstimator
pe = ParameterEstimator(model, data)
print("\n", pe.state_counts('fruit')) # unconditional
print("\n", pe.state_counts('tasty')) # conditional on fruit and size
count
fruit
apple 7
banana 7
fruit apple banana
size large small large small
tasty
no 1.0 1.0 1.0 1.0
yes 3.0 2.0 5.0 0.0
We can see, for example, that as many apples as bananas were observed and that 5 large bananas were tasty, while only 1 was not.
Maximum Likelihood Estimation¶
A natural estimate for the CPDs is to simply use the relative frequencies, with which the variable states have occured. We observed 7 apples among a total of 14 fruits, so we might guess that about 50% of fruits are apples.
This approach is Maximum Likelihood Estimation (MLE). According to MLE, we should fill the CPDs in such a way, that is maximal. This is achieved when using the relative frequencies. pgmpy supports MLE as follows:
Source
from pgmpy.estimators import MaximumLikelihoodEstimator
mle = MaximumLikelihoodEstimator(model, data)
print(mle.estimate_cpd('fruit')) # unconditional
print(mle.estimate_cpd('tasty')) # conditional+---------------+-----+
| fruit(apple) | 0.5 |
+---------------+-----+
| fruit(banana) | 0.5 |
+---------------+-----+
+------------+--------------+-----+---------------+
| fruit | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size | size(large) | ... | size(small) |
+------------+--------------+-----+---------------+
| tasty(no) | 0.25 | ... | 1.0 |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75 | ... | 0.0 |
+------------+--------------+-----+---------------+
mle.estimate_cpd(variable)computes the state counts and divides each cell by the (conditional) sample size.The
mle.get_parameters()-method returns a list of CPDs for all variables of the model.The built-in
fit()-method ofBayesianNetworkprovides more convenient access to parameter estimators:
# Calibrate all CPDs of `model` using MLE:
model.fit(data, estimator=MaximumLikelihoodEstimator)
print(model.get_cpds('fruit'))
print(model.get_cpds('tasty'))
# Or alternatively, if you want to get all CPDs at once:
# for cpd in model.get_cpds():
# print(cpd)+---------------+-----+
| fruit(apple) | 0.5 |
+---------------+-----+
| fruit(banana) | 0.5 |
+---------------+-----+
+------------+--------------+-----+---------------+
| fruit | fruit(apple) | ... | fruit(banana) |
+------------+--------------+-----+---------------+
| size | size(large) | ... | size(small) |
+------------+--------------+-----+---------------+
| tasty(no) | 0.25 | ... | 1.0 |
+------------+--------------+-----+---------------+
| tasty(yes) | 0.75 | ... | 0.0 |
+------------+--------------+-----+---------------+
While very straightforward, the ML estimator has the problem of overfitting to the data. In above CPD, the probability of a large banana being tasty is estimated at 0.833, because 5 out of 6 observed large bananas were tasty. Fine. But note that the probability of a small banana being tasty is estimated at 0.0, because we observed only one small banana and it happened to be not tasty. But that should hardly make us certain that small bananas aren’t tasty!
We simply do not have enough observations to rely on the observed frequencies. If the observed data is not representative for the underlying distribution, ML estimations will be extremly far off.
Laplace smoothing and Dirichlet priors¶
Pure maximum likelihood can assign zero probability to unseen events:
If some combination never appears in training data, then
To avoid this, we use Laplace smoothing:
Add to each count before normalization.
For discrete :
Interpretation
Equivalent to a Dirichlet prior on the conditional distribution and doing MAP estimation.
Larger produces stronger smoothing (probabilities closer to uniform).
As we collect more data, the posterior concentrates around the empirical frequencies.
Smoothing in Python¶
The Bayesian Parameter Estimator starts with already existing prior CPDs, that express our beliefs about the variables before the data was observed. Those “priors” are then updated, using the state counts from the observed data.
One can think of the priors as consisting in pseudo state counts, that are added to the actual counts before normalization. Unless one wants to encode specific beliefs about the distributions of the variables, one commonly chooses uniform priors, i.e. ones that deem all states equiprobable.
A very simple prior is the so-called K2 prior (Laplace smoothing), which simply adds 1 to the count of every single state.
A somewhat more sensible choice of prior is BDeu (Bayesian Dirichlet equivalent uniform prior). For BDeu we need to specify an equivalent sample size N and then the pseudo-counts are the equivalent of having observed N uniform samples of each variable (and each parent configuration):
from pgmpy.estimators import BayesianEstimator
est = BayesianEstimator(model, data)
print(est.estimate_cpd('tasty', prior_type='BDeu', equivalent_sample_size=10))+------------+---------------------+-----+---------------------+
| fruit | fruit(apple) | ... | fruit(banana) |
+------------+---------------------+-----+---------------------+
| size | size(large) | ... | size(small) |
+------------+---------------------+-----+---------------------+
| tasty(no) | 0.34615384615384615 | ... | 0.6428571428571429 |
+------------+---------------------+-----+---------------------+
| tasty(yes) | 0.6538461538461539 | ... | 0.35714285714285715 |
+------------+---------------------+-----+---------------------+
The estimated values in the CPDs are now more conservative. In particular, the estimate for a small banana being not tasty is now around 0.64 rather than 1.0. Setting equivalent_sample_size to 10 means that for each parent configuration, we add the equivalent of 10 uniform samples (here: +5 small bananas that are tasty and +5 that aren’t).
BayesianEstimator, too, can be used via the fit()-method:
import numpy as np
import pandas as pd
from pgmpy.models import DiscreteBayesianNetwork
from pgmpy.estimators import BayesianEstimator
# generate data
data = pd.DataFrame(np.random.randint(low=0, high=2, size=(5000, 4)), columns=['A', 'B', 'C', 'D'])
model = DiscreteBayesianNetwork([('A', 'B'), ('A', 'C'), ('D', 'C'), ('B', 'D')])
model.fit(data, estimator=BayesianEstimator, prior_type="BDeu") # default equivalent_sample_size=5
for cpd in model.get_cpds():
print(cpd)
break+------+----------+
| A(0) | 0.497003 |
+------+----------+
| A(1) | 0.502997 |
+------+----------+
Learning with hidden or missing variables¶
In many applications, some variables are unobserved in the training data.
Example:
In a movie-rating model with variables , the genre might be hidden:
Observed: ratings ,
Hidden: .
We then aim to maximize the marginal likelihood over observed variables:
where
sums over hidden variables .
Direct optimization is hard because:
The sum over is inside the product over examples,
There is no simple closed-form solution like in the fully observed case.
This motivates the EM algorithm.
Expectation–Maximization (EM) algorithm¶
EM solves maximum marginal likelihood by iteratively constructing pseudo-complete datasets. It comes with the following guarantees:
The marginal log-likelihood does not decrease at each iteration
Converges to a local optimum (not necessarily global)
EM in Python¶
For this example, we simulate some data from the alarm model and use it to learn back the model parameters. In this example, we simply use the structure to the alarm model.
from pgmpy.utils import get_example_model
from pgmpy.models import DiscreteBayesianNetwork
# Load the alarm model and simulate data from it.
alarm_model = get_example_model(model="alarm")
samples = alarm_model.simulate(n_samples=int(1e3))
print(samples.head())
# Define a new model with the same structure as the alarm model.
new_model = DiscreteBayesianNetwork(ebunch=alarm_model.edges())Output
ERRCAUTER ARTCO2 VENTALV VENTLUNG SAO2 PULMEMBOLUS BP HYPOVOLEMIA \
0 TRUE LOW HIGH LOW HIGH FALSE HIGH FALSE
1 FALSE NORMAL NORMAL ZERO LOW FALSE LOW FALSE
2 FALSE HIGH ZERO ZERO LOW FALSE LOW TRUE
3 FALSE HIGH ZERO ZERO LOW FALSE NORMAL TRUE
4 FALSE HIGH ZERO ZERO LOW FALSE LOW FALSE
CVP HRBP ... TPR MINVOL INTUBATION PVSAT MINVOLSET CO HREKG \
0 NORMAL HIGH ... NORMAL HIGH NORMAL HIGH NORMAL HIGH LOW
1 NORMAL HIGH ... LOW NORMAL ONESIDED LOW NORMAL HIGH HIGH
2 HIGH HIGH ... LOW ZERO NORMAL LOW NORMAL NORMAL HIGH
3 HIGH HIGH ... NORMAL ZERO NORMAL LOW NORMAL NORMAL HIGH
4 LOW HIGH ... LOW ZERO NORMAL LOW NORMAL HIGH HIGH
LVEDVOLUME FIO2 STROKEVOLUME
0 NORMAL NORMAL NORMAL
1 NORMAL NORMAL NORMAL
2 HIGH NORMAL LOW
3 HIGH NORMAL LOW
4 NORMAL LOW NORMAL
[5 rows x 37 columns]
The Expectation Maximization (EM) estimator can work in the case when latent variables are present in the model. To simulate this scenario, we will specify some of the variables in our new_model as latents and drop those variables from samples to simulate missing data.
model_latent = DiscreteBayesianNetwork(alarm_model.edges(), latents={'HISTORY', 'CVP'})
samples_latent = samples.drop(['HISTORY', 'CVP'], axis=1)from pgmpy.estimators import ExpectationMaximization as EM
# Initialize the EM estimator
em_est = EM(model=model_latent, data=samples_latent)
# Iterate through CPDs generated by the model
for cpd in em_est.get_parameters():
print(f"Conditional Probability Distribution for {cpd.variable}:")
print(cpd)
breakConditional Probability Distribution for ERRCAUTER:
+------------------+-------+
| ERRCAUTER(FALSE) | 0.887 |
+------------------+-------+
| ERRCAUTER(TRUE) | 0.113 |
+------------------+-------+
Markov networks and factor graphs¶
A Markov network (also called a Markov random field) is an undirected graphical model that represents a joint distribution over via factors.
We define:
A set of factors , each depending on a subset of variables.
For any assignment , a weight
The normalized joint distribution is
where is the partition function.
A factor graph is a bipartite graph with:
Variable nodes ,
Factor nodes ,
Edges connecting factors to the variables they depend on.
Factor graphs generalize Bayesian networks and are convenient for describing models like object tracking, image models, and general constraint systems.
Object tracking: MAP trajectory vs marginal probabilities¶
In the object tracking example:
are discrete positions,
Observation factors reflect sensor likelihoods,
Transition factors enforce smooth motion.
The factor graph defines the weight
Two different tasks:
MAP assignment (most likely trajectory):
Marginal probabilities:
Lesson:
The most probable trajectory may place the object at one position at time 2,
While the marginal may be higher for another position when summing over all trajectories.
Thus, returning only the MAP trajectory does not convey uncertainty about individual positions.
Gibbs sampling for approximate marginals¶
Computing exact marginals
is often infeasible in large models. Gibbs sampling approximates these marginals via a Markov chain.
Algorithm (high level):
Initialize a complete assignment .
For :
Choose a variable index ,
Sample a new value for from the conditional
where denotes all variables except .
After a burn-in period, collect samples and estimate marginals by frequencies:
Key properties:
Gibbs sampling is a form of Markov Chain Monte Carlo (MCMC).
Under mild conditions (irreducibility, aperiodicity), the chain converges to the true joint distribution.
It only needs local computations (looking at factors involving ), making it scalable to large models.
The slides illustrate Gibbs sampling on the object tracking / Markov network example.