Probability Theory and Bayes’ Theorem
Random experiments, events, and random variables¶
A random experiment produces outcomes from a sample space $\Omega$ of elementary events.
Example: rolling a fair die, $\Omega = \{1, 2, 3, 4, 5, 6\}$.
An event $A$ is a subset of $\Omega$.
Example: “number is larger than 3”, i.e. $A = \{4, 5, 6\}$.
A random variable is a numerical quantity whose value depends on the outcome of the random experiment.
Examples:
$X$: the number shown after a die roll (values in $\{1, 2, \dots, 6\}$).
$Y$: outcome of a coin toss, represented numerically (e.g. heads $= 1$, tails $= 0$).
Probability and the Laplace model¶
The probability $P(A)$ of an event $A$ satisfies
$$0 \le P(A) \le 1$$
and quantifies how likely it is that $A$ will occur in the future. A very simple model is the Laplace model (principle of indifference):
$$P(A) = \frac{|A|}{|\Omega|} = \frac{\text{number of outcomes in } A}{\text{number of outcomes in } \Omega}.$$
The Laplace model is simple but limited: many real problems have outcomes with different probabilities, so we need richer probability models.
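As a concrete check of the Laplace model, here is a minimal Python sketch for the fair-die event “number is larger than 3” from above:

```python
# Laplace model for a fair die: P(A) = |A| / |Omega|
omega = {1, 2, 3, 4, 5, 6}          # sample space of a fair die
A = {x for x in omega if x > 3}     # event "number is larger than 3"

p_A = len(A) / len(omega)
print(p_A)  # 0.5
```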
Probability mass function (PMF)¶
For a discrete random variable $X$, the probability mass function (PMF) assigns a probability to each possible value $x$:
$$p(x) = P(X = x).$$
The PMF must satisfy:
Non-negativity: $p(x) \ge 0$ for all $x$.
Normalization: $\sum_x p(x) = 1$.
The expectation (mean) of $X$ is
$$E[X] = \sum_x x \, p(x),$$
and the variance is
$$\mathrm{Var}(X) = E\big[(X - E[X])^2\big] = \sum_x (x - E[X])^2 \, p(x).$$
For continuous random variables, the PMF is replaced by a probability density function (PDF), and sums become integrals.
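The PMF conditions, expectation, and variance can be verified numerically; a short Python sketch for the fair die (probabilities hard-coded as $1/6$ under the Laplace model):

```python
# PMF, expectation, and variance of a fair die roll
pmf = {x: 1 / 6 for x in range(1, 7)}           # p(x) = 1/6 for x = 1..6

assert abs(sum(pmf.values()) - 1.0) < 1e-12      # normalization check

mean = sum(x * p for x, p in pmf.items())                # E[X] = 3.5
var = sum((x - mean) ** 2 * p for x, p in pmf.items())   # Var(X) = 35/12 ≈ 2.92

print(mean, var)
```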
Sampling vs. inference — the magic coin and the binomial distribution¶
Consider a magic coin with probability $p$ of showing heads on any toss.
Sampling problem¶
If we know $p$, what is the probability that we observe exactly $k$ heads in $n$ tosses?
The answer is given by the binomial distribution:
$$P(k \mid p, n) = \binom{n}{k} \, p^k (1 - p)^{n - k}.$$
This is a sampling question: given $p$, what does the data look like?
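A short Python sketch of the sampling direction, using math.comb; the values $p = 0.5$, $n = 10$, $k = 7$ are illustrative choices, not taken from the text:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    """P(k | p, n) for the binomial distribution."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Sampling question: if p = 0.5, how likely are exactly 7 heads in 10 tosses?
print(binom_pmf(7, 10, 0.5))   # ≈ 0.117
```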
Inference problem¶
We toss the coin $n$ times and observe $k$ heads.
What can we say about the unknown probability $p$?
This is an inference question: from observed data we want to learn about an unknown parameter $p$.
Frequentist approach: estimate $p$ with a point estimate such as the maximum likelihood estimator (MLE).
Bayesian approach: treat $p$ itself as random, assign a prior belief, and update it using Bayes’ theorem to get a posterior.
Frequentist view and Maximum Likelihood¶
Frequentist definition of probability¶
In the frequentist view, the probability of an event $A$ is defined as its long-run relative frequency in infinitely many identical repetitions:
$$P(A) = \lim_{n \to \infty} \frac{n_A}{n},$$
where $n_A$ counts how often $A$ occurred in $n$ repetitions.
Probability statements are made about data (events in repeated experiments), not about parameters.
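A quick simulation sketch in Python of this definition: the relative frequency of heads in repeated fair-coin tosses settles near 0.5 as $n$ grows (the seed and toss counts are arbitrary):

```python
import random

random.seed(0)

# Relative frequency of "heads" stabilizes as the number of tosses grows
for n in (100, 10_000, 1_000_000):
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)
```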
Maximum Likelihood for the magic coin¶
For the coin tossed $n$ times with $k$ observed heads, the likelihood of a candidate parameter $p$ is
$$L(p) = P(k \mid p, n) = \binom{n}{k} \, p^k (1 - p)^{n - k}.$$
We can find the maximum likelihood estimator (MLE) by maximizing $L(p)$ (or $\log L(p)$) with respect to $p$:
$$\hat{p}_{\text{MLE}} = \arg\max_p L(p).$$
Taking logs and differentiating,
$$\frac{d}{dp} \log L(p) = \frac{k}{p} - \frac{n - k}{1 - p}.$$
Setting the derivative to zero and solving gives the familiar estimator
$$\hat{p}_{\text{MLE}} = \frac{k}{n}.$$
This is the standard frequentist answer to “what is the most likely $p$ that produced the data?”
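A Python sketch that checks this numerically: maximizing the log-likelihood over a grid of candidate $p$ values recovers $k/n$ (the data $n = 10$, $k = 7$ are again illustrative):

```python
from math import comb, log

n, k = 10, 7   # illustrative data: 7 heads in 10 tosses

def log_likelihood(p: float) -> float:
    return log(comb(n, k)) + k * log(p) + (n - k) * log(1 - p)

# Grid search over candidate p values in (0, 1)
grid = [i / 1000 for i in range(1, 1000)]
p_hat = max(grid, key=log_likelihood)
print(p_hat, k / n)   # both 0.7
```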
Bayesian view: probability as degree of belief¶
In the Bayesian view, probability measures a degree of belief (plausibility) held by an observer.
Probability statements can be made not only about data but also about parameters.
Different observers can have different probabilities for the same event, because they may have different prior knowledge.
For an event $A$ and observed data $D$, Bayes’ theorem updates our belief:
$$P(A \mid D) = \frac{P(D \mid A) \, P(A)}{P(D)}.$$
In more general (parameter) notation, with parameter $\theta$ and data $D$, the posterior is
$$P(\theta \mid D) = \frac{P(D \mid \theta) \, P(\theta)}{P(D)}.$$
Here
$P(\theta)$ is the prior (belief before seeing the data),
$P(D \mid \theta)$ is the likelihood,
$P(\theta \mid D)$ is the posterior (updated belief),
$P(D)$ is the evidence (normalizing constant).
In proportional form (often used in practice):
$$P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta).$$
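A sketch of this proportional form in action for the magic coin, on a discrete grid of candidate $p$ values (Python; the flat prior and the data $n = 10$, $k = 7$ are illustrative assumptions, not from the text):

```python
from math import comb

n, k = 10, 7
grid = [i / 100 for i in range(101)]        # candidate values of p

prior = [1 / len(grid)] * len(grid)         # flat prior: every p equally plausible
likelihood = [comb(n, k) * p**k * (1 - p)**(n - k) for p in grid]

unnorm = [l * pr for l, pr in zip(likelihood, prior)]   # P(D|p) * P(p)
evidence = sum(unnorm)                                  # P(D), the normalizer
posterior = [u / evidence for u in unnorm]

print(grid[posterior.index(max(posterior))])   # posterior mode: 0.7
```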
Conditional probability¶
Given two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
This expresses “how likely $A$ is, knowing that $B$ has occurred”.
Similarly, for $P(A) > 0$,
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$
Rearranging these definitions, we can relate joint and conditional probabilities:
$$P(A \cap B) = P(A \mid B) \, P(B) = P(B \mid A) \, P(A).$$
Bayes’ theorem — conditional probability inverter¶
Starting from the definition of conditional probability,
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad\text{and}\quad P(B \mid A) = \frac{P(A \cap B)}{P(A)},$$
we equate the joint probability from both expressions:
$$P(A \mid B) \, P(B) = P(A \cap B) = P(B \mid A) \, P(A).$$
Solving for $P(A \mid B)$ gives Bayes’ theorem:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.$$
Bayes’ theorem inverts conditional probabilities: it allows us to compute $P(A \mid B)$ from $P(B \mid A)$, provided we know the relevant priors and marginals.
Law of total probability (marginalisation)¶
Sometimes $P(B)$ is not known directly. Suppose
the sample space $\Omega$ is partitioned into disjoint events $A_1, \dots, A_n$,
so that $A_i \cap A_j = \emptyset$ for $i \neq j$ and $A_1 \cup \dots \cup A_n = \Omega$.
Then the law of total probability states that
$$P(B) = \sum_{i=1}^{n} P(B \mid A_i) \, P(A_i).$$
Special case with $A$ and its complement $\bar{A}$:
$$P(B) = P(B \mid A) \, P(A) + P(B \mid \bar{A}) \, P(\bar{A}).$$
This is a key building block for computing the denominator in Bayes’ theorem:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B \mid A) \, P(A) + P(B \mid \bar{A}) \, P(\bar{A})}.$$
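This denominator construction can be wrapped in a small helper; a Python sketch (the function name and arguments are my own, chosen for illustration):

```python
def posterior(prior_A: float, p_B_given_A: float, p_B_given_not_A: float) -> float:
    """P(A|B) via Bayes' theorem, with P(B) built from the law of total probability."""
    evidence = p_B_given_A * prior_A + p_B_given_not_A * (1 - prior_A)
    return p_B_given_A * prior_A / evidence

# Example call with arbitrary illustrative numbers
print(posterior(prior_A=0.3, p_B_given_A=0.8, p_B_given_not_A=0.2))  # ≈ 0.632
```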
Intuition examples: smokers and car belts¶
Smoker example¶
Let
$A$: “person is female”
$B$: “person is a smoker”
We may want to compare
$P(A \mid B)$: probability a smoker is female,
$P(B \mid A)$: probability a female is a smoker.
These are in general not equal, and confusing them is a common mistake.
Car belt example (base-rate intuition)¶
Let
$A$: “person dies in a car accident”
$B$: “person wears a seat belt”
It may be true that most people who died in car accidents were wearing a belt, so $P(B \mid A)$ is large.
But this does not mean that wearing a belt is dangerous.
What really matters for your personal risk is $P(A \mid B)$ vs. $P(A \mid \bar{B})$, and because $P(A)$ is extremely small, seat belts can dramatically reduce risk even if $P(B \mid A)$ is high.
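A numeric sketch of this intuition in Python; all numbers below are invented purely to separate $P(B \mid A)$ from $P(A \mid B)$, not real accident statistics:

```python
# Entirely made-up illustrative numbers
p_A = 1e-4            # P(A): dying in a car accident (tiny base rate)
p_B = 0.95            # P(B): wearing a seat belt
p_B_given_A = 0.60    # P(B|A): most victims wore a belt

# Bayes: P(A|B) = P(B|A) P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
# P(A|not B) = P(not B|A) P(A) / P(not B)
p_A_given_notB = (1 - p_B_given_A) * p_A / (1 - p_B)

print(p_A_given_B, p_A_given_notB)   # risk without a belt is much higher
```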
Medical tests: sensitivity, specificity, and the base-rate fallacy¶
Consider a diagnostic test, e.g. mammography for breast cancer.
Let
$D$: “person has the disease (breast cancer)”,
$\bar{D}$: “person does not have the disease”,
$T^+$: “test result is positive”,
$T^-$: “test result is negative”.
The test is characterized by:
Sensitivity (true positive rate, TPR): $P(T^+ \mid D)$
Specificity (true negative rate, TNR): $P(T^- \mid \bar{D})$
False positive rate (FPR): $P(T^+ \mid \bar{D}) = 1 - \text{TNR}$
Prevalence (base rate) of the disease in the population: $P(D)$
Posterior probability of disease given a positive test¶
What we really want to know is the posterior $P(D \mid T^+)$.
Using the law of total probability for $P(T^+)$,
$$P(T^+) = P(T^+ \mid D) \, P(D) + P(T^+ \mid \bar{D}) \, P(\bar{D}),$$
we obtain the key formula
$$P(D \mid T^+) = \frac{P(T^+ \mid D) \, P(D)}{P(T^+ \mid D) \, P(D) + P(T^+ \mid \bar{D}) \, P(\bar{D})}.$$
Similarly, the probability of having the disease despite a negative test is
$$P(D \mid T^-) = \frac{P(T^- \mid D) \, P(D)}{P(T^- \mid D) \, P(D) + P(T^- \mid \bar{D}) \, P(\bar{D})}.$$
The base-rate fallacy occurs when we ignore the prevalence $P(D)$ and over-interpret the sensitivity or specificity alone.
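A numeric Python sketch of the posterior formula; the prevalence, sensitivity, and specificity below are illustrative values, not figures from the text:

```python
# Illustrative (not real) test characteristics and prevalence
prevalence = 0.01     # P(D)
sensitivity = 0.90    # P(T+|D)
specificity = 0.91    # P(T-|not D)  =>  FPR = 0.09

fpr = 1 - specificity
p_pos = sensitivity * prevalence + fpr * (1 - prevalence)   # law of total probability
p_disease_given_pos = sensitivity * prevalence / p_pos

print(p_disease_given_pos)   # ≈ 0.09: a positive test alone is far from certainty
```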
Bayes’ theorem as a belief updater¶
Bayesian statistics is all about updating beliefs in light of new data.
Given a hypothesis/event $H$ and data $D$:
Prior: $P(H)$, the belief in $H$ before seeing data.
Likelihood: $P(D \mid H)$, the probability of seeing data $D$ if $H$ is true.
Evidence: $P(D)$, the overall probability of the data, averaging over all possibilities.
Posterior: $P(H \mid D)$, the belief in $H$ after seeing data.
Bayes’ theorem:
$$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}.$$
In continuous form with densities,
$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{\int p(D \mid \theta') \, p(\theta') \, d\theta'}.$$
A famous slogan is
“Today’s posterior is tomorrow’s prior.”
As new data arrive, we repeatedly apply Bayes’ theorem to update our beliefs.
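A Python sketch of this updating loop for the coin on a grid of $p$ values: applying Bayes’ theorem batch by batch (the 4 + 1 and 3 + 2 toss counts are invented) gives the same posterior as one update on all the data, which is exactly the “today’s posterior is tomorrow’s prior” idea.

```python
grid = [i / 100 for i in range(101)]           # candidate values of p

def update(prior, heads, tails):
    """One application of Bayes' theorem on the grid, then normalize."""
    post = [pr * p**heads * (1 - p)**tails for pr, p in zip(prior, grid)]
    s = sum(post)
    return [x / s for x in post]

flat = [1.0] * len(grid)                        # flat prior (unnormalized)

post1 = update(flat, heads=4, tails=1)          # after day 1
post2 = update(post1, heads=3, tails=2)         # day 1 posterior used as day 2 prior

post_all = update(flat, heads=7, tails=3)       # single update on all the data
print(max(abs(a - b) for a, b in zip(post2, post_all)))   # ≈ 0
```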
DNA tests in court: prosecutor’s fallacy¶
Let
$G$: “suspect is guilty”,
$\bar{G}$: “suspect is not guilty”,
$M$: “DNA test is a positive match”.
Often, expert witnesses report a very small probability of a false match, e.g.
$$P(M \mid \bar{G}) = \frac{1}{2\,000\,000}.$$
The prosecutor’s fallacy is to confuse this with $P(\bar{G} \mid M)$ (equivalently, $1 - P(G \mid M)$), saying things like “the probability that the suspect is innocent is 1 in 2 million”.
The correct quantity for the court is the posterior guilt
$$P(G \mid M) = \frac{P(M \mid G) \, P(G)}{P(M)}.$$
Using the law of total probability,
$$P(M) = P(M \mid G) \, P(G) + P(M \mid \bar{G}) \, P(\bar{G}),$$
so
$$P(G \mid M) = \frac{P(M \mid G) \, P(G)}{P(M \mid G) \, P(G) + P(M \mid \bar{G}) \, P(\bar{G})}.$$
Because the base rate $P(G)$ (how many people in the population could be the culprit) is usually extremely small, even a very low $P(M \mid \bar{G})$ does not automatically imply that $P(G \mid M)$ is close to 1.
This is conceptually similar to the medical-test/base-rate example.
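A numeric Python sketch of why the base rate matters here; apart from the 1-in-2,000,000 false-match rate quoted above, all numbers (a perfect true-match rate, a pool of 500,000 potential culprits) are hypothetical:

```python
p_match_given_innocent = 1 / 2_000_000   # reported false-match probability
p_match_given_guilty = 1.0               # assume the test always matches the culprit
p_guilty = 1 / 500_000                   # prior: suspect drawn from a pool of 500,000

p_match = (p_match_given_guilty * p_guilty
           + p_match_given_innocent * (1 - p_guilty))   # law of total probability
p_guilty_given_match = p_match_given_guilty * p_guilty / p_match

print(p_guilty_given_match)   # ≈ 0.8, not 1 - 1/2,000,000
```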
When do Bayesian and frequentist answers differ?¶
The two approaches can give similar or very different answers, depending on the situation.
They tend to differ when:
Strong prior information is available (Bayesian) and the data are weak or noisy.
There is a small sample size (few observations).
Base rates matter (rare events, like certain diseases or crimes).
In Bayesian inference we explicitly combine the prior with the likelihood,
$$P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta),$$
while in frequentist maximum likelihood we use only the likelihood $P(D \mid \theta)$.
Bayesian methods make it possible to do real inference about parameters (e.g. “how plausible is each value of $\theta$?”), whereas frequentist tools focus on sampling properties and often provide only point estimates and hypothesis tests.