Probability Theory and Bayes’ Theorem

Random experiments, events, and random variables

A random experiment produces outcomes from a sample space $\Omega$ of elementary events.

  • Example: rolling a fair die.
    $\Omega = \{1,2,3,4,5,6\}$

An event $E$ is a subset of $\Omega$.

  • Example: “number is larger than 3”
    $E = \{4,5,6\} \subset \Omega$

A random variable $X$ is a numerical quantity whose value depends on the outcome of the random experiment.

Examples:

  • $X$: the number shown after a die roll (values in $\{1,2,3,4,5,6\}$).

  • $Y$: outcome of a coin toss, represented as $Y \in \{0,1\}$ (e.g. $0 = \text{tails}$, $1 = \text{heads}$).

Probability and the Laplace model

The probability $P(E)$ of an event $E$ satisfies

$$0 \le P(E) \le 1$$

and quantifies how likely it is that $E$ will occur. A very simple model is the Laplace model (principle of indifference): all elementary outcomes are assumed equally likely, so that

$$P(E) = \frac{|E|}{|\Omega|} = \frac{\text{number of outcomes in } E}{\text{number of possible outcomes}}.$$

For the die example, the event “number is larger than 3” has probability $P(E) = 3/6 = 1/2$.

The Laplace model is simple but limited: many real problems have outcomes with different probabilities, so we need richer probability models.

Probability mass function (PMF)

For a discrete random variable $X$, the probability mass function (PMF) $p_X$ assigns a probability to each possible value $x$:

$$p_X(x) = P(X = x).$$

The PMF must satisfy:

  1. Non-negativity:

    $p_X(x) \ge 0 \quad \text{for all } x$
  2. Normalization:

    $\sum_{x} p_X(x) = 1$

The expectation (mean) of $X$ is

$$\mathbb{E}[X] = \sum_{x} x \, p_X(x),$$

and the variance is

$$\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \sum_{x} (x - \mu)^2 \, p_X(x), \quad \text{where } \mu = \mathbb{E}[X].$$

For continuous random variables, the PMF is replaced by a probability density function (PDF), and sums become integrals.
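
For the discrete case, these definitions can be evaluated directly. A minimal Python sketch, assuming the fair-die PMF from the Laplace example above:

```python
# PMF, expectation, and variance of a fair die roll.
# The uniform PMF is an assumption matching the Laplace model above.
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}                # p_X(x) = 1/6 for every face

assert abs(sum(pmf.values()) - 1.0) < 1e-12     # normalization check

mean = sum(x * p for x, p in pmf.items())               # E[X] = 3.5
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X) ≈ 2.9167

print(f"E[X] = {mean:.4f}, Var(X) = {var:.4f}")
```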

Sampling vs. inference — the magic coin and the binomial distribution

Consider a magic coin with probability $p$ of showing heads on any toss.

Sampling problem

If we know $p$, what is the probability that we observe exactly $k$ heads in $n$ tosses?

The answer is given by the binomial distribution:

$$P(K = k \mid p) = \binom{n}{k} \, p^{k} (1-p)^{n-k}, \quad k = 0,1,\dots,n.$$

This is a sampling question: given $p$, what does the data look like?
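
A minimal Python sketch of this sampling direction (the value $p = 0.5$ is only an assumption for illustration):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(K = k | p): probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Sampling question: if the coin were fair (p = 0.5), how likely are exactly
# 8 heads in 10 tosses?
print(binomial_pmf(8, 10, 0.5))   # 45 / 1024 ≈ 0.0439
```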

Inference problem

We toss the coin $n=10$ times and observe $k=8$ heads.
What can we say about the unknown probability $p$?

This is an inference question: from observed data we want to learn about an unknown parameter $p$.

  • Frequentist approach: estimate $p$ with a point estimate such as the maximum likelihood estimator (MLE).

  • Bayesian approach: treat $p$ itself as random, assign a prior belief, and update it using Bayes’ theorem to get a posterior.

Frequentist view and Maximum Likelihood

Frequentist definition of probability

In the frequentist view, the probability of an event $E$ is defined as its long-run relative frequency in infinitely many identical repetitions:

$$P(E) \;=\; \lim_{n \to \infty} \frac{\text{number of times } E \text{ occurs in } n \text{ trials}}{n}.$$

Probability statements are made about data (events in repeated experiments), not about parameters.

Maximum Likelihood for the magic coin

For the coin tossed $n$ times with $k$ observed heads, the likelihood of a candidate parameter $p$ is

$$L(p \mid k, n) \;=\; P(K = k \mid p) \;=\; \binom{n}{k} p^k (1-p)^{n-k}.$$

We can find the maximum likelihood estimator (MLE) $\hat p$ by maximizing $L(p)$ (or $\log L(p)$) with respect to $p$:

$$\hat p = \underset{p \in [0,1]}{\arg\max}\; L(p \mid k, n).$$

Taking logs and differentiating,

$$\ell(p) = \log L(p \mid k,n) = \log \binom{n}{k} + k \log p + (n-k)\log(1-p),$$

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p}.$$

Setting the derivative to zero and solving gives the familiar estimator

$$\hat p = \frac{k}{n}.$$

This is the standard frequentist answer to the question “which value of $p$ makes the observed data most likely?”
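
As a quick numerical check of this closed-form result, a sketch that maximizes the log-likelihood over a grid of candidate values for $p$ (the grid resolution is an arbitrary choice):

```python
import numpy as np

n, k = 10, 8            # observed data from the example: 8 heads in 10 tosses

# Log-likelihood up to the constant log C(n, k), which does not affect the argmax.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(p_hat, k / n)     # both ≈ 0.8
```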

Bayesian view: probability as degree of belief

In the Bayesian view, probability measures a degree of belief (plausibility) held by an observer.

  • Probability statements can be made not only about data but also about parameters.

  • Different observers can have different probabilities for the same event, because they may have different prior knowledge.

For an event $A$ and observed data $d$, Bayes’ theorem updates our belief:

$$P(A \mid d) = \frac{P(d \mid A)\, P(A)}{P(d)}.$$

In more general (parameter) notation, with parameter $\theta$ and data $d$, the posterior is

$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}.$$

Here

  • $p(\theta)$ is the prior (belief before seeing the data),

  • $p(d \mid \theta)$ is the likelihood,

  • $p(\theta \mid d)$ is the posterior (updated belief),

  • $p(d)$ is the evidence (normalizing constant).

In proportional form (often used in practice):

$$p(\theta \mid d) \;\propto\; p(d \mid \theta)\, p(\theta).$$
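
A minimal grid-approximation sketch of this proportionality for the magic-coin example (the uniform prior and the grid resolution are assumptions made only for illustration):

```python
import numpy as np

n, k = 10, 8                                    # 8 heads in 10 tosses

theta = np.linspace(0.001, 0.999, 999)          # grid of candidate values for p
prior = np.ones_like(theta)                     # uniform prior (an assumption)
likelihood = theta**k * (1 - theta) ** (n - k)  # binomial likelihood, constant dropped

unnormalized = likelihood * prior               # posterior ∝ likelihood × prior
posterior = unnormalized / unnormalized.sum()   # normalize so it sums to 1 on the grid

print(theta[np.argmax(posterior)])              # posterior mode ≈ 0.8 under a flat prior
```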

Conditional probability

Given two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

This expresses “how likely $A$ is, knowing that $B$ has occurred”.

Similarly,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$

Rearranging these definitions, we can relate joint and conditional probabilities:

$$P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A).$$
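
These identities are easy to check numerically. A tiny sketch on the die example (the events chosen here, “even number” and “larger than 3”, are just for illustration):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Laplace probability on the fair die: |event| / |omega|."""
    return Fraction(len(event), len(omega))

A = {2, 4, 6}    # "number is even"
B = {4, 5, 6}    # "number is larger than 3"

P_A_given_B = P(A & B) / P(B)         # P(A | B) = P(A ∩ B) / P(B)
print(P_A_given_B)                    # 2/3

# Product rule: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
assert P(A & B) == P_A_given_B * P(B) == (P(A & B) / P(A)) * P(A)
```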

Bayes’ theorem — conditional probability inverter

Starting from the definition of conditional probability,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)},$$

we equate the joint probability $P(A \cap B)$ from both expressions:

$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A).$$

Solving for $P(A \mid B)$ gives Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$

Bayes’ theorem inverts conditional probabilities: it allows us to compute $P(A \mid B)$ from $P(B \mid A)$, provided we also know the marginal probabilities $P(A)$ and $P(B)$.

Law of total probability (marginalisation)

Sometimes $P(B)$ is not known directly. Suppose

  • the sample space is partitioned into disjoint events $A_1, \dots, A_n$,

  • so that $A_i \cap A_j = \varnothing$ for $i \ne j$ and $\bigcup_{i=1}^n A_i = \Omega$.

Then the law of total probability states that

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i).$$

Special case with $A$ and its complement $\bar A$:

$$P(B) = P(B \mid A)\, P(A) + P\big(B \mid \bar A\big)\, P\big(\bar A\big).$$

This is a key building block for computing the denominator in Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{\displaystyle \sum_{i} P(B \mid A_i)\, P(A_i)}.$$
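
A small generic sketch of this computation over an $n$-way partition (the function name and the three-way numbers below are hypothetical, chosen only for illustration):

```python
def posterior_over_partition(likelihoods, priors):
    """P(A_i | B) for a partition A_1, ..., A_n, given P(B | A_i) and P(A_i).

    The denominator is the law of total probability:
    P(B) = sum_i P(B | A_i) P(A_i).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical three-way partition: P(B | A_i) and P(A_i).
print(posterior_over_partition([0.9, 0.5, 0.1], [0.2, 0.3, 0.5]))
# [0.4737, 0.3947, 0.1316]; the posteriors sum to 1.
```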

Intuition examples: smokers and car belts

Smoker example

Let

  • $F$: “person is female”

  • $S$: “person is a smoker”

We may want to compare

  • $P(F \mid S)$: probability that a smoker is female,

  • $P(S \mid F)$: probability that a female is a smoker.

These are in general not equal, and confusing them is a common mistake.
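
A tiny numerical illustration with a hypothetical contingency table (the counts are invented purely to show that the two conditional probabilities differ):

```python
# Hypothetical counts of (sex, smoking status) in a sample of 1000 people.
counts = {("F", "S"): 100, ("F", "not S"): 400,   # 500 females
          ("M", "S"): 200, ("M", "not S"): 300}   # 500 males

n_smokers = counts[("F", "S")] + counts[("M", "S")]       # 300
n_females = counts[("F", "S")] + counts[("F", "not S")]   # 500

P_F_given_S = counts[("F", "S")] / n_smokers   # 100/300 ≈ 0.33
P_S_given_F = counts[("F", "S")] / n_females   # 100/500 = 0.20
print(P_F_given_S, P_S_given_F)                # clearly not equal
```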

Car belt example (base-rate intuition)

Let

  • $A$: “person dies in a car accident”

  • $B$: “person wears a seat belt”

It may be true that most people who died in car accidents were wearing a belt, so $P(B \mid A)$ is large.
But this does not mean that wearing a belt is dangerous.

What really matters for your personal risk is $P(A \mid B)$ versus $P(A \mid \bar B)$, and because $P(A)$ is extremely small, seat belts can dramatically reduce risk even if $P(B \mid A)$ is high.

Medical tests: sensitivity, specificity, and the base-rate fallacy

Consider a diagnostic test, e.g. mammography for breast cancer.

Let

  • $B$: “person has the disease (breast cancer)”,

  • $\bar B$: “person does not have the disease”,

  • $T$: “test result is positive”,

  • $\bar T$: “test result is negative”.

The test is characterized by:

  • Sensitivity (true positive rate, TPR)

    $\text{sensitivity} = P(T \mid B)$
  • Specificity (true negative rate, TNR)

    $\text{specificity} = P(\bar T \mid \bar B)$
  • False positive rate (FPR)

    $\text{FPR} = P(T \mid \bar B) = 1 - \text{specificity}.$
  • Prevalence (base rate) of the disease in the population

    $\text{prevalence} = P(B).$

Posterior probability of disease given a positive test

What we really want to know is the posterior

$$P(B \mid T) = \frac{P(T \mid B)\, P(B)}{P(T)}.$$

Using the law of total probability for $P(T)$,

$$P(T) = P(T \mid B)\, P(B) + P(T \mid \bar B)\, P(\bar B),$$

we obtain the key formula

$$P(B \mid T) = \frac{P(T \mid B)\, P(B)}{P(T \mid B)\, P(B) + P(T \mid \bar B)\, P(\bar B)}.$$

Similarly, the probability of having the disease despite a negative test is

$$P(B \mid \bar T) = \frac{P(\bar T \mid B)\, P(B)}{P(\bar T \mid B)\, P(B) + P(\bar T \mid \bar B)\, P(\bar B)}.$$

The base-rate fallacy occurs when we ignore $P(B)$ and over-interpret $P(T \mid B)$ or $P(T \mid \bar B)$ alone.
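
A minimal Python sketch of the key formula above (the sensitivity, specificity, and prevalence values below are illustrative assumptions, roughly in the range often quoted for mammography screening):

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P(B | T): probability of having the disease given a positive test."""
    p_T_given_B = sensitivity
    p_T_given_notB = 1 - specificity                  # false positive rate
    p_T = p_T_given_B * prevalence + p_T_given_notB * (1 - prevalence)
    return p_T_given_B * prevalence / p_T

# Illustrative values only.
print(posterior_positive(sensitivity=0.90, specificity=0.91, prevalence=0.01))
# ≈ 0.09: even after a positive test, the disease probability stays below 10 %.
```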

Bayes’ theorem as a belief updater

Bayesian statistics is all about updating beliefs in light of new data.

Given a hypothesis/event AA and data dd:

  • Prior: $P(A)$ — belief in $A$ before seeing data.

  • Likelihood: $P(d \mid A)$ — probability of seeing data $d$ if $A$ is true.

  • Evidence: $P(d)$ — overall probability of the data, averaging over all possibilities.

  • Posterior: $P(A \mid d)$ — belief in $A$ after seeing data.

Bayes’ theorem:

$$P(A \mid d) = \frac{P(d \mid A)\, P(A)}{P(d)}.$$

In continuous form with densities,

$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}, \qquad p(d) = \int p(d \mid \theta)\, p(\theta)\, d\theta.$$

A famous slogan is

“Today’s posterior is tomorrow’s prior.”

As new data arrive, we repeatedly apply Bayes’ theorem to update our beliefs.
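
A minimal sketch of this sequential updating for the magic coin, assuming a Beta prior (which is conjugate to the binomial likelihood, so each update has a closed form); the batch data are invented for illustration:

```python
# "Today's posterior is tomorrow's prior": sequential Beta-binomial updating.
a, b = 1.0, 1.0                        # Beta(1, 1) = uniform prior (an assumption)

batches = [(8, 10), (3, 10), (6, 10)]  # (heads, tosses) per day, illustrative data
for heads, tosses in batches:
    a += heads                         # posterior is Beta(a + k, b + n - k) ...
    b += tosses - heads                # ... and becomes the prior for the next batch
    print(f"posterior mean so far: {a / (a + b):.3f}")
```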

DNA tests in court: prosecutor’s fallacy

Let

  • $G$: “suspect is guilty”,

  • $\bar G$: “suspect is not guilty”,

  • $T$: “DNA test is a positive match”.

Often, expert witnesses report a very small probability of a false match:

$$P(T \mid \bar G) \approx \frac{1}{2\,000\,000} \quad \text{(for example)}.$$

The prosecutor’s fallacy is to confuse this with $P(\bar G \mid T)$ or $P(G \mid T)$, saying things like “the probability that the suspect is innocent is 1 in 2 million”.

The correct quantity for the court is the posterior probability of guilt:

$$P(G \mid T) = \frac{P(T \mid G)\, P(G)}{P(T)}.$$

Using the law of total probability,

$$P(T) = P(T \mid G)\, P(G) + P(T \mid \bar G)\, P(\bar G),$$

so

$$P(G \mid T) = \frac{P(T \mid G)\, P(G)}{P(T \mid G)\, P(G) + P(T \mid \bar G)\, P(\bar G)}.$$

Because the prior $P(G)$ (roughly one divided by the number of people in the population who could be the culprit) is usually extremely small, even a very low $P(T \mid \bar G)$ does not automatically imply that $P(G \mid T)$ is close to 1.

This is conceptually similar to the medical-test/base-rate example.
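
A minimal sketch of the courtroom calculation (the match probabilities and the prior are assumptions chosen only to illustrate the effect of the base rate):

```python
def posterior_guilt(p_match_given_guilty, p_match_given_innocent, prior_guilt):
    """P(G | T) via Bayes' theorem with the two-way partition {G, not G}."""
    p_T = (p_match_given_guilty * prior_guilt
           + p_match_given_innocent * (1 - prior_guilt))
    return p_match_given_guilty * prior_guilt / p_T

# Assumed numbers: a certain match if guilty, a 1-in-2,000,000 false-match rate,
# and a prior of 1 in 100,000 (any of ~100,000 people could be the culprit).
print(posterior_guilt(1.0, 1 / 2_000_000, 1 / 100_000))   # ≈ 0.95, not 0.9999995
```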

When do Bayesian and frequentist answers differ?

The two approaches can give similar or very different answers, depending on the situation.

They tend to differ when:

  1. Strong prior information is available (Bayesian) and the data are weak or noisy.

  2. There is a small sample size (few observations).

  3. Base rates matter (rare events, like certain diseases or crimes).

In Bayesian inference we explicitly combine

$$\text{posterior} \propto \text{likelihood} \times \text{prior},$$

while in frequentist maximum likelihood we use only the likelihood.

Bayesian methods make it possible to make probability statements directly about parameters (e.g. “how plausible is each value of $p$?”), whereas frequentist tools focus on sampling properties and typically provide point estimates and hypothesis tests.
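
A small sketch of one situation where the answers visibly differ: very few observations combined with a strong prior (the data and the Beta(10, 10) prior are assumptions for illustration):

```python
# Frequentist MLE vs. Bayesian posterior mean for the coin, with little data.
k, n = 3, 3                               # 3 heads in 3 tosses (assumed data)
a, b = 10.0, 10.0                         # strong Beta prior centered on p = 0.5

mle = k / n                               # frequentist point estimate: 1.0
posterior_mean = (a + k) / (a + b + n)    # Beta-binomial posterior mean: 13/23

print(f"MLE: {mle:.2f}, posterior mean: {posterior_mean:.2f}")
# MLE: 1.00, posterior mean: 0.57; the prior keeps the estimate near 0.5.
```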