Probability Theory and Bayes’ Theorem

Random experiments, events, and random variables

A random experiment produces outcomes from a sample space $\Omega$ of elementary events.

  • Example: rolling a fair die.
    $\Omega = \{1,2,3,4,5,6\}$

An event $E$ is a subset of $\Omega$.

  • Example: “number is larger than 3”
    $E = \{4,5,6\} \subset \Omega$

A random variable $X$ is a numerical quantity whose value depends on the outcome of the random experiment.

Examples:

  • $X$: the number shown after a die roll (values in $\{1,2,3,4,5,6\}$).

  • $Y$: outcome of a coin toss, represented as $Y \in \{0,1\}$ (e.g. $0 = \text{tails}$, $1 = \text{heads}$).

Probability and the Laplace model

The probability $P(E)$ of an event $E$ satisfies

$$0 \le P(E) \le 1$$

and quantifies how likely it is that $E$ will occur. A very simple model is the Laplace model (principle of indifference): all elementary outcomes are assumed equally likely, so that

$$P(E) = \frac{|E|}{|\Omega|} = \frac{\text{number of outcomes in } E}{\text{number of possible outcomes}}.$$

For the die example, the event “number is larger than 3” has probability $P(E) = 3/6 = 1/2$.

The Laplace model is simple but limited: many real problems have outcomes with different probabilities, so we need richer probability models.

Probability mass function (PMF)

For a discrete random variable $X$, the probability mass function (PMF) $p_X$ assigns a probability to each possible value $x$:

$$p_X(x) = P(X = x).$$

The PMF must satisfy:

  1. Non-negativity:

    $p_X(x) \ge 0 \quad \text{for all } x$
  2. Normalization:

    $\sum_{x} p_X(x) = 1$

The expectation (mean) of $X$ is

$$\mathbb{E}[X] = \sum_{x} x \, p_X(x),$$

and the variance is

$$\operatorname{Var}(X) = \mathbb{E}\big[(X - \mathbb{E}[X])^2\big] = \sum_{x} (x - \mu)^2 \, p_X(x), \quad \text{where } \mu = \mathbb{E}[X].$$

For continuous random variables, the PMF is replaced by a probability density function (PDF), and sums become integrals.
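
For the discrete case, these definitions can be evaluated directly. A minimal Python sketch, assuming the fair-die PMF from the Laplace example above:

```python
# PMF, expectation, and variance of a fair die roll.
# The uniform PMF is an assumption matching the Laplace model above.
values = [1, 2, 3, 4, 5, 6]
pmf = {x: 1 / 6 for x in values}                # p_X(x) = 1/6 for every face

assert abs(sum(pmf.values()) - 1.0) < 1e-12     # normalization check

mean = sum(x * p for x, p in pmf.items())               # E[X] = 3.5
var = sum((x - mean) ** 2 * p for x, p in pmf.items())  # Var(X) ≈ 2.9167

print(f"E[X] = {mean:.4f}, Var(X) = {var:.4f}")
```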

Sampling vs. inference — the magic coin and the binomial distribution

Consider a magic coin with probability $p$ of showing heads on any toss.

Sampling problem

If we know $p$, what is the probability that we observe exactly $k$ heads in $n$ tosses?

The answer is given by the binomial distribution:

$$P(K = k \mid p) = \binom{n}{k} \, p^{k} (1-p)^{n-k}, \quad k = 0,1,\dots,n.$$

This is a sampling question: given $p$, what does the data look like?
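
A minimal Python sketch of this sampling direction (the value $p = 0.5$ is only an assumption for illustration):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(K = k | p): probability of exactly k heads in n independent tosses."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Sampling question: if the coin were fair (p = 0.5), how likely are exactly
# 8 heads in 10 tosses?
print(binomial_pmf(8, 10, 0.5))   # 45 / 1024 ≈ 0.0439
```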

Inference problem

We toss the coin $n=10$ times and observe $k=8$ heads.
What can we say about the unknown probability $p$?

This is an inference question: from observed data we want to learn about an unknown parameter $p$.

  • Frequentist approach: estimate $p$ with a point estimate such as the maximum likelihood estimator (MLE).

  • Bayesian approach: treat $p$ itself as random, assign a prior belief, and update it using Bayes’ theorem to get a posterior.

Frequentist view and Maximum Likelihood

Frequentist definition of probability

In the frequentist view, the probability of an event $E$ is defined as its long-run relative frequency in infinitely many identical repetitions:

$$P(E) \;=\; \lim_{n \to \infty} \frac{\text{number of times } E \text{ occurs in } n \text{ trials}}{n}.$$

Probability statements are made about data (events in repeated experiments), not about parameters.

Maximum Likelihood for the magic coin

For the coin tossed $n$ times with $k$ observed heads, the likelihood of a candidate parameter $p$ is

$$L(p \mid k, n) \;=\; P(K = k \mid p) \;=\; \binom{n}{k} p^k (1-p)^{n-k}.$$

We can find the maximum likelihood estimator (MLE) $\hat p$ by maximizing $L(p)$ (or $\log L(p)$) with respect to $p$:

$$\hat p = \underset{p \in [0,1]}{\arg\max}\; L(p \mid k, n).$$

Taking logs and differentiating,

$$\ell(p) = \log L(p \mid k,n) = \log \binom{n}{k} + k \log p + (n-k)\log(1-p),$$

$$\frac{d\ell}{dp} = \frac{k}{p} - \frac{n-k}{1-p}.$$

Setting the derivative to zero and solving gives the familiar estimator

$$\hat p = \frac{k}{n}.$$

This is the standard frequentist answer to the question “which value of $p$ makes the observed data most likely?”
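
As a quick numerical check of this closed-form result, a sketch that maximizes the log-likelihood over a grid of candidate values for $p$ (the grid resolution is an arbitrary choice):

```python
import numpy as np

n, k = 10, 8            # observed data from the example: 8 heads in 10 tosses

# Log-likelihood up to the constant log C(n, k), which does not affect the argmax.
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = k * np.log(p_grid) + (n - k) * np.log(1 - p_grid)

p_hat = p_grid[np.argmax(log_lik)]
print(p_hat, k / n)     # both ≈ 0.8
```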

Bayesian view: probability as degree of belief

In the Bayesian view, probability measures a degree of belief (plausibility) held by an observer.

  • Probability statements can be made not only about data but also about parameters.

  • Different observers can have different probabilities for the same event, because they may have different prior knowledge.

For an event $A$ and observed data $d$, Bayes’ theorem updates our belief:

$$P(A \mid d) = \frac{P(d \mid A)\, P(A)}{P(d)}.$$

In more general (parameter) notation, with parameter $\theta$ and data $d$, the posterior is

$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}.$$

Here

  • $p(\theta)$ is the prior (belief before seeing the data),

  • $p(d \mid \theta)$ is the likelihood,

  • $p(\theta \mid d)$ is the posterior (updated belief),

  • $p(d)$ is the evidence (normalizing constant).

In proportional form (often used in practice):

$$p(\theta \mid d) \;\propto\; p(d \mid \theta)\, p(\theta).$$
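
A minimal grid-approximation sketch of this proportionality for the magic-coin example (the uniform prior and the grid resolution are assumptions made only for illustration):

```python
import numpy as np

n, k = 10, 8                                    # 8 heads in 10 tosses

theta = np.linspace(0.001, 0.999, 999)          # grid of candidate values for p
prior = np.ones_like(theta)                     # uniform prior (an assumption)
likelihood = theta**k * (1 - theta) ** (n - k)  # binomial likelihood, constant dropped

unnormalized = likelihood * prior               # posterior ∝ likelihood × prior
posterior = unnormalized / unnormalized.sum()   # normalize so it sums to 1 on the grid

print(theta[np.argmax(posterior)])              # posterior mode ≈ 0.8 under a flat prior
```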

Conditional probability

Given two events $A$ and $B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

This expresses “how likely $A$ is, knowing that $B$ has occurred”.

Similarly,

$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$

Rearranging these definitions, we can relate joint and conditional probabilities:

$$P(A \cap B) = P(A \mid B)\, P(B) = P(B \mid A)\, P(A).$$
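
These identities are easy to check numerically. A tiny sketch on the die example (the events chosen here, “even number” and “larger than 3”, are just for illustration):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    """Laplace probability on the fair die: |event| / |omega|."""
    return Fraction(len(event), len(omega))

A = {2, 4, 6}    # "number is even"
B = {4, 5, 6}    # "number is larger than 3"

P_A_given_B = P(A & B) / P(B)         # P(A | B) = P(A ∩ B) / P(B)
print(P_A_given_B)                    # 2/3

# Product rule: P(A ∩ B) = P(A | B) P(B) = P(B | A) P(A)
assert P(A & B) == P_A_given_B * P(B) == (P(A & B) / P(A)) * P(A)
```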

Bayes’ theorem — conditional probability inverter

Starting from the definition of conditional probability,

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)},$$

we equate the joint probability $P(A \cap B)$ from both expressions:

$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A).$$

Solving for $P(A \mid B)$ gives Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$

Bayes’ theorem inverts conditional probabilities: it allows us to compute $P(A \mid B)$ from $P(B \mid A)$, provided we also know the marginal probabilities $P(A)$ and $P(B)$.

Law of total probability (marginalisation)

Sometimes $P(B)$ is not known directly. Suppose

  • the sample space is partitioned into disjoint events $A_1, \dots, A_n$,

  • so that $A_i \cap A_j = \varnothing$ for $i \ne j$ and $\bigcup_{i=1}^n A_i = \Omega$.

Then the law of total probability states that

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\, P(A_i).$$

Special case with $A$ and its complement $\bar A$:

$$P(B) = P(B \mid A)\, P(A) + P\big(B \mid \bar A\big)\, P\big(\bar A\big).$$

This is a key building block for computing the denominator in Bayes’ theorem:

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{\displaystyle \sum_{i} P(B \mid A_i)\, P(A_i)}.$$
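
A small generic sketch of this computation over an $n$-way partition (the function name and the three-way numbers below are hypothetical, chosen only for illustration):

```python
def posterior_over_partition(likelihoods, priors):
    """P(A_i | B) for a partition A_1, ..., A_n, given P(B | A_i) and P(A_i).

    The denominator is the law of total probability:
    P(B) = sum_i P(B | A_i) P(A_i).
    """
    evidence = sum(l * p for l, p in zip(likelihoods, priors))
    return [l * p / evidence for l, p in zip(likelihoods, priors)]

# Hypothetical three-way partition: P(B | A_i) and P(A_i).
print(posterior_over_partition([0.9, 0.5, 0.1], [0.2, 0.3, 0.5]))
# [0.4737, 0.3947, 0.1316]; the posteriors sum to 1.
```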

Intuition examples: smokers and car belts

Smoker example

Let

  • $F$: “person is female”

  • $S$: “person is a smoker”

We may want to compare

  • $P(F \mid S)$: probability that a smoker is female,

  • $P(S \mid F)$: probability that a female is a smoker.

These are in general not equal, and confusing them is a common mistake.
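
A tiny numerical illustration with a hypothetical contingency table (the counts are invented purely to show that the two conditional probabilities differ):

```python
# Hypothetical counts of (sex, smoking status) in a sample of 1000 people.
counts = {("F", "S"): 100, ("F", "not S"): 400,   # 500 females
          ("M", "S"): 200, ("M", "not S"): 300}   # 500 males

n_smokers = counts[("F", "S")] + counts[("M", "S")]       # 300
n_females = counts[("F", "S")] + counts[("F", "not S")]   # 500

P_F_given_S = counts[("F", "S")] / n_smokers   # 100/300 ≈ 0.33
P_S_given_F = counts[("F", "S")] / n_females   # 100/500 = 0.20
print(P_F_given_S, P_S_given_F)                # clearly not equal
```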

Car belt example (base-rate intuition)

Let

  • $A$: “person dies in a car accident”

  • $B$: “person wears a seat belt”

It may be true that most people who died in car accidents were wearing a belt, so $P(B \mid A)$ is large.
But this does not mean that wearing a belt is dangerous.

What really matters for your personal risk is $P(A \mid B)$ versus $P(A \mid \bar B)$, and because $P(A)$ is extremely small, seat belts can dramatically reduce risk even if $P(B \mid A)$ is high.

Medical tests: sensitivity, specificity, and the base-rate fallacy

Consider a diagnostic test, e.g. mammography for breast cancer.

Let

  • $B$: “person has the disease (breast cancer)”,

  • $\bar B$: “person does not have the disease”,

  • $T$: “test result is positive”,

  • $\bar T$: “test result is negative”.

The test is characterized by:

  • Sensitivity (true positive rate, TPR)

    $\text{sensitivity} = P(T \mid B)$
  • Specificity (true negative rate, TNR)

    $\text{specificity} = P(\bar T \mid \bar B)$
  • False positive rate (FPR)

    $\text{FPR} = P(T \mid \bar B) = 1 - \text{specificity}.$
  • Prevalence (base rate) of the disease in the population

    $\text{prevalence} = P(B).$

Posterior probability of disease given a positive test

What we really want to know is the posterior

$$P(B \mid T) = \frac{P(T \mid B)\, P(B)}{P(T)}.$$

Using the law of total probability for $P(T)$,

$$P(T) = P(T \mid B)\, P(B) + P(T \mid \bar B)\, P(\bar B),$$

we obtain the key formula

$$P(B \mid T) = \frac{P(T \mid B)\, P(B)}{P(T \mid B)\, P(B) + P(T \mid \bar B)\, P(\bar B)}.$$

Similarly, the probability of having the disease despite a negative test is

$$P(B \mid \bar T) = \frac{P(\bar T \mid B)\, P(B)}{P(\bar T \mid B)\, P(B) + P(\bar T \mid \bar B)\, P(\bar B)}.$$

The base-rate fallacy occurs when we ignore $P(B)$ and over-interpret $P(T \mid B)$ or $P(T \mid \bar B)$ alone.
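
A minimal Python sketch of the key formula above (the sensitivity, specificity, and prevalence values below are illustrative assumptions, roughly in the range often quoted for mammography screening):

```python
def posterior_positive(sensitivity, specificity, prevalence):
    """P(B | T): probability of having the disease given a positive test."""
    p_T_given_B = sensitivity
    p_T_given_notB = 1 - specificity                  # false positive rate
    p_T = p_T_given_B * prevalence + p_T_given_notB * (1 - prevalence)
    return p_T_given_B * prevalence / p_T

# Illustrative values only.
print(posterior_positive(sensitivity=0.90, specificity=0.91, prevalence=0.01))
# ≈ 0.09: even after a positive test, the disease probability stays below 10 %.
```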

Bayes’ theorem as a belief updater

Bayesian statistics is all about updating beliefs in light of new data.

Given a hypothesis/event AA and data dd:

  • Prior: $P(A)$ — belief in $A$ before seeing data.

  • Likelihood: $P(d \mid A)$ — probability of seeing data $d$ if $A$ is true.

  • Evidence: $P(d)$ — overall probability of the data, averaging over all possibilities.

  • Posterior: $P(A \mid d)$ — belief in $A$ after seeing data.

Bayes’ theorem:

$$P(A \mid d) = \frac{P(d \mid A)\, P(A)}{P(d)}.$$

In continuous form with densities,

$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}, \qquad p(d) = \int p(d \mid \theta)\, p(\theta)\, d\theta.$$

A famous slogan is

“Today’s posterior is tomorrow’s prior.”

As new data arrive, we repeatedly apply Bayes’ theorem to update our beliefs.
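
A minimal sketch of this sequential updating for the magic coin, assuming a Beta prior (which is conjugate to the binomial likelihood, so each update has a closed form); the batch data are invented for illustration:

```python
# "Today's posterior is tomorrow's prior": sequential Beta-binomial updating.
a, b = 1.0, 1.0                        # Beta(1, 1) = uniform prior (an assumption)

batches = [(8, 10), (3, 10), (6, 10)]  # (heads, tosses) per day, illustrative data
for heads, tosses in batches:
    a += heads                         # posterior is Beta(a + k, b + n - k) ...
    b += tosses - heads                # ... and becomes the prior for the next batch
    print(f"posterior mean so far: {a / (a + b):.3f}")
```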

DNA tests in court: prosecutor’s fallacy

Let

  • $G$: “suspect is guilty”,

  • $\bar G$: “suspect is not guilty”,

  • $T$: “DNA test is a positive match”.

Often, expert witnesses report a very small probability of a false match:

$$P(T \mid \bar G) \approx \frac{1}{2\,000\,000} \quad \text{(for example)}.$$

The prosecutor’s fallacy is to confuse this with $P(\bar G \mid T)$ or $P(G \mid T)$, saying things like “the probability that the suspect is innocent is 1 in 2 million”.

The correct quantity for the court is the posterior probability of guilt:

$$P(G \mid T) = \frac{P(T \mid G)\, P(G)}{P(T)}.$$

Using the law of total probability,

$$P(T) = P(T \mid G)\, P(G) + P(T \mid \bar G)\, P(\bar G),$$

so

$$P(G \mid T) = \frac{P(T \mid G)\, P(G)}{P(T \mid G)\, P(G) + P(T \mid \bar G)\, P(\bar G)}.$$

Because the prior $P(G)$ (roughly one divided by the number of people in the population who could be the culprit) is usually extremely small, even a very low $P(T \mid \bar G)$ does not automatically imply that $P(G \mid T)$ is close to 1.

This is conceptually similar to the medical-test/base-rate example.
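
A minimal sketch of the courtroom calculation (the match probabilities and the prior are assumptions chosen only to illustrate the effect of the base rate):

```python
def posterior_guilt(p_match_given_guilty, p_match_given_innocent, prior_guilt):
    """P(G | T) via Bayes' theorem with the two-way partition {G, not G}."""
    p_T = (p_match_given_guilty * prior_guilt
           + p_match_given_innocent * (1 - prior_guilt))
    return p_match_given_guilty * prior_guilt / p_T

# Assumed numbers: a certain match if guilty, a 1-in-2,000,000 false-match rate,
# and a prior of 1 in 100,000 (any of ~100,000 people could be the culprit).
print(posterior_guilt(1.0, 1 / 2_000_000, 1 / 100_000))   # ≈ 0.95, not 0.9999995
```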

When do Bayesian and frequentist answers differ?

The two approaches can give similar or very different answers, depending on the situation.

They tend to differ when:

  1. Strong prior information is available (Bayesian) and the data are weak or noisy.

  2. There is a small sample size (few observations).

  3. Base rates matter (rare events, like certain diseases or crimes).

In Bayesian inference we explicitly combine

$$\text{posterior} \propto \text{likelihood} \times \text{prior},$$

while in frequentist maximum likelihood we use only the likelihood.

Bayesian methods make it possible to make probability statements directly about parameters (e.g. “how plausible is each value of $p$?”), whereas frequentist tools focus on sampling properties and typically provide point estimates and hypothesis tests.
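
A small sketch of one situation where the answers visibly differ: very few observations combined with a strong prior (the data and the Beta(10, 10) prior are assumptions for illustration):

```python
# Frequentist MLE vs. Bayesian posterior mean for the coin, with little data.
k, n = 3, 3                               # 3 heads in 3 tosses (assumed data)
a, b = 10.0, 10.0                         # strong Beta prior centered on p = 0.5

mle = k / n                               # frequentist point estimate: 1.0
posterior_mean = (a + k) / (a + b + n)    # Beta-binomial posterior mean: 13/23

print(f"MLE: {mle:.2f}, posterior mean: {posterior_mean:.2f}")
# MLE: 1.00, posterior mean: 0.57; the prior keeps the estimate near 0.5.
```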