Bayes' rule
Bayes' theorem is a result in probability theory,
which expresses the conditional probability of A given B in terms of the conditional probability of B given A and the marginal probabilities of A and B.
As a mathematical theorem,
Bayes' theorem is valid in all interpretations of probability.
However, there is disagreement as to what kinds of variables can be substituted for A and B in the theorem;
this topic is treated at greater length in the articles on Bayesian probability and frequentist probability.
Historical remarks
Bayes' theorem is named after the Reverend Thomas Bayes (1702–61). Bayes worked on the problem of computing a distribution for the parameter of a binomial distribution (to use modern terminology); his work was edited and presented posthumously (1763) by his friend Richard Price, in An Essay towards solving a Problem in the Doctrine of Chances. Bayes' results were replicated and extended in an essay of 1774 by Laplace, who apparently was not aware of Bayes' work.
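The main result (Proposition 9 in the essay) derived by Bayes is the following: assuming a uniform distribution for the prior distribution of the binomial parameter p, the probability that p is between two values a and b is
$$P(a < p < b \mid m, n) = \frac{\int_a^b \binom{n+m}{m} p^m (1-p)^n \, dp}{\int_0^1 \binom{n+m}{m} p^m (1-p)^n \, dp}$$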
where m is the number of observed successes and n the number of observed failures. His preliminary results, in particular Propositions 3, 4, and 5, imply the result now called Bayes' Theorem (as described below), but it does not appear that Bayes himself emphasized or focused on that result.
What is "Bayesian" about Proposition 9 is that Bayes presented it as a probability for the parameter p. That is, not only can one compute probabilities for experimental outcomes, but also for the parameter which governs them, and the same algebra is used to make inferences of either kind. Interestingly, Bayes actually states his question in a way that might make the idea of assigning a probability distribution to a parameter palatable to a frequentist. He supposes that a billiard ball is thrown at random onto a billiard table, and that the probabilities p and q are the probabilities that subsequent billiard balls will fall above or below the first ball. By making the binomial parameter p depend on a random event, he cleverly escapes a philosophical quagmire that he most likely was not even aware was an issue.
Statement of Bayes' theorem
Bayes' theorem is a relation among conditional and marginal probabilities.
It can be viewed as a means of incorporating information, from an observation, for example, to produce a modified or updated probability distribution.
Bayes' theorem is valid in all interpretations of probability,
but the interpretations differ in the range of statements to which they allow it to be applied.
Bayesian probability, as a school of thought, is based on the notion that Bayes' theorem can be applied to any propositions, whether they are statements about experimental variables, parameters, hypotheses, latent (unobserved) variables, or any other kind of statements. In contrast, according to the frequentist school, Bayes' theorem can only be applied to problems in which all statements are about experimental variables alone,
and therefore some non-probabilistic inference mechanism must be used to reason about any other kind of statement.
To derive Bayes' theorem, note first from the definition of conditional probability that
$$P(A|B)\,P(B) = P(A,B) = P(B|A)\,P(A),$$
denoting by P(A,B) the joint probability of A and B.
Dividing the left- and right-hand sides by P(B), which requires P(B) > 0, we obtain
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B)},$$
which is Bayes' theorem.
Each term in Bayes' theorem has a conventional name.
The term P(A) is called the prior probability of A.
It is "prior" in the sense that it precedes any information about B.
Equivalently, P(A) is also called the marginal probability of A.
The term P(A|B) is called the posterior probability of A, given B. It is "posterior" in the sense that it is derived from or entailed by the specified value of B.
The term P(B|A), for a specific value of B, is called the likelihood function for A.
The term P(B) is called the prior or marginal probability of B.
Alternative forms of Bayes' theorem
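Bayes' theorem is often embellished by noting that
$$P(B) = P(A,B) + P(A^C,B) = P(B|A)\,P(A) + P(B|A^C)\,P(A^C),$$
so the theorem can be restated as
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A^C)\,P(A^C)},$$
where A^C is the complementary event of A. More generally, where {A_i} forms a partition of the event space,
$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}$$
for any A_i in the partition.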
See also the law of total probability.
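As an illustration, the partition form of the theorem can be sketched in a few lines of Python. The function name posterior_over_partition and the three-event example values are assumptions made for this sketch and do not come from the theorem itself; exact fractions are used so that the arithmetic can be checked by hand.

```python
from fractions import Fraction


def posterior_over_partition(priors, likelihoods):
    """Return [P(A_i | B)] given [P(A_i)] and [P(B | A_i)] for a partition {A_i}."""
    if len(priors) != len(likelihoods):
        raise ValueError("priors and likelihoods must have the same length")
    if abs(sum(priors) - 1) > 1e-12:
        raise ValueError("priors over a partition must sum to 1")
    # Joint probabilities P(A_i, B) = P(B | A_i) P(A_i).
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Law of total probability: P(B) = sum_j P(B | A_j) P(A_j).
    evidence = sum(joint)
    if evidence == 0:
        raise ValueError("P(B) is zero, so P(A_i | B) is undefined")
    return [j / evidence for j in joint]


# Hypothetical three-event partition (values invented for illustration).
priors = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
likelihoods = [Fraction(1, 10), Fraction(1, 5), Fraction(1, 2)]
posteriors = posterior_over_partition(priors, likelihoods)
print(posteriors)       # one posterior probability per A_i
print(sum(posteriors))  # the posteriors again sum to 1
```

The two-event form with A and A^C is the special case of a partition with two members.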
Bayes' theorem for probability densities
There is also a version of Bayes' theorem for continuous distributions.
It is somewhat harder to derive, since probability densities,
strictly speaking, are not probabilities,
so Bayes' theorem has to be established by a limit process;
see Papoulis (citation below), Section 7.3 for an elementary derivation.
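Bayes' theorem for probability densities is formally similar to the theorem for probabilities:
$$f(y|x) = \frac{f(x,y)}{f(x)} = \frac{f(x|y)\,f(y)}{f(x)}$$
and there is an analogous statement of the law of total probability:
$$f(y|x) = \frac{f(x|y)\,f(y)}{\int_{-\infty}^{\infty} f(x|y)\,f(y)\,dy}$$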
As in the discrete case,
the terms have standard names.
f(x, y) is the joint distribution of x and y,
f(y|x) is the posterior distribution,
f(x|y) is the likelihood function,
and f(x) and f(y) are marginal distributions.
Here we have indulged in a conventional abuse of notation,
using f for each one of these terms,
although each one is really a different function;
the functions are distinguished by the names of their arguments.
Extensions of Bayes' theorem
Theorems analogous to Bayes' theorem hold in problems with more than two variables.
These theorems are not given distinct names,
as they may be mass-produced by applying the laws of probability.
The general strategy is to work with a decomposition of the joint probability, and to marginalize (integrate) over the variables that are not of interest.
Depending on the form of the decomposition,
it may be possible to prove that some integrals must be 1,
and thus they fall out of the decomposition;
exploiting this property can reduce the computations very substantially.
A Bayesian network is essentially a mechanism for automatically generating the extensions of Bayes' theorem that are appropriate for a given decomposition of the joint probability.
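The marginalization strategy can be sketched for a small discrete case. The example below assumes a hypothetical chain of three binary variables A, B, C whose joint probability decomposes as P(A) P(B|A) P(C|B); the variable names and all numbers are invented for illustration. It computes P(A|C) by summing out B, and also shows a sum that must equal 1 dropping out of the computation of P(A|B).

```python
from fractions import Fraction as F

# Hypothetical chain A -> B -> C of binary variables (numbers are made up).
p_a = {1: F(3, 10), 0: F(7, 10)}           # P(A = a)
p_b1_given_a = {1: F(4, 5), 0: F(1, 5)}    # P(B = 1 | A = a)
p_c1_given_b = {1: F(9, 10), 0: F(1, 10)}  # P(C = 1 | B = b)


def p_b_given_a(b, a):
    return p_b1_given_a[a] if b == 1 else 1 - p_b1_given_a[a]


def p_c_given_b(c, b):
    return p_c1_given_b[b] if c == 1 else 1 - p_c1_given_b[b]


def joint(a, b, c):
    """P(A=a, B=b, C=c) from the assumed decomposition."""
    return p_a[a] * p_b_given_a(b, a) * p_c_given_b(c, b)


def posterior_a_given_c(a, c):
    """P(A=a | C=c): B is not of interest, so it is summed out."""
    numerator = sum(joint(a, b, c) for b in (0, 1))
    evidence = sum(joint(a2, b, c) for a2 in (0, 1) for b in (0, 1))
    return numerator / evidence


def posterior_a_given_b(a, b):
    """P(A=a | B=b) computed the long way, summing the joint over C."""
    numerator = sum(joint(a, b, c) for c in (0, 1))
    evidence = sum(joint(a2, b, c) for a2 in (0, 1) for c in (0, 1))
    return numerator / evidence


def posterior_a_given_b_short(a, b):
    """Same quantity, using the fact that sum_c P(C=c | B=b) = 1 drops out."""
    evidence = sum(p_a[a2] * p_b_given_a(b, a2) for a2 in (0, 1))
    return p_a[a] * p_b_given_a(b, a) / evidence


print(posterior_a_given_c(1, 1))
print(posterior_a_given_b(1, 1) == posterior_a_given_b_short(1, 1))
```

The short version never touches the table for C, which is the kind of saving the text describes; in larger decompositions the reduction in work can be much greater.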
Examples
From which bowl is the cookie?
To illustrate, suppose there are two bowls full of cookies. Bowl #1 has 10 chocolate chip and 30 plain cookies, while bowl #2 has 20 of each. Our friend Fred picks a bowl at random, and then picks a cookie at random. We may assume there is no reason to believe Fred treats one bowl differently from another, and likewise for the cookies. The cookie turns out to be a plain one. How probable is it that Fred picked it out of bowl #1?
Intuitively, it seems clear that the answer should be more than 50%, since there are more plain cookies in bowl #1. The precise answer is given by Bayes' theorem. Let H1 be the hypothesis that Fred picked bowl #1, and H2 the hypothesis that he picked bowl #2.
It is given that the bowls are identical from Fred's point of view, thus P(H1) = P(H2), and the two must add up to 1, so both are equal to 50%.
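The "data" D consists in the observation of a plain cookie. From the contents of the bowls, we know that P(D | H1) = 30/40 = 75% and P(D | H2) = 20/40 = 50%. Bayes' formula then yields
$$P(H_1|D) = \frac{P(H_1)\,P(D|H_1)}{P(H_1)\,P(D|H_1) + P(H_2)\,P(D|H_2)} = \frac{0.5 \times 0.75}{0.5 \times 0.75 + 0.5 \times 0.5} = 0.6$$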
Before observing the cookie, the probability that Fred chose bowl #1 is the prior probability, P(H1), which is 50%.
After observing the cookie, we revise the probability to P(H1|D), which is 60%.
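The cookie calculation can be reproduced with exact fractions. This is a minimal sketch; the dictionary keys bowl1 and bowl2 and the variable names are assumptions made here, standing for the hypotheses H1 and H2.

```python
from fractions import Fraction

# Priors: Fred is equally likely to pick either bowl.
prior = {"bowl1": Fraction(1, 2), "bowl2": Fraction(1, 2)}
# Likelihood of drawing a plain cookie from each bowl (plain / total).
likelihood_plain = {"bowl1": Fraction(30, 40), "bowl2": Fraction(20, 40)}

# P(D): total probability of observing a plain cookie.
evidence = sum(prior[h] * likelihood_plain[h] for h in prior)
# P(H | D) for each hypothesis H.
posterior = {h: prior[h] * likelihood_plain[h] / evidence for h in prior}

print(posterior["bowl1"])  # 3/5
print(posterior["bowl2"])  # 2/5
```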
False positives in a medical test
False positives are a problem in any kind of test: no test is perfect, and sometimes the test will incorrectly report a positive result. For example, if a test for a particular disease is performed on a patient, then there is a chance (usually small) that the test will return a positive result even if the patient does not have the disease. The problem lies, however, not just in the chance of a false positive prior to testing, but in determining the chance that a given positive result is in fact a false positive. As we will demonstrate, using Bayes' theorem, if a condition is rare, then the majority of positive results may be false positives, even if the test for that condition is (otherwise) reasonably accurate.
Suppose that a test for a particular disease has a very high success rate:
- if a tested patient has the disease, the test accurately reports this, a 'positive', 99% of the time (or, with probability 0.99), and
- if a tested patient does not have the disease, the test accurately reports that, a 'negative', 95% of the time (i.e. with probability 0.95).
Suppose also, however, that only 0.1% of the population have that disease (i.e. a randomly chosen patient has it with probability 0.001). We now have all the information required to use Bayes' theorem to calculate the probability that a positive test result is a false positive.
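Let A be the event that the patient has the disease, and B be the event that the test returns a positive result. Then, using the second form of Bayes' theorem (above), the probability of a true positive is
$$P(A|B) = \frac{P(B|A)\,P(A)}{P(B|A)\,P(A) + P(B|A^C)\,P(A^C)} = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} \approx 0.019$$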
and hence the probability of a false positive is about (1 − 0.019) = 0.981.
Despite the apparently high accuracy of the test, the incidence of the disease is so low (one in a thousand) that the vast majority of patients who test positive (about 98 in a hundred) do not have the disease. Nonetheless, the probability that a patient who tests positive has the disease, about 1.9%, is roughly 20 times the 0.1% probability before the outcome of the test was known. The test is not useless, and re-testing may improve the reliability of the result. In this case, Bayes' theorem helps show that the accuracy of tests for rare conditions must be very high in order to produce reliable results from a single test, because of the possibility of false positives.
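The figures in this example can be checked with a short Python sketch. The function name and the parameter names prevalence, sensitivity and specificity are labels chosen here for the quantities 0.001, 0.99 and 0.95 given above. The second call illustrates the remark about re-testing under the added assumption, not stated in the example, that a repeated test is conditionally independent of the first given the patient's true condition.

```python
from fractions import Fraction


def prob_disease_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) by the two-event form of Bayes' theorem."""
    true_pos = sensitivity * prevalence               # P(B | A) P(A)
    false_pos = (1 - specificity) * (1 - prevalence)  # P(B | A^C) P(A^C)
    return true_pos / (true_pos + false_pos)


prevalence = Fraction(1, 1000)   # 0.1% of the population has the disease
sensitivity = Fraction(99, 100)  # P(positive | disease)
specificity = Fraction(95, 100)  # P(negative | no disease)

after_one = prob_disease_given_positive(prevalence, sensitivity, specificity)
print(float(after_one))      # about 0.019: probability of a true positive
print(float(1 - after_one))  # about 0.981: probability of a false positive

# Assumption: a second, conditionally independent positive result.
# The posterior after the first test serves as the prior for the second.
after_two = prob_disease_given_positive(after_one, sensitivity, specificity)
print(float(after_two))      # about 0.28
```

Even under that independence assumption, two positive results leave the probability of disease well below one half, which underlines how strongly the low prevalence dominates.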
Posterior distribution of the binomial parameter
In this example we consider the computation of the posterior distribution for the binomial parameter.
This is the same problem considered by Bayes in his essay.
We are given m observed successes and n observed failures in a binomial experiment.
The experiment may be tossing a coin, drawing a ball from an urn, or asking someone their opinion, among many other possibilities.
What we know about the parameter (let's call it a) is stated as the prior distribution, p(a).
For a given value of a,
the probability of m successes in m+n trials is
$$p(m, n \mid a) = \binom{n+m}{m} a^m (1-a)^n$$
Since m and n are fixed, and a is unknown,
this is a likelihood function for a.
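From the continuous form of the law of total probability we have
$$p(a \mid m, n) = \frac{p(m, n \mid a)\,p(a)}{\int_0^1 p(m, n \mid a)\,p(a)\,da} = \frac{\binom{n+m}{m} a^m (1-a)^n\,p(a)}{\int_0^1 \binom{n+m}{m} a^m (1-a)^n\,p(a)\,da}$$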
For some special choices of the prior distribution p(a),
the integral can be solved and the posterior takes a convenient form.
In particular,
if p(a) is a beta distribution with parameters m0 and n0,
then the posterior is also a beta distribution with parameters m+m0 and n+n0.
A conjugate prior is a prior distribution, such as the beta distribution in the above example, which has the property that the posterior is the same type of distribution.
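The conjugate update can be checked numerically. The sketch below assumes the usual parameterization in which a beta distribution with parameters m0 and n0 has density proportional to a^(m0-1) (1-a)^(n0-1), so that the uniform prior is the case m0 = n0 = 1. It compares the closed-form posterior mean with a midpoint-rule integration of likelihood times prior; the function names and the counts m = 7, n = 3 are invented for illustration.

```python
from math import comb


def beta_posterior_params(m, n, m0, n0):
    """Parameters of the beta posterior after m successes and n failures."""
    return m + m0, n + n0


def beta_mean(alpha, beta):
    """Mean of a beta distribution with parameters alpha and beta."""
    return alpha / (alpha + beta)


def numerical_posterior_mean(m, n, prior_density, steps=10_000):
    """Posterior mean of a, integrating likelihood x prior by the midpoint rule."""
    h = 1.0 / steps
    numerator = 0.0
    evidence = 0.0
    for i in range(steps):
        a = (i + 0.5) * h
        # Binomial likelihood times prior density at this grid point.
        weight = comb(m + n, m) * a**m * (1 - a) ** n * prior_density(a)
        numerator += a * weight * h
        evidence += weight * h
    return numerator / evidence


m, n = 7, 3  # hypothetical counts of successes and failures
alpha, beta = beta_posterior_params(m, n, 1, 1)  # uniform prior
exact = beta_mean(alpha, beta)
approx = numerical_posterior_mean(m, n, lambda a: 1.0)
print(alpha, beta)    # 8 4
print(exact, approx)  # the two means agree closely
```

With a conjugate prior the update is a pair of additions, whereas the general case needs the integral to be evaluated numerically as in numerical_posterior_mean.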
References
Versions of the essay
- T. Bayes (1763), "An Essay towards solving a Problem in the Doctrine of Chances", Philosophical Transactions of the Royal Society of London, 53.
- T. Bayes (1763/1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:296-315 (Bayes's essay in modernized notation)
- T. Bayes "An essay towards solving a Problem in the Doctrine of Chances" (Bayes's essay in the original notation)
Commentaries
- G.A. Barnard. (1958) "Studies in the History of Probability and Statistics: IX. Thomas Bayes's Essay Towards Solving a Problem in the Doctrine of Chances", Biometrika 45:293-295 (biographical remarks)
- D. Covarrubias "An Essay Towards Solving a Problem in the Doctrine of Chances" (an outline and exposition of Bayes's essay)
- S.M. Stigler (1982) "Thomas Bayes' Bayesian Inference," Journal of the Royal Statistical Society, Series A, 145:250-258 (Stigler argues for a revised interpretation of the essay -- recommended)
- I. Todhunter (1865) A History of the Mathematical Theory of Probability from the time of Pascal to that of Laplace, Macmillan. Reprinted 1949, 1956 by Chelsea and 2001 by Thoemmes.
Additional material
- P.S. Laplace (1774) "Mémoire sur la Probabilité des Causes par les Événements," Savants Étrangers 6:621-656, also Oeuvres 8:27-65.
- P.S. Laplace (1774/1986), "Memoir on the Probability of the Causes of Events", Statistical Science, 1(3):364--378.
- S.M. Stigler (1986), "Laplace's 1774 memoir on inverse probability," Statistical Science, 1(3):359--378.
- Jeff Miller Earliest Known Uses of Some of the Words of Mathematics (B) (very informative -- recommended)
- A. Papoulis (1984), Probability, Random Variables, and Stochastic Processes, second edition. New York: McGraw-Hill.