What is Bayesian Inference?
In science, experiment is the ultimate criterion for determining the validity of a hypothesis. Rarely, however, can an experimental result be obtained that validates one hypothesis and rules out all others. More commonly, the experimental result strengthens the support for some hypotheses and weakens the support for others. As more data are collected, it becomes progressively easier to identify the most likely hypothesis. At some point, the evidence is considered strong enough to accept or reject a given hypothesis, even if certainty is not achieved.
How much evidence is enough?
Consider the following scenario. I prepare card decks in two different ways. Some decks, called "low decks," consist of ten aces, nine 2s, eight 3s, and so on, down to two 9s and one 10. Others, called "high decks," are prepared in the reverse fashion: one ace, two 2s, three 3s, and so on, up to nine 9s and ten 10s. Both decks thus have 55 cards. If you draw one card at random, can you tell whether it came from a high deck or a low deck?
Of course you cannot. Any card could come from either deck. But a 10 is much more likely (in fact ten times more likely) to come from a high deck than a low deck. So if you drew a 10, you would have evidence to support the high-deck hypothesis over the low-deck hypothesis. (see fig. 1)
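That ten-to-one likelihood ratio can be checked directly from the deck compositions. Here is a minimal Python sketch (the dictionary layout is mine, but the card counts follow the description above):

```python
# Card counts for each deck, as described above: rank r appears
# (11 - r) times in a low deck and r times in a high deck
# (an ace counts as rank 1).
low_deck  = {rank: 11 - rank for rank in range(1, 11)}  # ten aces ... one 10
high_deck = {rank: rank for rank in range(1, 11)}       # one ace ... ten 10s

total = sum(low_deck.values())            # 55 cards in either kind of deck
assert total == sum(high_deck.values()) == 55

# The 55s cancel, so the likelihood ratio for a drawn 10 is just
# the ratio of the card counts: ten 10s versus one.
ratio = high_deck[10] / low_deck[10]
print(ratio)                              # 10.0
```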
Can we quantify the relative likelihood of the two hypotheses after drawing the card? The answer is yes, and the method for doing so is known as Bayesian inference.
If there is one deck of each kind, and the dealer chooses a deck at random, then one can obviously conclude that drawing a 10 makes the high-deck hypothesis 10 times more likely than the low-deck hypothesis. There are eleven 10s altogether between the two decks, and only one of them comes from the low deck. We are just taking the probability that the high deck would produce the observed datum and comparing it with the probability that the low deck would produce the same datum.
But what if the dealer had three decks to choose from, two low and one high? Now there are two chances to draw a 10 from a low deck, not just one, so the datum does not so strongly support the high-deck hypothesis. The high-deck hypothesis is still the more likely of the two, but now by a factor of only 5, not 10.
If we keep increasing the number of low decks given to the dealer, we eventually reach a point (10 low decks, 1 high deck) where the hypotheses are equally likely, even when we draw a 10. If we give the dealer 100 low decks and only 1 high deck, we have the somewhat bemusing result that drawing a 10 still leaves a low deck the more likely source of the card! (see fig. 2)
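The break-even point and the reversal can both be verified by computing the posterior odds for an arbitrary mix of decks. A short Python sketch (the function name and the use of exact fractions are my choices):

```python
from fractions import Fraction

def posterior_odds_high(n_low, n_high=1):
    """Posterior odds of the high-deck hypothesis after drawing a 10,
    with n_low low decks and n_high high decks for the dealer to pick from."""
    p_d_high = Fraction(10, 55)                     # P(draw a 10 | high deck)
    p_d_low  = Fraction(1, 55)                      # P(draw a 10 | low deck)
    prior_high = Fraction(n_high, n_low + n_high)   # P(high deck chosen)
    prior_low  = Fraction(n_low,  n_low + n_high)   # P(low deck chosen)
    return (p_d_high * prior_high) / (p_d_low * prior_low)

print(posterior_odds_high(1))     # 10   -> high-deck hypothesis favored
print(posterior_odds_high(10))    # 1    -> the two hypotheses are even
print(posterior_odds_high(100))   # 1/10 -> a low deck is still 10x more likely
```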
This idea is expressed mathematically by Bayes's Theorem, P(H|D) ~ P(D|H)P(H): the probability of hypothesis H given data D (called the posterior probability) is proportional to the product of two factors. The first, P(D|H), is the probability of observing data D if hypothesis H is correct. The second, P(H), is the a priori (or prior) probability that the hypothesis is correct, before we collect the data. In our example, if H is the hypothesis that the dealer is holding a high deck and the datum is a drawn 10, then P(D|H) is the probability of drawing a 10 from a high deck, 10/55. P(H) is the prior probability of a high deck being selected by the dealer, equal to the number of high decks divided by the total number of decks prepared. If we say there is only one high deck out of 101 total, P(H) is 1/101. We can now make a formal calculation of the relative strength of the two hypotheses under these conditions. Hh will stand for the hypothesis that the dealer is using a high deck, and Hl for the hypothesis that she is using a low deck.
P(Hh|D)   P(D|Hh)P(Hh)   (10/55)(1/101)     10
------- = ------------ = --------------- = --- = 0.1
P(Hl|D)   P(D|Hl)P(Hl)   (1/55)(100/101)   100
Although the datum in this case does not lead us to favor the high-deck hypothesis over the low-deck hypothesis, it does have the effect of reducing our confidence in the low-deck hypothesis. Before drawing the card, we would have guessed that the dealer is 100 times more likely to be using a low deck; now we think she is only 10 times more likely to be using a low deck. If we return the card to the deck, shuffle, and draw a 10 a second time, the high-deck hypothesis will again be strengthened by a factor of 10, and the two hypotheses will be equally likely.
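Because the card is replaced and the deck reshuffled, the draws are independent, and each one multiplies the current odds by the likelihood ratio for the card drawn. A sketch of this sequential update in Python (the helper function is illustrative, not from the text):

```python
from fractions import Fraction

def update_odds(prior_odds_high, card):
    """Multiply the current odds of the high-deck hypothesis by the
    likelihood ratio for the observed card (rank 1..10)."""
    p_high = Fraction(card, 55)        # a card of rank r: r copies in a high deck,
    p_low  = Fraction(11 - card, 55)   # 11 - r copies in a low deck
    return prior_odds_high * p_high / p_low

odds = Fraction(1, 100)        # prior: 100 low decks to 1 high deck
odds = update_odds(odds, 10)   # first 10 drawn -> odds become 1/10
odds = update_odds(odds, 10)   # second 10 drawn -> odds become 1 (even)
print(odds)                    # 1
```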
The Presumptuous Prior
The example above is not terribly exciting; it is an obvious application of elementary probability theory. What has made Bayesian inference so interesting to scientists in recent decades is that it makes explicit something often neglected when experiments are used to test a hypothesis: the prior probability, P(H). Bayes's Theorem says that you cannot properly say whether a given hypothesis is to be accepted, based on a particular experimental result, unless you also consider the prior probability that the hypothesis is correct.
Depending on one's training, this statement can sound very strange. Isn't it unscientific to assume whether the hypothesis you are testing is likely or not? Shouldn't one be "objective" and consider only the data, not one's preconceptions about the relative likelihood of the competing hypotheses? After all, in scientific work we are rarely in the position to "count the number of decks" of each type before drawing a card . . . or a conclusion.
In fact, the traditional approach in many disciplines is not to use an explicit prior. Instead, one formulates a "null hypothesis," which is taken as a sort of default, and entertains alternative hypotheses only when the null hypothesis appears to be having trouble accounting for the data. The usual method is to say that a datum is "significant" if it would be produced by the null hypothesis in only a small fraction of cases (typically chosen arbitrarily at some level such as p = 0.05 or p = 0.01).
Such decision levels are certainly convenient, but they have no defensible basis unless the prior probabilities of the competing hypotheses can be estimated and are used to set the decision level. As we saw in the playing-card example, different prior probabilities lead directly to different posterior probabilities, and hence to different decision levels. Asserting that a particular hypothesis is favored if a datum departs from the null hypothesis at a given level of significance p is simply to assume that P(D|H)P(H)/P(H0) is equal to p, in effect making an arbitrary assumption about the ratio of priors, P(H)/P(H0). There is no reason to expect that p = 0.05, p = 0.01, or any such conventional decision level, will correspond to an accurate estimate of the prior probability of a hypothesis.
Bayesian inference, by making the estimation of the prior explicit, makes it possible to set the decision level for choosing between competing hypotheses at the optimum value. Consider again the playing-card example, using Hl as the null hypothesis. The probability of drawing a 10 if the null hypothesis is true is 1/55 = 0.018, a little less than 2%. If we choose a p = 0.05 significance level, drawing a 10 is always a significant result and will always cause us to choose Hh over Hl. We have better than even odds of making the correct decision only if there happen to be fewer than 10 low decks for every high deck. If we choose p = 0.01, then drawing a 10 is never significant, and we will always stay with the null hypothesis, Hl. In that case, we are likely to be wrong whenever there are fewer than 10 low decks for every high deck. Either way, we completely ignore the important information about how many decks of each kind are present. We also ignore P(D|Hh), the probability of drawing a 10 from a high deck. It should be clear that using an arbitrary significance level is a hit-or-miss affair unless the prior probabilities happen to fall in the right range.
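The argument can be made concrete by computing, for a given mix of decks, the posterior probability of the high deck after a 10 is drawn, and comparing it with what each fixed significance level would have us do. An illustrative Python sketch (the numbers follow the example above):

```python
from fractions import Fraction

def p_high_given_10(n_low, n_high=1):
    """Posterior probability that the deck is high, given that a 10 was drawn."""
    joint_high = Fraction(10, 55) * n_high   # the priors share a denominator,
    joint_low  = Fraction(1, 55) * n_low     # so it cancels in the ratio
    return joint_high / (joint_high + joint_low)

for n_low in (5, 10, 100):
    p = p_high_given_10(n_low)
    # The p = 0.05 rule: a 10 is always "significant," so always choose the
    # high deck; that choice is correct with probability p.
    # The p = 0.01 rule: a 10 is never significant, so always keep the low
    # deck; that choice is correct with probability 1 - p.
    print(n_low, p, "choose high" if p > Fraction(1, 2) else "keep low")
```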
In realistic situations, of course, estimating a prior probability may involve subjective judgement. Ideally, the data from many previous experiments can be used to estimate the prior. When previous data are sparse or inapplicable, assumptions must be made. Making such assumptions is equivalent to selecting an arbitrary decision level. The advantage of the Bayesian formulation is that it makes these assumptions explicit and easier to re-examine in the light of new information.
Bayesian Inference in Radiological Dose Assessment
At Los Alamos, Bayesian inference is being used to help determine whether individuals have experienced an intake of radioactive material. Urine is analyzed by thermal ionization mass spectrometry in order to measure the quantity of plutonium-239 and other species present in the body. Intakes are very rare, so that even a rather large measurement result is more likely to be due to experimental uncertainty than to an actual intake. The situation is analogous to the case where there are many low decks and very few high decks, so that even drawing a 9 or 10 does not necessarily lead us to accept the hypothesis that the dealer is using a high deck. Using conventional decision levels, such as p = 0.05, leads to a high number of "false positives." Bayesian methods address this problem by optimizing the decision level to minimize the total number of incorrect inferences.
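The core of such a calculation is the same odds computation as in the card example. The following is a deliberately simplified Python sketch, assuming a toy Gaussian measurement model and an illustrative intake rate; none of the numbers come from the actual Los Alamos program:

```python
import math

def p_intake(measurement, sigma=1.0, true_signal=3.0, prior_intake=0.001):
    """Posterior probability of an intake, for a toy model in which
    'no intake' yields a reading of 0 and an intake yields a reading of
    true_signal, both blurred by Gaussian measurement noise sigma.
    All parameter values here are illustrative, not real assay data."""
    def gauss(x, mu):
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))
    joint_intake = gauss(measurement, true_signal) * prior_intake
    joint_none   = gauss(measurement, 0.0) * (1 - prior_intake)
    return joint_intake / (joint_intake + joint_none)

# With intakes this rare, even a reading two sigma above zero still
# points to measurement noise rather than to a real intake.
print(round(p_intake(2.0), 3))
```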
The Bayesian approach is not only more accurate, it is more informative. Because we can explicitly calculate the ratio of the likelihood of the competing hypotheses, it is possible to make statements such as, "The chance that an intake occurred is 37%". This is much more useful than simply reporting that a measurement result was above or below an arbitrary decision level.
The superiority of Bayesian inference over the use of arbitrary significance levels has been appreciated for about two centuries. In realistic problems, however, where there are many potential hypotheses to consider, Bayesian analysis becomes computationally intensive. Recent advances in computer technology have sparked a renewed use of Bayesian methods, so that they have now become widespread in areas such as medical diagnosis and astronomical imaging. Los Alamos National Laboratory is pioneering the application of this important method for radiological dose assessment.
Fig. 1. The low deck has one 10; the high deck has ten 10s. If a 10 is drawn from one of these two decks, selected at random, the evidence strongly implies that the card came from the high deck.
Fig. 2. Here there are 20 decks present, and only one of them is a high deck. Consequently, most 10s now come from low decks rather than from the high deck. With many low decks present, drawing a 10 will no longer cause us to accept the hypothesis that the card came from the high deck, although it will make us less confident that it came from a low deck.