The Binomial Distribution

Bernoulli Trials

Suppose I do an experiment that has only two outcomes. For instance, suppose I toss a coin and determine whether it is heads or tails. Or I may sample individuals and determine whether they have a risk factor or not. Or I may toss a die and determine whether it lands 1 or not (although the die has 6 faces that could show up, we can think of it in terms of either landing 1 up, or not landing 1 up - two outcomes).
We need to be able to distinguish these outcomes. It is traditional to call one of them "success" and the other "failure". "Success" in this context need have no connotation of actual success - it is just a convenient label. So we could think of "success" as rolling a 1, or of finding the risk factor.
Suppose the probability of success is p. Then the probability of failure is 1-p.
An experiment which has two outcomes is often called a Bernoulli trial. So tossing a coin is a Bernoulli trial with success probability 1/2. We can think of the die experiment above as a Bernoulli trial with success probability 1/6. We could even treat a needlestick injury with Hepatitis C contaminated blood as a Bernoulli trial - with "success" probability 0.03 or so (again, the word "success" in this context is simply a connotation-free label.)
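To make the idea concrete, here is a minimal simulation sketch of the die experiment treated as a Bernoulli trial. The function name bernoulli_trial, the seed, and the number of repetitions are illustrative choices, not part of the notes.

```python
import random

def bernoulli_trial(p, rng):
    """Return True ("success") with probability p, otherwise False ("failure")."""
    return rng.random() < p

# Fixed seed (an arbitrary choice) so that the run is repeatable.
rng = random.Random(0)

# Repeat the "does the die land 1 up?" trial many times.
n_trials = 100_000
successes = sum(bernoulli_trial(1 / 6, rng) for _ in range(n_trials))
observed_fraction = successes / n_trials

# The observed fraction of successes should be close to 1/6.
print(observed_fraction, 1 / 6)
```

With many repetitions, the fraction of "successes" settles near the success probability p, which is what it means to say the trial has success probability 1/6.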

The Binomial distribution

If one undertakes N independent Bernoulli trials, then the binomial distribution tells you the probability of getting r successes in these N trials. So for instance, if 51 people experience a needlestick injury with Hepatitis C contaminated blood, we may suppose that this could be treated as 51 independent Bernoulli trials, and use the Binomial distribution to calculate the probability that 0 individuals were infected, 1 individual was infected, and so on all the way up to 51 individuals infected. (In the real world, other models for the needlestick injury may be appropriate; perhaps the probability of infection depends on other factors, such as the volume of blood on the device or the depth of injury.) Another example of the Binomial distribution would be to determine the probability of seeing 3 heads out of 5 tosses of a coin, or seeing a 1 four times if we toss a die 24 times.
We may also be interested in other outcomes. For instance, we may wish to know the probability that at most 3 people were infected by Hepatitis C. The phrase "at most" signifies the largest number; so saying that "at most 3 are infected" indicates that either 0, 1, 2, or 3 could have been infected. "At most 3" means that 3 is the largest number possible. A person who writes on a questionnaire that they exercise "at most once per week" may exercise once a week, or not at all. The binomial distribution can tell us directly the chance of getting 0 infections, and also the chance of getting 1 infection, 2 infections, and 3 infections. This gives us four separate numbers, one for each of these probabilities. If we want to know the probability that at most 3 people were infected, we observe that the event "at most three were infected" is the union of the events "no people were infected", "exactly one person was infected", "exactly two people were infected", and "exactly three people were infected". Moreover, if exactly two people were infected (say), then it is false that exactly three people were infected. In the same way, you can see that each of these four events is disjoint from the others. So the probability of the union of these four events is the sum of the probabilities of each of them. This is why you can just add the four probabilities together, and get the probability that at most three individuals were infected.
Another example is the probability that "at least 1 person was infected". The phrase "at least 1" means that 1 is the least possible number; saying "at least one person was infected" means that 1 person was infected, or 2, or 3, or any other number up to 51 (in this experiment). We could in fact calculate all these probabilities and add them up; this is sometimes the easiest thing to do. In this case, however, we could observe that the probability that at least one person was infected is one minus the probability that it is false that at least one person was infected. The event "it is false that at least one person was infected" is the same thing as the event "no people were infected". So we can just calculate this and subtract it from one. For the needlestick problem, it is easier to calculate a single probability and do the subtraction than it is to calculate 51 probabilities and add them together.
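The "at most" and "at least" reasoning can be checked on a small case by plain counting. The sketch below uses 5 tosses of a fair coin, listing all 32 equally likely sequences; the names sequences and prob_heads are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All 2**5 = 32 equally likely head/tail sequences for 5 tosses of a fair coin.
sequences = list(product("HT", repeat=5))

def prob_heads(k):
    """Probability of exactly k heads, found by counting sequences."""
    favourable = sum(1 for seq in sequences if seq.count("H") == k)
    return Fraction(favourable, len(sequences))

# "At most 3 heads" is the union of the disjoint events 0, 1, 2, or 3 heads,
# so its probability is the sum of the four separate probabilities.
at_most_3 = sum(prob_heads(k) for k in range(0, 4))

# "At least 1 head": either add up the cases 1 through 5 ...
at_least_1_by_sum = sum(prob_heads(k) for k in range(1, 6))
# ... or take one minus the probability of no heads at all.
at_least_1_by_complement = 1 - prob_heads(0)

print(at_most_3)                                    # 13/16
print(at_least_1_by_sum, at_least_1_by_complement)  # 31/32 31/32
```

Both routes to "at least 1" give the same answer; the complement needs only one probability instead of five.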

The binomial formula

Next, we will learn how to actually compute these probabilities. Let us start with the needlestick example again. We continue to suppose that each needlestick event is independent of the others, and that they all lead to the same infection probabilities (so we continue to treat these as independent Bernoulli trials). Suppose the infection probability is 3%.
Now, what is the probability that no one was infected? The probability that the first person is not infected is 97%. The probability that the second person is not infected is also 97%. And so on; all the probabilities are 97%. Let's ask what the chance is that both the first and the second persons were not infected. Because the events are independent, the probabilities can be multiplied; the probability is then 0.97 times 0.97, which is 0.9409. What is the chance that the first three are not infected? This is 0.97 times 0.97 times 0.97, which is about 91%. What is the chance that all 51 people were not infected? It is what you get when you multiply together 51 factors of 0.97, or 0.97 to the 51st power; this number is about 21.2%.
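The repeated multiplication is easy to check by machine. This is a minimal sketch assuming the 3% infection probability used here; the names p_infect, p_safe, and prob_none_infected are illustrative.

```python
p_infect = 0.03          # assumed infection probability per needlestick
p_safe = 1 - p_infect    # probability that one exposed person is NOT infected

def prob_none_infected(n):
    """Probability that none of n independent exposures leads to infection."""
    return p_safe ** n

# Two people, three people, and all 51 people escaping infection.
for n in (2, 3, 51):
    print(n, round(prob_none_infected(n), 4))
```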
     Let's take another example. What is the chance that if we toss a fair die 4 times, we get NO occurrences of a 1? We'll treat this as 4 independent Bernoulli trials. The chance that the first die does not show a one is 5/6, the chance that the second die does not show a one is 5/6, and so forth. Because the rolls are independent, we can multiply these together to find the probability that all the rolls failed to show a one (which is the same as the probability that none of them showed a one, and the same as the probability that each time we tossed the die, we got a 2, 3, 4, 5, or 6). This is 5/6 times 5/6 times 5/6 times 5/6, which equals about 48.23%.
     But now, what if we wanted to know the probability that on the first toss we got a one, but then got no ones after that? The probability of getting a one on the first roll is 1/6, the probability of not getting a one on the second roll is 5/6, of not getting a one on the third roll is 5/6, and of not getting a one on the fourth roll is also 5/6. By independence, we can multiply all these together, and we learn that the probability we are looking for is about 9.65%.
     We may also ask what the probability is that we do not get a one on the first toss, but we do get a one on the second toss, and we don't get a one on the final two tosses. Here we find that the probability is 5/6 times 1/6 times 5/6 times 5/6, using the same reasoning as in the previous paragraph. The factors are the same as before, only in a different order, so the answer is the same.
     What if we want to know what the chance is that we get a one on any one of the four tosses, but not on the other three? In other words, what is the probability that we get exactly one showing of a one in four throws? We could get the one on the first try (and not on the others), or on the second try (and not on the others), on the third try (and not on the others), or on the fourth try (but not on the others). So the event "we get exactly one showing of a one" can be written as the union of four events, the event that we get the one on the first throw but not the others, etc. And if we get the one on the first throw but not the others, then we could not have gotten it on the second throw. These four events are mutually exclusive or mutually disjoint. So we can calculate the chance of getting a one exactly once by adding up the probabilities of each of the four ways we could have gotten a one. Since each of these probabilities is the same, we can multiply by four, since four is the number of ways to place the single one among the four tosses. This happens to be about 38.58%.
     What if we want to know what the chance is that we get a one on exactly two of the four tosses? We can calculate the probability that we get a one on the first toss, on the second toss, not on the third toss, and not on the fourth toss; this is about 1.93%. There are five other ways to choose the tries on which we get the ones, and each of these orderings has the same probability. For instance, we might have got a one on the first try, not on the second, not on the third, but seen another one on the fourth try. The chance we would have seen this pattern is 1/6 times 5/6 times 5/6 times 1/6, which is the same number. All we need to do is see that if two ones showed up, we have to multiply the chance of getting a one (which is 1/6) by itself two times. We also got a non-1 two times, and so we have to multiply the probability of seeing a non-1 on some trial (which is 5/6) by itself two times. Then we multiply these together, and this gives us the chance of seeing any particular pattern of two 1's and two non-1's. But we don't want the chance of seeing one particular pattern. We want the chance of seeing any one of the six patterns that have exactly two 1's. These patterns are mutually exclusive and all have the same probability, so we can figure out what this probability is and multiply it by 6. This winds up to be about 11.57%.
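The die calculations can be checked by listing every pattern of ones ("S") and non-ones ("F") in four tosses and adding up the pattern probabilities. This is only a brute-force sketch; the function names pattern_probability and prob_exactly are made up for the illustration.

```python
from fractions import Fraction
from itertools import product

p_one = Fraction(1, 6)    # chance a fair die shows a 1
p_other = 1 - p_one       # chance it shows 2, 3, 4, 5, or 6

def pattern_probability(pattern):
    """Multiply one factor per toss: 1/6 for a one ("S"), 5/6 otherwise ("F")."""
    prob = Fraction(1)
    for outcome in pattern:
        prob *= p_one if outcome == "S" else p_other
    return prob

def prob_exactly(k, tosses=4):
    """Return (number of patterns with exactly k ones, their total probability)."""
    patterns = [pat for pat in product("SF", repeat=tosses) if pat.count("S") == k]
    return len(patterns), sum(pattern_probability(pat) for pat in patterns)

for k in (0, 1, 2):
    ways, prob = prob_exactly(k)
    print(k, ways, prob, round(float(prob), 4))
```

The pattern counts come out as 1, 4, and 6, and the totals agree with the percentages worked out by hand.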
     Now let's go back to the needlestick example. What if we want to know the chance that exactly one person got infected? We know that we can find the chance that the first person was infected and none of the others were by multiplying 0.03 (that's the chance the first person was infected) by 0.97 (the chance the second person was not), and then by 0.97 again for the third person, and so on down to the 51st person. We multiply together a single factor of 0.03, and fifty factors of 0.97. This happens to be about 0.00654, or about six tenths of one percent. But again, we don't just want the chance of the first person being infected; if only one person is infected, it could have been any of the 51. There are 51 different ways for exactly one person to be infected, so we must multiply the probability of any one ordering by the number of orderings, and this gives us 33.36%. So if the chance of Hepatitis C infection following a needlestick injury with contaminated blood is 3%, and 51 people are exposed, this is the chance that exactly one of the 51 exposed people is infected.
     But what if we want the chance that exactly two people were infected? It could have been the first and the second people who got infected, or the second and the forty-third, and so forth. Let's determine the chance that the first and second people got infected but not the others. This is going to be 0.03, times 0.03, times 49 factors of 0.97. Every person who gets infected gives us a factor of 0.03 (the chance of an infection in one trial), and every person who does not get infected gives us a factor of 0.97 (so we get 49 such factors, one for each of the 51 minus 2 who don't get infected). The chance that the second and forty-third people get infected (and no one else) is the same as the chance that the first and second get infected (and no one else), and so on. This probability is 2.02 times ten to the minus fourth power. Each of the possible ways to get the two infected people has the same probability; if we knew how many ways there are to choose the two infected people out of the 51 possibilities, we could multiply by this number. By definition, this is called the number of combinations of 51 items taken 2 at a time.
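The probability of one specific pattern of infections depends only on how many people are infected. A minimal sketch, assuming the 3% infection probability and 51 exposures used in this example (the names below are illustrative):

```python
P_INFECT = 0.03   # per-exposure infection probability assumed in the text
N_EXPOSED = 51    # number of exposed people

def one_pattern_probability(n_infected, n_total=N_EXPOSED, p=P_INFECT):
    """Chance that one specific set of people is infected and everyone else is not."""
    return p ** n_infected * (1 - p) ** (n_total - n_infected)

# First person infected, the other fifty not.
first_only = one_pattern_probability(1)
# Any one of the 51 people could be the single infected person.
exactly_one = N_EXPOSED * first_only
# First and second infected, the other 49 not.
first_and_second_only = one_pattern_probability(2)

print(round(first_only, 5), round(exactly_one, 4), first_and_second_only)
```

What is still missing for the two-infection case is the number of ways to pick the two infected people, which is the counting problem taken up next.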

Combinations

     Let's determine what the number of combinations of N things taken r at a time is.
     It will be useful to consider permutations - in which order matters - first. Let's start with 4 things, say A, B, C, and D and choose two of them. Let's choose the first one; we have 4 choices: A, B, C, or D. If we chose A, then we have to choose the second object, and we still have three objects (B, C, or D) to choose from. So we might pick AB, AC, or AD. And if we chose the B first, we still have the A, C, or D to choose from; so we might have wound up with BA, BC, or BD. Similarly, we could get CA, CB, or CD, or DA, DB, or DC. So this led to 12 permutations of 4 things taken 2 at a time; they were AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC. We got both AB and BA; there are two ways to arrange the two objects A and B and we got them both.
     Now let's try the 4 things and choose 3 of them. We have four choices for the first one. Once we choose the first one, we have three choices for the second one, and then after this, we have two choices for the third. So for instance, if we chose the B for the first slot, we still have the three items A, C, and D from which to choose the second one; if we then pick the D for the second one, we still have the A and the C left from which to choose the third slot. So there are 4 times 3 times 2 of these ordered choices. They are ABC, ABD, ACB, ACD, ADB, ADC, BAC, BAD, BCA, BCD, BDA, BDC, CAB, CAD, CBA, CBD, CDA, CDB, DAB, DAC, DBA, DBC, DCA, and DCB. Notice that we got ABC, ACB, BAC, BCA, CAB, and CBA. There are 6 ways to arrange the three objects A, B, and C and we got all six.
     If we started with N objects, and we want to choose r of them, then we have N things from which to choose the first. We have N-1 things from which to choose the second. We have N-2 left from which to choose the third, N-3 from which to choose the fourth, and N-(r-1) from which to choose the rth. The total number of these arrangements is N times N-1 times N-2 times all the numbers all the way down to N-(r-1). When we've filled up all the r slots, there are N-r left.
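The slot-by-slot counting can be checked against an explicit listing. The sketch below uses Python's itertools.permutations to list the ordered choices; count_ordered_choices is a hypothetical helper written just for this illustration.

```python
from itertools import permutations

def count_ordered_choices(n, r):
    """N * (N-1) * ... * (N-(r-1)): one factor for each of the r slots."""
    total = 1
    for slot in range(r):
        total *= n - slot   # choices remaining when this slot is filled
    return total

# List the ordered choices of 2 and of 3 letters out of A, B, C, D.
two_of_four = ["".join(p) for p in permutations("ABCD", 2)]
three_of_four = ["".join(p) for p in permutations("ABCD", 3)]

print(two_of_four)
print(len(two_of_four), count_ordered_choices(4, 2))      # 12 12
print(len(three_of_four), count_ordered_choices(4, 3))    # 24 24
```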
     What if we started with 4 objects and chose all 4 of them (keeping track of the order)? We just did the case of choosing three a paragraph ago. Once we've chosen three, we have only one choice left for the fourth. So we get these 24 arrangements: ABCD, ABDC, ACBD, ACDB, ADBC, ADCB, BACD, BADC, BCAD, BCDA, BDAC, BDCA, CABD, CADB, CBAD, CBDA, CDAB, CDBA, DABC, DACB, DBAC, DBCA, DCAB, and DCBA. There are 4 times 3 times 2 times 1 different orders of these 4 objects.
     What if we started with N objects and chose them all (keeping track of the order)? How many different orders are there? Following the same logic, we find N choices for the first one, N-1 for the second object, and all the way down to 1 choice for the last one. To find out how many, we multiply N times N-1 times N-2 times all the numbers on down to 1. This comes up a lot, and has a name: N! means N times N-1 times N-2 times all the numbers on down to 1; it is read "N factorial".
     Now, we know the number of ordered ways to choose r objects out of a total of N was found by taking N, multiplying it by N-1, and so on down to N-(r-1). This product is the beginning of N!; what is missing are the remaining factors N-r, then N-(r+1) (the number one less than N-r), and so on down to 1. The product of those missing factors is (N-r)! (the number N-r, factorial). So we can multiply and divide by (N-r)! without changing anything, and we find that the number of permutations of N things taken r at a time is N!/(N-r)!. For instance, with N=4 and r=2 this is 4!/2!, which is 24/2 = 12, matching the 12 permutations we listed.
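As a quick check of the N!/(N-r)! shortcut, here is a small sketch using Python's math.factorial, compared against the built-in math.perm (available in Python 3.8 and later). The function name permutations_count is illustrative.

```python
from math import factorial, perm

def permutations_count(n, r):
    """Number of ordered ways to choose r of n objects: N! / (N-r)!."""
    return factorial(n) // factorial(n - r)

print(factorial(4))                            # 24 orders of A, B, C, D
print(permutations_count(4, 2), perm(4, 2))    # 12 12
print(permutations_count(4, 3), perm(4, 3))    # 24 24
```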
     Now, what if we just want to consider the number of different combinations of objects we might have gotten, regardless of the order? For instance, we calculated the number of ordered ways to choose 3 objects out of a total of 4 objects, and found 24 such ways. But ignoring order, there are only four combinations: ABC, ABD, ACD, and BCD. When we considered the ordering, each of these got counted six times, because there are six different ways to order 3 things. There are six ways to order 3 things, because we have 3 ways to choose the first one, 3-1=2 ways to choose the second, and one way to choose the last; so there are 3 times 2 times 1 which is six different orders. So there are 24 ordered arrangements, and we divide by six to get a total of 4 combinations of 4 things taken 3 at a time. If we want to know how many combinations of N objects taken r at a time there are, we can first figure out how many permutations of N objects taken r at a time there are, and divide by the number of orders of the r objects. But we already know how many ways to order the r objects: this is r times r-1 times all the numbers down to 1, which is r!. So the number of combinations of N things taken r at a time is N!/(r! (N-r)!). This is called "N choose r" and is written like this:

$$\binom{N}{r} = \frac{N!}{r!\,(N-r)!}$$
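The "N choose r" count can be checked in the same way. This sketch lists the combinations with itertools.combinations and compares the factorial formula with Python's built-in math.comb (Python 3.8 and later); n_choose_r is a name chosen for the illustration.

```python
from itertools import combinations
from math import comb, factorial

def n_choose_r(n, r):
    """Number of combinations of n things taken r at a time: N! / (r! (N-r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

# Unordered choices of 3 letters out of A, B, C, D.
three_of_four = ["".join(c) for c in combinations("ABCD", 3)]

print(three_of_four)                    # ['ABC', 'ABD', 'ACD', 'BCD']
print(n_choose_r(4, 3), comb(4, 3))     # 4 4
print(n_choose_r(4, 2), comb(4, 2))     # 6 6
```

The value for 4 choose 2 is 6, the same number of patterns that turned up when counting the ways to get two ones in four tosses of the die.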

The Binomial Probability Distribution

Returning to the binomial probability distribution, we were in the middle of calculating the chance that exactly two people out of the 51 would become infected after the needlestick injury. We figured out the probability that the first and the second would get infected (but not the others), which was the same as the probability that the second and forty-third would get infected, etc. We just need to know how many ways there are to choose these two infected people out of the 51 possibilities; this is 51 choose 2, which is equal to 51!/(2! 49!). This equals 1,275 (how would you like to have found this by counting them all?!). Multiplying this by the probability of any one ordering (which was 0.03 multiplied twice, once for each of the 2 infected people, and 0.97 multiplied 49 times, once for each of the 49 uninfected people) gives us 25.8%, which is the chance that we would see exactly 2 infections.
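The arithmetic for exactly two infections can be reproduced in a few lines. This is a sketch under the same assumptions as the example (51 independent exposures, 3% infection probability); the variable names are illustrative.

```python
from math import comb

# Number of ways to choose which 2 of the 51 people are infected.
ways = comb(51, 2)

# Probability of any one specific pair being infected and the other 49 not.
one_pattern = 0.03 ** 2 * 0.97 ** 49

# The patterns are mutually exclusive and equally likely, so multiply.
exactly_two = ways * one_pattern

print(ways, round(exactly_two, 3))   # 1275 0.258
```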
Finally, we can calculate the probability of getting r successes out of N independent Bernoulli trials. First, there are

$$\binom{N}{r} = \frac{N!}{r!\,(N-r)!}$$

ways to choose the r successes out of the N trials. We have r factors of p for the successes, and (N-r) factors of (1-p) for the failures. Multiplying these together, the probability of exactly r successes is

$$P(r \text{ successes in } N \text{ trials}) = \binom{N}{r}\, p^{r} (1-p)^{N-r}$$

And that is all there is to the binomial formula. This is only one of two formulas we will really discuss in detail in this class; we will build a great deal on it.
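Putting the pieces together, the binomial formula takes only a few lines of code. The sketch below is one possible way to write it, using Python's math.comb for "N choose r"; the function name binomial_pmf is a made-up label, and the needlestick figures assume the 3% infection probability and 51 exposures from the example.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent Bernoulli trials."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Needlestick example: probabilities of 0, 1, ..., 51 infections.
needlestick = [binomial_pmf(r, 51, 0.03) for r in range(52)]

# "At most 3": add the disjoint cases 0, 1, 2, and 3 infections.
at_most_3 = sum(needlestick[:4])
# "At least 1": one minus the probability of no infections.
at_least_1 = 1 - needlestick[0]

# Coin example: 3 heads out of 5 tosses of a fair coin.
three_heads_of_five = binomial_pmf(3, 5, 0.5)

print(round(needlestick[1], 4), round(needlestick[2], 4))
print(round(at_most_3, 4), round(at_least_1, 4))
print(three_heads_of_five)
```

Since the 52 possible counts cover every outcome and are mutually exclusive, the list needlestick should add up to 1 (up to floating-point rounding), which is a handy sanity check.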
