The Binomial Distribution
Bernoulli Trials
Suppose I do an experiment that has only two outcomes. For instance,
suppose I toss a coin and determine whether it is heads or tails. Or I
may sample individuals and determine whether they have a risk factor or
not. Or I may toss a die and determine whether it lands 1 or not (although
the die has 6 faces that could show up, we can think of it in terms of
either landing 1 up, or not landing 1 up - two outcomes.)
We need to be able to distinguish these outcomes. It is traditional to
call one of them "success" and the other "failure". "Success" in this
context need have no connotation of actual success - it is just a convenient
label. So we could think of "success" as rolling a 1, or of finding
the risk factor.
Suppose the probability of success is p. Then the probability
of failure is 1-p.
An experiment which has two outcomes is often called a Bernoulli trial.
So tossing a coin is a Bernoulli trial with success probability 1/2.
We can think of the die experiment above as a Bernoulli trial with
success probability 1/6. We could even treat a needlestick injury
with Hepatitis C contaminated blood as a Bernoulli trial -
with "success" probability 0.03 or so (again, the word "success" in this
context is simply a connotation-free label.)
The Binomial distribution
If one undertakes N independent Bernoulli trials, then
the binomial distribution tells you the probability of getting r
successes in these N trials. So for instance, if 51 people
experience a needlestick injury with Hepatitis C contaminated blood, we
may suppose that this could be treated as 51 independent Bernoulli trials,
and use the Binomial distribution to calculate the probability that 0
individuals were infected, 1 individual was infected, and so on all
the way up to 51 individuals infected. (In the real world, other models
for the needlestick injury may be appropriate; perhaps the probability
of infection depends on other factors, such as the volume of blood on
the device or the depth of injury.) Another example of the Binomial
distribution would be to determine the probability of seeing 3 heads
out of 5 tosses of a coin, or seeing a 1 four times if we toss a die
24 times.
We may also be interested in other outcomes. For instance, we may wish
to know the probability that at most 3 people were infected
by Hepatitis C. The phrase at most signifies the largest
number; so saying that "at most 3 are infected" indicates that either
0, 1, 2, or 3 could have been infected. "At most 3" means that 3 is
the "most" number possible. A person who writes on a
questionnaire that they exercise "at most once per week" may exercise
once a week, or not at all. The binomial distribution can tell us
directly the chance of getting 0 infections, and also the chance of getting
1 infection, 2 infections, and 3 infections. This gives us four separate
numbers, one for each of these probabilities. If we want to know the
probability that at most 3 people were infected, we observe that the
event "at most three were infected" is the union of the events
"no people were infected", "exactly one person was infected",
"exactly two people were infected",
and "exactly three people were infected". Moreover, if exactly
two people were infected (say), then it is false that exactly three
people were infected. You can see that each of these four events is
disjoint from the others. So the probability of the union of these four
events is the sum of the probabilities of each of them. This is why
you can just add the four probabilities together, and get the
probability that at most three individuals were infected.
Another example is the probability that "at least 1 person was infected".
The phrase "at least 1" means that 1 is the least possible number; saying
"at least one person was infected" means that 1 person was infected, or
2, or 3, or any other number up to 51 (in this experiment). We could
in fact calculate all these probabilities and add them up; this is sometimes
the easiest thing to do. In this case, we could observe that the
probability that at least one person was infected is one minus the
probability that it is false that at least one person was infected. The
event "it is false that at least one person was infected" is the same
thing as the event "no people were infected". So we can just calculate
this and subtract it from one. For the needlestick problem, it is easier
to calculate a single probability and do the subtraction than it is
to calculate 51 probabilities and add them together.
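In symbols, the two rules just described can be written as follows. Here X stands for the number of infected people; that label is my own shorthand and is not used elsewhere in these notes.

```latex
% "At most 3": a union of disjoint events, so the probabilities add.
P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)

% "At least 1": the complement of "no people were infected".
P(X \ge 1) = 1 - P(X = 0)
```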
The binomial formula
Next, we will learn how to actually compute these probabilities. Let us
start with the needlestick example again. We continue to suppose that
each needlestick event is independent of the others, and
that they all lead to the same infection probabilities (so we continue
to treat these as independent Bernoulli trials). Suppose the infection
probability is 3%.
Now, what is the probability that no one was infected? The probability
that the first person is not infected is 97%. The probability that
the second person is not infected is also 97%. And so on; all the
probabilities are 97%. Let's ask what the chance is that both the first
and the second persons were not infected. Because the events are
independent, the probabilities can be multiplied; the probability is then
0.97 times 0.97, which is 0.9409. What is the chance that the
first three are not infected? This is 0.97 times 0.97 times 0.97, which
is about 91%. What is the chance that all 51 people were not
infected? It is what you get when you multiply 0.97 by itself 51 times,
or 0.97 to the 51st power; this number is about 21.2%.
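The arithmetic in this paragraph is easy to check by machine. Here is a minimal Python sketch; the variable names are my own, and it assumes the 3% infection probability and the independence described above.

```python
# Assumed inputs, taken from the needlestick example in the text.
p_infect = 0.03   # chance of infection in a single exposure
n_people = 51     # number of independent exposures

# Chance that any one person is NOT infected.
p_safe = 1 - p_infect

# By independence, multiply p_safe by itself once per person.
p_none = p_safe ** n_people

# Complement rule: "at least one infected" is the opposite of "none infected".
p_at_least_one = 1 - p_none

print(f"P(no one infected)       = {p_none:.4f}")
print(f"P(at least one infected) = {p_at_least_one:.4f}")
```

The last step uses the subtraction trick described earlier for "at least 1", so only one power has to be computed.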
Let's take another example. What is the chance that if we toss a fair
die 4 times, we get NO occurrences of a 1? We'll treat this as 4
independent Bernoulli trials. The chance that the first die does
not show a one is 5/6, the chance that the second die does not show
a one is 5/6, and so forth; because the rolls are independent, we
can find the probability that all the rolls failed to show a one (which
is the same as the probability that none of them showed a one, and
the same as the probability that each time we tossed the die, we got
a 2, 3, 4, 5, or 6.) This equals about 48.23%.
But now, what if we wanted to know the probability that on the
first toss we got a one, but then got no ones after that? The
probability of getting a one on the first roll is 1/6, the
probability of not getting a one on the second roll is 5/6, of
not getting a one on the third roll is 5/6, and of not getting a
one on the fourth roll is also 5/6. By independence, we can multiply all
these together, and we learn that the probability we are looking for
is about 9.64%.
We may also ask what the probability is that we do not get a one on the
first toss, but we do get a one on the second toss, and we don't get
a one on the final two tosses. Here we find that the probability is
5/6 times 1/6 times 5/6 times 5/6, using the same reasoning as in
the previous paragraph.
What if we want to know what the chance is that we get a one on any
of the four tosses, but not on the other three? In other words, what
is the probability that we get exactly one show of one spot on the
die in four throws? We could get the one on the first try (and not
on the others), or on the second try (and not on the others), on the
third try (and not on the others), or on the fourth try (but not on
the others). So the event "we get exactly one showing of one spot"
can be written as the union of four events, the event that we get
the one on the first throw but not the others, etc. And if we get
the one on the first throw but not the others, then we could not have
gotten it on the second throw. These four events are mutually exclusive
or mutually disjoint. So we can calculate the chance of getting a one
exactly once by adding up the probabilities of each of the four
ways we could have gotten a one. Since each of these probabilities is
the same, we can multiply one of them by four, since four is the number
of ways to choose the toss on which the one appears. This happens to
be about 38.58%.
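One way to convince yourself of these die calculations is to list every possible sequence of four rolls and count. The following is a brute-force sketch in Python, not the method used in these notes; the helper name `prob` is my own, and it assumes a fair die so that all 1296 sequences are equally likely.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)                         # a fair six-sided die
outcomes = list(product(faces, repeat=4))   # all 6**4 = 1296 sequences of four rolls

def prob(event):
    """Probability of an event: favourable sequences divided by all sequences."""
    favourable = sum(1 for rolls in outcomes if event(rolls))
    return Fraction(favourable, len(outcomes))

# No ones at all in four rolls.
p_no_ones = prob(lambda rolls: rolls.count(1) == 0)

# A one on the first roll and on no other roll.
p_one_first_only = prob(lambda rolls: rolls[0] == 1 and rolls.count(1) == 1)

# Exactly one 1 somewhere in the four rolls.
p_exactly_one = prob(lambda rolls: rolls.count(1) == 1)

for label, value in [("no ones", p_no_ones),
                     ("one on first roll only", p_one_first_only),
                     ("exactly one 1", p_exactly_one)]:
    print(f"{label}: {value} = {float(value):.4f}")
```

Counting gives the same answers as multiplying: the "exactly one 1" probability is four times the "one on first roll only" probability, because the four positions are disjoint and equally likely.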
What if we want to know what the chance is that we get a one on exactly
two of the four tosses? We can calculate the probability that we get
a one on the first toss, on the second toss, not on the third toss, and
not on the fourth toss; this is about 1.93%. There are five other
ways to choose the tries on which we get the one. Each of these
orderings has the same probability. In other words, we might have got
the one on the first try, not on the second, not on the third, but
seen another one on the fourth try. The chance we would have seen this
pattern is the same 1/6 times 5/6 times 5/6 times 1/6. All we need to
do is see that if two ones showed up, we have to multiply
the chance of getting a one (which is 1/6) by itself two times. We also
get a non-1 two times, and so we have to multiply the probability that
we see a non-1 on some trial (which is 5/6) by itself two times. Then
we multiply these together, and this gives us the chance of seeing any
particular pattern of two 1's and two non-1's. But we don't want the chance
of seeing one particular pattern. We want the chance of seeing any
of the six patterns that have exactly two 1's. These patterns are mutually
exclusive and all have the same probability, so we can figure out what this
probability is and multiply it by 6. This winds up to be about 11.57%.
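The six patterns can be listed explicitly. Here is a small Python sketch under the same fair-die assumption; the "1"/"x" notation for a pattern is my own, with "x" meaning any face other than 1.

```python
from fractions import Fraction
from itertools import combinations

p_one = Fraction(1, 6)     # chance of a 1 on a single toss
p_other = Fraction(5, 6)   # chance of anything else
n_tosses = 4

# Pick the two toss positions (0..3) that show a 1; the rest show a non-1.
patterns = []
for positions in combinations(range(n_tosses), 2):
    pattern = "".join("1" if i in positions else "x" for i in range(n_tosses))
    patterns.append(pattern)

# Every pattern has two factors of 1/6 and two factors of 5/6.
p_single_pattern = p_one**2 * p_other**2

# The patterns are disjoint, so their probabilities add.
p_exactly_two = len(patterns) * p_single_pattern

print(patterns)
print(f"one pattern: {float(p_single_pattern):.4f}")
print(f"exactly two 1's: {float(p_exactly_two):.4f}")
```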
Now let's go back to the needlestick example. What if we want to know
the chance that exactly one person got infected? We know that we can
find the chance that the first person was infected and none of the others
were by multiplying 0.03 (that's the chance the first person was infected) by
0.97 (the chance the second person was not), and then by 0.97 again
for the third person, and so on down to the 51st person. We multiply
together a single factor of 0.03, and fifty factors of 0.97. This happens
to be about 0.00654, or about six tenths of one percent. But again, we
don't just want the chance of the first person being infected;
if only one person is infected, it could have been any of the 51. There
are 51 different ways for exactly one person to be infected, so we must
multiply the probability of any one ordering by the number of orderings,
and this gives us 33.36%. This is the chance that exactly one of the 51
exposed people is infected, if the chance of Hepatitis C infection
following a needlestick injury with contaminated blood is 3% and the
exposures are independent.
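The same reasoning can be spelled out as a loop: for each of the 51 people, compute the chance that this person alone is infected, then add the 51 results. This is only an illustrative sketch with names of my own choosing; it assumes the 3% probability and independence used throughout.

```python
p_infect = 0.03   # assumed chance of infection per exposure
n_people = 51     # number of exposed people

p_exactly_one = 0.0
for infected in range(n_people):
    # Probability that person `infected` is infected and nobody else is:
    # one factor of 0.03 and fifty factors of 0.97.
    p_this_ordering = 1.0
    for person in range(n_people):
        p_this_ordering *= p_infect if person == infected else (1 - p_infect)
    # The 51 cases are disjoint, so their probabilities add.
    p_exactly_one += p_this_ordering

print(f"one particular person only: {p_this_ordering:.5f}")
print(f"exactly one of the 51:      {p_exactly_one:.4f}")
```

Since every pass through the outer loop produces the same number, the loop is just a long-winded way of multiplying by 51.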
But what if we want the chance that exactly two people were infected? It
could have been the first and the second people who got infected, or
the second and the forty-third, and so forth. Let's determine the chance
that the first and second people got infected but not the others. This
is going to be 0.03, times 0.03, times 0.97 (49 times). Every person
who gets infected gives us a factor of 0.03 (the chance of an infection
in one trial), and every person who does not get infected gives us a factor
of 0.97 (so we get 49 such factors, one for each of the 51 minus 2 who
don't get infected.) The chance that the second and forty-third people
get infected (and no one else) is the same as the chance that the
first and second get infected (and no one else) and so on. This probability
is 2.02 times ten to the minus fourth power. Each of the possible ways
to get the two infected people has the same probability; if we knew how
many ways to choose the two infected people out of the 51 possibilities,
we could multiply by this number. By definition, this is called
the number of combinations of 51 items taken 2 at a time.
Combinations
Let's determine what the number of combinations of N things taken
r at a time is.
It will be useful to consider permutations - in which order
matters - first.
Let's start with 4 things, say A, B, C, and D and choose two of them.
Let's choose the first
one; we have 4 choices: A, B, C, or D. If we chose A, then we have
to choose the second object, and we still have three objects (B, C, or D)
to choose from. So we might pick AB, AC, or AD. And if we chose the
B first, we still have the A, C, or D to choose from; so we might have
wound up with BA, BC, or BD. Similarly, we could get CA, CB, or CD, or
DA, DB, or DC. So this led to 12 permutations of 4 things taken 2 at
a time; they were AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC.
We got both AB and BA; there are two ways to arrange the two objects
A and B and we got them both.
Now let's try the 4 things and choose 3 of them. We have four choices
for the first one. Once we choose the first one, we have three choices
for the second one, and then after this, we have two choices for the
third. So for instance, if we chose the B for the first slot, we still
have the three items A, C, and D from which to choose the second one;
if we then pick the D for the second one, we still have the A and the C
left from which to choose the third slot. So there are 4 times 3 times
2 of these ordered choices. They are ABC, ABD, ACB, ACD, ADB, ADC,
BAC, BAD, BCA, BCD, BDA, BDC, CAB, CAD, CBA, CBD, CDA, CDB, DAB, DAC,
DBA, DBC, DCA, and DCB.
Notice that we got ABC, ACB, BAC, BCA, CAB, and
CBA. There are 6 ways to arrange the three objects A, B, and C and
we got all six.
If we started with N objects, and we want to choose r of them,
then we have N things from which to choose the first. We have N-1
things from which to choose the second. We have N-2 left from which to
choose the third, N-3 from which to choose the fourth, and
N-(r-1) from which to choose the rth. The total number of these
arrangements is N times N-1 times N-2 times all the numbers all the way
down to N-(r-1). When we've filled up all the r slots, there are N-r left.
What if we started with 4 objects and chose all 4 of them (keeping track
of the order)? We just did the case of choosing three a paragraph ago.
Once we've chosen three, we have only one choice left for the fourth.
So we get these 24 arrangements:
ABCD, ABDC, ACBD, ACDB, ADBC, ADCB,
BACD, BADC, BCAD, BCDA, BDAC, BDCA, CABD, CADB, CBAD, CBDA, CDAB,
CDBA, DABC, DACB, DBAC, DBCA, DCAB, and DCBA.
There are 4 times 3 times 2 times 1 different orders of these 4 objects.
What if we started with N objects and chose them all (keeping track of the
order)? How many different orders are there? Following the same logic, we find
N choices for the first one, N-1 for the second object, and all the
way down to 1 choice for the last one. To find out how many, we
multiply N times N-1 times N-2 times all the numbers on down to 1.
This comes up a lot, and has a name: N! means N times N-1 times N-2 times
all the numbers on down to 1; it is read "N factorial".
Now, we know the number of ordered ways to choose r objects out of a total
of N was found by taking N, multiplying it by N-1, and so on down to
N-(r-1). Now if we start where we left off and take N-r, and multiply by
N-(r+1), and keep going down to one, we will have (N-r)! (the number N-r,
factorial); the first number we started with is N-r, the next one is
N-(r+1). It must be N-(r+1); what number is one less than N-r? It is
N-r-1 which is the same as N-(r+1). We can multiply and divide
by (N-r)! without changing anything, and we find that the number
of permutations of N things taken r at a time is N!/(N-r)!.
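The permutation count can be checked against a direct listing. The sketch below uses Python's standard `itertools.permutations` to generate the ordered choices; the function name `n_permutations` is my own.

```python
from itertools import permutations
from math import factorial

def n_permutations(n, r):
    """Number of ordered ways to choose r of n objects: n! / (n - r)!."""
    return factorial(n) // factorial(n - r)

items = "ABCD"

# List the ordered choices directly and compare with the formula.
ordered_pairs = ["".join(p) for p in permutations(items, 2)]
ordered_triples = ["".join(p) for p in permutations(items, 3)]

print(ordered_pairs)
print(len(ordered_pairs), n_permutations(4, 2))
print(len(ordered_triples), n_permutations(4, 3))
print(n_permutations(4, 4), factorial(4))
```

Recent Python versions (3.8 and later) also provide `math.perm(n, r)`, which computes the same quantity.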
Now, what if we just want to consider the number of different combinations
of objects we might have gotten, regardless of the order? For instance,
we calculated the number of ordered ways to choose 3 objects out of a
total of 4 objects, and found 24 such ways. But ignoring order, there
are only four combinations: ABC, ABD, ACD, and BCD. When we considered
the ordering, each of these got counted six times, because there are six
different ways to order 3 things. There are six ways to order 3 things,
because we have 3 ways to choose the first one, 3-1=2 ways to choose the
second, and one way to choose the last; so there are 3 times 2 times 1
which is six different orders. So there are 24 ordered arrangements, and we
divide by six to get a total of 4 combinations of 4 things taken 3 at a time.
If we want to know how many
combinations of N objects taken r at a time there are, we can first
figure out how many permutations of N objects taken r at a time
there are, and divide by the number of orders of the r objects.
But we already know how many ways to order the r objects: this is r times
r-1 times all the numbers down to 1, which is r!. So the number of
combinations of N things taken r at a time is N!/(r! (N-r)!). This is
called "N choose r" and is written like this:
$$\binom{N}{r} = \frac{N!}{r!\,(N-r)!}$$
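Here is a short Python sketch of the count of combinations, checked against a direct listing. The function name `n_choose_r` is my own; `math.comb` is the standard-library equivalent in Python 3.8 and later.

```python
from itertools import combinations
from math import comb, factorial

def n_choose_r(n, r):
    """Combinations of n things taken r at a time: n! / (r! (n - r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))

# Unordered choices of 3 letters out of A, B, C, D.
triples = ["".join(c) for c in combinations("ABCD", 3)]

print(triples)
print(len(triples), n_choose_r(4, 3), comb(4, 3))

# 24 ordered arrangements, each combination counted 3! = 6 times.
print(24 // factorial(3))
```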
The Binomial Probability Distribution
Returning to the binomial probability distribution, we were in the
middle of calculating the chance that exactly two people out of the 51
would become infected after the needlestick injury. We figured out the
probability that the first and the second would get infected (but not
the others), which was the same as the probability that the second and
forty-third would get infected, etc. We just need to know how many
ways there are to choose these two infected people out of the 51 possibilities;
this is 51 choose 2, which is equal to 51!/(2! 49!). This equals 1,275 (how
would you like to have done all these by counting them all?!) Multiplying
this by the probability of any one ordering (which was 0.03 multiplied twice,
once for each of the 2 infected people, and 0.97 multiplied 49 times,
once for each of the 49 uninfected people) gives us 25.8%, which is
the chance that we would see exactly 2 infections.
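This two-infection calculation can be reproduced in a few lines of Python. It is a sketch with variable names of my own, assuming the 3% infection probability and independent exposures.

```python
from math import comb

n_people = 51     # exposed people
n_infected = 2    # how many we want to be infected
p_infect = 0.03   # assumed chance of infection per exposure

# Number of ways to choose which 2 of the 51 people are infected.
n_ways = comb(n_people, n_infected)

# Probability of one particular choice: two factors of 0.03, 49 factors of 0.97.
p_one_ordering = p_infect**n_infected * (1 - p_infect)**(n_people - n_infected)

# All choices are disjoint and equally likely, so multiply.
p_exactly_two = n_ways * p_one_ordering

print(n_ways)
print(f"{p_one_ordering:.3e}")
print(f"{p_exactly_two:.3f}")
```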
Finally, we can calculate the probability of getting r successes out of N
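independent Bernoulli trials. First, there are N choose r, or N!/(r! (N-r)!),
ways to choose the r successes out of the N trials. We have r factors
of p for the successes, and (N-r) factors of (1-p) for the failures.
Multiplying these together gives the probability of exactly r successes:
$$P(r \text{ successes}) = \frac{N!}{r!\,(N-r)!}\, p^r (1-p)^{N-r}$$
And that is all there is to the binomial formula.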
This is only one of two formulas we will really discuss in detail in this
class; we will build a great deal on it.
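To tie the pieces together, here is a minimal Python sketch of the binomial formula as a function, applied to the examples from these notes. The function name `binomial_pmf` and its argument order are my own choices; the sketch assumes independent trials with a common success probability p.

```python
from math import comb

def binomial_pmf(r, n, p):
    """Probability of exactly r successes in n independent Bernoulli trials."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 51, 0.03   # the needlestick example

# "At most 3": add the disjoint cases r = 0, 1, 2, 3.
p_at_most_3 = sum(binomial_pmf(r, n, p) for r in range(4))

# "At least 1": one minus the chance of no infections.
p_at_least_1 = 1 - binomial_pmf(0, n, p)

# Sanity check: the probabilities for r = 0..51 should add to 1.
total = sum(binomial_pmf(r, n, p) for r in range(n + 1))

print(f"P(exactly 2 infected) = {binomial_pmf(2, n, p):.4f}")
print(f"P(at most 3 infected) = {p_at_most_3:.4f}")
print(f"P(at least 1 infected) = {p_at_least_1:.4f}")
print(f"sum over all r = {total:.6f}")

# The other examples mentioned earlier: 3 heads in 5 coin tosses,
# and a 1 showing four times in 24 tosses of a die.
print(f"P(3 heads in 5 tosses) = {binomial_pmf(3, 5, 0.5):.4f}")
print(f"P(four 1's in 24 tosses) = {binomial_pmf(4, 24, 1/6):.4f}")
```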