Probability Distributions
We've spent some time studying the binomial distribution, which gives the
probability of getting r successes in N independent
Bernoulli trials. The sample space of the Binomial distribution is
the set of numbers {0,1,2,...,N-1,N}. Each of the events {r}, where
r is between 0 and N, has a positive (nonzero) probability when
the success probability is not 0 or 1, but somewhere in between. Of course,
numbers larger than N have probability zero - there is no way
on God's green earth to get, say, 47 successes out of only 30 Bernoulli
trials. But once you know N and p, you can calculate
the probability that r successes are observed in any one
instance of this experiment. All the events {r} are
mutually exclusive, and all events that you could want to take
the probability of can be written as a disjoint union of the events
{r}. So, for instance, the probability that there are
k or fewer successes can be computed by adding up the probabilities
that there are 0 successes, 1 success, and so on up to k successes.
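To make this concrete, here is a minimal sketch in Python of the binomial probability function and the "k or fewer successes" sum just described (the function names binomial_prob and binomial_cdf are my own, chosen for illustration):

    from math import comb

    def binomial_prob(r, N, p):
        """Probability of exactly r successes in N independent Bernoulli trials."""
        return comb(N, r) * p**r * (1 - p)**(N - r)

    def binomial_cdf(k, N, p):
        """Probability of k or fewer successes: add up the disjoint events {0},...,{k}."""
        return sum(binomial_prob(r, N, p) for r in range(k + 1))

    # For example, 10 trials with success probability 0.5:
    print(binomial_prob(4, 10, 0.5))   # P(exactly 4 successes)
    print(binomial_cdf(4, 10, 0.5))    # P(4 or fewer successes)
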
Call the number of successes in N independent Bernoulli trials
X. One says that X is a Binomial random variable,
or that X is a random variable with the Binomial probability
distribution. The random variable takes integer values anywhere
between 0 and N, and is an example of a discrete
random variable. For any of the values that the random variable could
take, the Binomial probability function we've been discussing
tells you the probability that the random variable will take that
particular value.
There are other probability distributions that a discrete random variable
(I'll define this term in more detail later) could have. We will discuss
the hypergeometric distribution soon; another, called the
Poisson distribution, has considerable practical importance too
and is discussed in the extra credit projects. There are huge books
filled with all sorts of interesting probability distributions for
different purposes.
Expectation Values
On the previous page, we examined the concept of the sample mean, and
we found that we could express the sample mean as a sum of terms that
look like x·f_x. Such a term is the
product of the value x times the relative frequency f_x with which x
occurs in the data set. Each term like this collects (through one
multiplication, rather than repeated addition) all the data points whose
value is x, already divided by the number of data points. When using this
formula, you are still adding up all the data values and dividing by
the number of data values, but you are doing so in a different order.
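As a quick check, here is a small Python sketch (using the standard collections module; the variable names are my own) showing that the value-times-relative-frequency sum gives the same number as the ordinary sample mean:

    from collections import Counter

    data = [1, 4, 2, 4, 3, 1, 1]

    # Ordinary sample mean: add everything up, divide by the count.
    plain_mean = sum(data) / len(data)

    # Same mean, reorganized: sum of value times relative frequency.
    counts = Counter(data)
    weighted_mean = sum(x * (n / len(data)) for x, n in counts.items())

    print(plain_mean, weighted_mean)   # both print 2.2857...
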
Also recall that if an experiment is replicated independently more and
more times, the relative frequency of some occurrence approaches the
probability of that occurrence. Suppose that we are
doing some experiment that is generating new data, independently.
For instance:
1
1,4
1,4,2
1,4,2,4
1,4,2,4,3
1,4,2,4,3,1
1,4,2,4,3,1,1
and so forth. We could generate running averages as we go, averaging
in each new data point as it arrives. The first average is just the
average of the single number 1, which is 1. The second is the average
of 1 and 4, which is 2.5. The
third is the average of 1, 4, and 2, which is about 2.333. The
fourth is the average of 1, 4, 2, and 4, which is 11/4, and so forth.
We may also keep track of the relative frequency of each of the possible
data values as we go. After the first data point has been received,
we have a relative frequency of 1's which equals 1/1. After the second
data point, the value 1 has a relative frequency of 1/2 and the value
4 has a relative frequency of 1/2. After the third, the value 1 has a
relative frequency of 1/3, the value 2 has a relative frequency of 1/3,
and the value 4 has a relative frequency of 1/3 also. And after the
fourth, the value 1 has a relative frequency of 1/4, the value 2 has
a relative frequency of 1/4, and the value 4 has a relative frequency
of 2/4. After every new data point, we can compute the relative frequency
of occurrences of all the data values.
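Here is one way to track both running quantities in Python; this is just a sketch that reproduces the numbers worked out above for the same sequence:

    from collections import Counter

    data = [1, 4, 2, 4, 3, 1, 1]
    counts = Counter()

    for n, x in enumerate(data, start=1):
        counts[x] += 1
        running_average = sum(v * c for v, c in counts.items()) / n
        relative_freqs = {v: c / n for v, c in counts.items()}
        print(n, running_average, relative_freqs)
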
Since we can think of each new data point (generated independently) as
a new experiment, we know that as more and more experiments are done,
the relative frequency should approach the probability of occurrence of
that data value. But the average is the sum, over all the data values,
of terms like x·f_x. If each of the
relative frequencies f_x is approaching
the probability that x will occur, that is, p_x,
shouldn't the whole sum start to look like a sum
of terms like x·p_x? Provided it is
possible to actually calculate this last sum (involving probabilities),
this turns out indeed to be the case.
If we have a discrete random variable X taking values x
with probability p_x, then the sum of
terms of the form x·p_x over all values
of x is called the expectation value of the random
variable, if it exists. (The catch is this: if I have a collection
of numbers, I can always calculate their sample mean. But it is possible
that a random variable could take infinitely many different values; the
sample space may be infinite. Sometimes it is possible to add
up infinitely many numbers; sometimes it isn't. For instance, if I
take the sequence 1, 1/2, 1/4, 1/8, 1/16, 1/32, and so forth, I can add
them up, and make the answer as close to 2 as I want by adding more and
more terms. But if I take the sequence 1, 1/2, 1/3, 1/4, 1/5, and so
on, and calculate 1, 1+1/2, 1+1/2+1/3, 1+1/2+1/3+1/4, and so on,
this can get as big as you want it to be if you're willing to add up
enough of these fractions. So sometimes you can add up infinitely
many numbers and sometimes trying to add up infinitely many
little things just doesn't work.)
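A short numerical illustration of the difference (just a sketch; the cutoffs of 50 and one million terms are arbitrary choices):

    # Partial sums of 1 + 1/2 + 1/4 + ... creep up toward 2,
    # while partial sums of 1 + 1/2 + 1/3 + ... grow without bound.
    geometric = sum(1 / 2**k for k in range(50))
    harmonic = sum(1 / k for k in range(1, 10**6 + 1))
    print(geometric)   # very close to 2
    print(harmonic)    # about 14.39, and still climbing as you add terms
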
When examining repeated independent random quantities, it seems
intuitively reasonable that if the relative frequencies
approach their corresponding probabilities, the sample mean should
approach the expectation value (if it exists). In some sense, if you
have enough data, it ought
to be possible to be fairly sure that the sample mean is going to be
close to the expectation value (if there is an expectation value).
Results of this form are called "laws of large numbers"; their precise
statement and proof are outside the scope of our class. One such
result is called the Weak Law of Large Numbers; another is called
the (Kolmogorov) Strong Law of Large Numbers.
If you write a Bernoulli random variable as having the value 1 when
a success occurs and 0 otherwise, you can calculate its expectation
as 0·(1-p) + 1·p = p.
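To see a law of large numbers in action, here is a small simulation sketch using Python's random module (the choice p = 0.3 and the checkpoints are arbitrary), in which the running mean of Bernoulli trials settles down near the expectation p:

    import random

    p = 0.3            # success probability, an arbitrary choice for illustration
    successes = 0
    for n in range(1, 100001):
        if random.random() < p:    # one Bernoulli trial
            successes += 1
        if n in (10, 100, 1000, 10000, 100000):
            print(n, successes / n)    # running mean drifts toward p = 0.3
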
Here is another simple example involving sampling with replacement.
Suppose we put 3 red marbles labeled "1" in a box, along with 2 white
marbles labeled "2" and 5 blue marbles labeled "3". If I shake the box
up and draw one at random, the number on the marble is a random variable.
The probability of drawing a red "1" is 3/10, the probability of drawing
a white "2" is 2/10, and the probability of drawing a blue "3" is 5/10.
What is the expectation value of the number on the marble? You can
calculate it by writing down each value this random variable can
take, multiplying each value by the probability that it occurs, and
adding the results together. Try this as an exercise.
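If you want to check your answer afterwards, here is a tiny sketch of the computation (the dictionary-of-probabilities representation is just one convenient choice):

    # Possible marble labels and their probabilities.
    probs = {1: 3/10, 2: 2/10, 3: 5/10}

    # Expectation value: sum of value times probability.
    expectation = sum(x * p_x for x, p_x in probs.items())
    print(expectation)
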