While this is not a probability class, we need some probability theory if we're going to be simulating random processes.
Of course, the random processes we're interested in are things like the chance of becoming infected following a needlestick injury, or the chance of developing TB from latent or inactive infection.
I'd like to first remind you all that probability is a set function defined on suitable collections of sets, and that it has certain properties which we'll talk about.
First, because (in the frequentist world) probability is defined to be a long-run limiting relative frequency, it has to be between zero and one, inclusive. What does this mean? Suppose I have some event that might or might not occur when an experiment is run. If I repeat the experiment N times, suppose the event happens A times. (Of course, what does it mean to repeat the experiment? Why don't I get the same thing?) So the relative frequency is A/N. Since A can be no smaller than 0 and no larger than N, the relative frequency is bounded by 0 and 1 inclusive. So therefore probability, meant to represent a long-run limiting relative frequency, must be between zero and one; this follows because of the way limits are defined. These and related matters are beyond the scope of our course, but the concept is in fact commonsensical.
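Here's a minimal sketch of this in R (the event and its true probability of 0.3 are just made up for illustration; rbinom simulates whether the event occurred on each repetition):
> N <- 10000                               # number of repetitions of the experiment
> happened <- rbinom(N, size=1, prob=0.3)  # 1 if the event occurred, 0 if not
> relfreq <- cumsum(happened)/(1:N)        # running relative frequency A/N after each repetition
> plot(1:N, relfreq, type="l", log="x")    # settles down near the true probability 0.3
> abline(h=0.3, lty=2)                     # reference line at the true probability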
Probabilities are always defined on certain collections, starting with a sample space U, the set of possible outcomes of the experiment. Then various subsets of U have probabilities associated with them. Always P(U)=1; the whole sample space has to have a probability and that probability has to be one; every time you do the experiment you get something in the sample space, so the relative frequency is always one and the limit has to be one. Similarly the empty set always has a probability of zero. So you have a bunch of subsets of U that are going to have probabilities associated with them; these subsets are called events. For all elementary problems (with finite sample spaces) the set of all subsets of U is the collection of events.
The basic rule is that probability is additive; if two sets don't overlap, the probability of their union is the sum of the probabilities. (The formal rule states that any countable union of events is an event and that when all the events are disjoint, the probability of the union is the sum of the probabilities).
A consequence of this is the complement rule: A and Ac (the complement of A) have no elements in common (they are disjoint), and the union of A and Ac is U itself, so 1=P(U)=P(A u Ac)=P(A)+P(Ac). Therefore P(Ac)=1-P(A). This is easy to see from the relative frequency standpoint, and it just says the chance of something not happening is one minus the chance of it happening.
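As a quick numerical illustration (a sketch only; I simulate rolls of a fair die with sample and take A to be the event that the roll is at most 2):
> rolls <- sample(1:6, 100000, replace=TRUE)  # many rolls of a fair die
> mean(rolls <= 2)                            # relative frequency of A; close to 1/3
> 1 - mean(!(rolls <= 2))                     # one minus the relative frequency of Ac: the same number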
Exercise. Suppose that the union of the event A and the event B is the event C, and that A and B are disjoint. Prove P(B)=P(C)-P(A).
I'll provide some links to other material on this web site that reviews basic probability in more detail.
We're often interested in random variables that take values either in a discrete set (finite or countably infinite) or on a continuum of real values.
The Bernoulli distribution is a very important probability distribution, having only two outcomes, traditionally called success and failure, as we've seen, and having a single parameter: p, the probability of success.
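As a minimal sketch, a Bernoulli draw is just a binomial with size=1, so rbinom can simulate it (the success probability 0.2 here is arbitrary):
> rbinom(10, size=1, prob=0.2)   # ten Bernoulli trials, each equal to 1 with probability 0.2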
Exercise. What is another distribution with a finite sample space?
The geometric distribution, which we've seen, is a distribution with a countably infinite sample space. Soon we'll see another, the Poisson.
The normal distribution has a real-valued sample space and is used all the time in statistics; you've seen it before. The normal distribution is said to be absolutely continuous. Another absolutely continuous distribution we'll use all the time is the uniform, and another is the exponential.
We'll also see examples of mixed distributions. There are also distributions called singular distributions; these are usually cooked up by mathematicians, using fractal distribution functions (like the Cantor function). We won't spend time on them here, but for those of you who study advanced probability, know there are interesting things out there to learn.
So let's get to work. Suppose X is a random variable. If the sample space U is finite or countably infinite, we can represent it as the collection u1, u2, and so forth. If we specify the probability that X=u1, X=u2, and so on, for every element in the sample space, then we've completely specified the distribution. This is called the probability function.
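For a concrete sketch, take the binomial distribution with 3 trials and success probability 0.5 (my choice of example); its sample space is {0, 1, 2, 3}, and dbinom evaluates the probability function at each point:
> dbinom(0:3, size=3, prob=0.5)       # P(X=0), P(X=1), P(X=2), P(X=3)
[1] 0.125 0.375 0.375 0.125
> sum(dbinom(0:3, size=3, prob=0.5))  # the probabilities over the whole sample space add to one
[1] 1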
But if the distribution is continuous (I mean absolutely continuous in a certain sense, but won't pedantically talk about this anymore; for mathematical rigor you know where Evans Hall is), this isn't good enough. The problem is that the probability of getting any particular single value is always exactly zero. Think about the chance a normal random variable would equal zero. All digits in the decimal expansion out to infinity have to vanish. This has probability zero, although values near zero are more likely than values near any other point. So each element in the sample space has probability zero, but the union of them all has probability one (this is an uncountable union). So probability functions are useless for real-valued random variables in general. What do you do?
What the mathematicians did is set up a different kind of bookkeeping for us. It starts with intervals: if you can say what the chance of the random value falling into any interval is, you can consistently specify a distribution. And for single real random variables, that is what the cumulative distribution function does for you.
A function F is the cumulative distribution function of the real-valued random variable X if F(x)=P(X<=x). It gives you the chance that X is less than or equal to x.
Let's see one. How about the cumulative distribution function of the standard normal? Or at least the middle part of it, around zero:
> xx <- seq(-3,3,by=0.01)
> plot(xx,pnorm(xx),type="l")
How do we get the chance of being in an interval from a to b? In other words, what is the chance of being between a and b? It is the chance of being less than b but greater than a. (Since the normal is absolutely continuous, the chance of equaling one particular value a or b is always zero.)
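For instance (a small sketch with the standard normal and the arbitrary choices a=-1, b=1):
> pnorm(1) - pnorm(-1)   # chance a standard normal falls between -1 and 1
[1] 0.6826895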
Exercise. Prove P(a < X < b) = P(X < b) - P(X < a) under the assumption that P(X=a)=0 and P(X=b)=0. Express this in terms of the cumulative distribution function.
Exercise. The cumulative distribution function of the uniform distribution (standard uniform) can be computed by punif in R. Plot this function over the range -1 to 2.
Closely related is the probability density function. The probability density function of X is constructed so that the area under the curve between a and b is the chance of the value of X being in the interval. The area from minus infinity up to a is the chance of being less than or equal to a, and equals the value of the cumulative distribution function, F(a). For those of you who are comfortable with calculus, the integral of the probability density function is the cumulative distribution function.
> xx <- seq(-3,3,by=0.01)
> plot(xx,dnorm(xx),type="l")
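For those comfortable with calculus, here's a sketch of a numerical check of that claim using R's integrate: the area under dnorm from minus infinity up to 1 should match pnorm(1).
> integrate(dnorm, -Inf, 1)   # area under the standard normal density up to 1
> pnorm(1)                    # the cumulative distribution function at 1; both are about 0.841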
Exercise. The function dunif computes the uniform density. Plot this density; what do you see? Try this with different endpoints (use ?dunif to find out how to do this).
Exercise. The gamma distribution comes up all the time in applied probability and statistics. (For example, you've seen a special case of it called the chi-square distribution). Explore the gamma density for different values of the shape and scale parameters, using dgamma. Use ?dgamma to learn how to use the function. After plotting several interesting gamma densities (make sure you try at least the values 0.1, 1, 2, and 5 for the shape parameter), then use rgamma and hist to generate random samples from the same distributions. Compare the histogram with the density functions you plotted.
You're familiar with this from data; you could take a collection, for instance, of test scores, and then determine the test score such that 50% of them fall on one side and 50% on the other. This would be the sample median. The population median is the value such that a randomly chosen value has a 50% chance (probability 0.5) of falling on either side.
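A small sketch of the distinction, assuming (purely for illustration) that the scores come from a normal distribution with mean 70 and standard deviation 10, so the population median is 70:
> scores <- rnorm(200, mean=70, sd=10)   # a sample of 200 simulated test scores
> median(scores)                         # the sample median: close to 70, but varies from sample to sample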
In general you can pick a probability value, such as 30% (0.3), and determine the population value such that a randomly chosen value has a 0.3 chance of falling below (being less than) this value, and therefore a 0.7 chance of falling above it (being greater than it). This would be the population 30th percentile, or equivalently, the 0.3-quantile. You could determine population quantiles for any value between 0 and 1.
If your population is described by (say) a normal distribution with a given mean and standard deviation, you can use the R function qnorm to determine this value. The mean defaults to zero and the standard deviation (sd) to 1 (giving you a standard normal). If you want a different mean and standard deviation, you may specify them as shown below.
> qnorm(0.025)
[1] -1.959964
> qnorm(0.5,mean=50,sd=5)
[1] 50
> qnorm(0.2,mean=50,sd=5)
[1] 45.79189
> qnorm(0.8,mean=50,sd=5)
[1] 54.20811
So this can go two ways: we can ask what the chance is that a random variable X will be less than or equal to some value (this is computed by the cumulative distribution function). And you can ask what value of the random variable corresponds to a given amount of probability on one side of it (specifically: for what x is P(X<x)=q for various values of q). So the quantile function is essentially the inverse of the cumulative distribution function. (There are some minor irritations when dealing with discrete distributions.)
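A quick sketch of that inverse relationship for the normal (qnorm undoes pnorm and vice versa, up to the usual numerical precision):
> pnorm(qnorm(0.3))   # start with a probability, find the quantile, go back
[1] 0.3
> qnorm(pnorm(1.3))   # start with a value, find its probability, go back
[1] 1.3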
Now, every time you draw a random value x from some distribution, you could compute the percentile or quantile q corresponding to it. So we'll call the random variable X, and the quantile Q; we get Q by computing F(X). I'd like to think about the distribution of Q. Don't lose track: what is Q? I get a random value from a distribution, and then I find out the chance a random variable from that distribution would be less than (or equal to) it; what percentile value corresponds to it--is it at the 12th percentile (q=0.12), the 81.7th (q=0.817), 48th (q=0.48), or whatever? The value Q is this random value. Let's think about it: if I get some value of X, say x, and find that q=F(x)=0.12, then I know that the chance that X is less than this x is 0.12. But whenever X<x, Q<q! Whenever X gets bigger, Q gets bigger; whenever X gets smaller, Q gets smaller. So I've just learned something about the random values Q: P(Q<q)=q for every possible q.
So I know what the distribution of Q has to be, because I know its cumulative distribution function: Q has a standard uniform distribution! The slope of that CDF is constant, so every little interval of a given length has the same chance of containing the next random value of Q. So whatever the distribution of X is, the distribution of the values F(X)=Q is always standard uniform.
Exercise. Sample 1000 random values from the standard normal and then transform them according to the cumulative distribution function of the standard normal. Make a histogram of the values. (Use rnorm, pnorm, and hist).
It is important to be able to simulate random variables from a given distribution. Remembering that the cumulative distribution function (CDF) completely specifies the distribution of a real-valued random variable, it would be nice if there were a way to use the CDF to generate random values from that distribution. Why would you want to do this?
Well, remember last time we were talking about the tuberculosis B-notification cost-effectiveness model. We considered the chance that individuals who had radiographic abnormalities at a baseline screening would subsequently develop tuberculosis. We assumed a constant annual risk of reactivation. But it turns out that the evidence suggests that the risk actually declines over time. One simple model that has been used (cf Bryan Lewis) is to assume an exponentially declining annual risk of reactivation. Now the time it takes to reactivate is not so simple. One way to model this is to use a loop and rbinom, and I leave this as an exercise. Another way is to generate random variables directly from whatever the distribution is; what is the distribution of reactivation times under a declining annual risk? There's no R builtin function for this, so we'll have to be able to do it ourselves. So simulation from an arbitrary distribution is a common problem.
Luckily, there is a standard way to do this. To keep the discussion simple, let's assume as before that the variable of interest is continuous. The key is to do the reverse of what I just showed you. We know that the distribution of quantile values (Q=F(X)) is always standard uniform, as you just saw. So why not sample from the standard uniform, and then find what value of X would be the given quantile? In other words, if we sample from U and get, say, u=0.4, we find the 0.4-quantile (40th percentile), i.e. the value x0.4 such that the next value of X has a 40% chance of being smaller than x0.4. We use the inverse of the cumulative distribution function (i.e. the quantile function) to do so: we sample from the standard uniform U, and we then transform these values to the quantiles: Y=F^-1(U). Then Y and X have the same distribution.
Exercise. Prove that F^-1(U) and X have the same distribution.
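Here's a minimal sketch of the idea for the exponential distribution with rate 1 (chosen just because its quantile function qexp is built in, so nothing spoils the exercises below): sample from the standard uniform, push the draws through the quantile function, and compare the histogram with the exponential density.
> u <- runif(10000)               # standard uniform draws
> y <- qexp(u)                    # transform by the exponential quantile function F^-1
> hist(y, prob=TRUE, breaks=50)   # histogram of the transformed values
> curve(dexp(x), add=TRUE)        # the exponential density, for comparison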
Exercise. Use runif and qnorm to construct random values from the standard normal using the probability-integral transform. Note: it turns out that while the probability-integral transform works for normal random variables, there is a better special-purpose method called the Box-Muller method that is actually used in practice.