Home About us Mathematical Epidemiology Rweb EPITools Statistics Notes Web Design Search Contact us |
We have seen repetition accomplished by features of R which support the map idea, i.e. applying a function to each element of a collection, producing a collection of corresponding results. We have seen repetition implicitly in the filter operations we learned for vectors, in the fold functions such as sum and any, and in the use of outer. But there are problems for which these techniques are not enough, and we will need tools which support sequential repetition.
R has several features which support this kind of programming. We will learn several of these today.
Last time we looked briefly at the definition of factorial. The factorial of N, N!, is defined as the result of multiplying N by N-1 by N-2, on down to 1. For instance, 4! = 4 × 3 × 2 × 1.
We could define N! as being equal to 1 if N equals 1, and otherwise equal to N(N-1)! (for integers 2 and above). So if I wanted to compute 5!, I could just take 4! and multiply it by 5. Then to compute 4!, I could take 3! and multiply it by 4. To compute 3!, I multiply 2! by 3. Then 2! is two times one factorial, which is one. This is the classic textbook example of a recursive definition.
How can we implement this definition in R? We will have to be able to test whether a number equals one or not; if a number equals one, we will do one thing, and if a number is greater than one, we will do something else.
Before we go any farther, we're going to have to talk about how to make a decision using R. We've already seen how to do boolean comparisons. And we've seen how we can select elements from a vector (or table!) based on a boolean comparision. So in a way, we've been making decisions already. But this is not really enough. Suppose I want to do a computation if some condition is true, and not otherwise? I want to decide what to do, not find data.
Those of you who have programmed in other languages will recognize this as being a job for what is usually called an if/else statament. Let's do a simple example. If a variable aa is negative, then I want to set vv to be the text "negative", and if a variable is not negative, I want to set vv to be the text "nonnegative". We use the if statement:
> aa <- 3
[1] "positive" > |
This is really all there is to it. Let's do another simple example, a simulated random coin toss. We've already seen random numbers generated using rnorm, which generates simulated random samples from a normal (Gaussian) distribution. Another important function in R is called sample, which takes random samples from a vector. As we've been doing so far, we'll only show a few features of the function, and defer the full details until later.
So the sample space for a coin toss is heads and tails; we could represent these by "H" and "T", and so the sample space itself by the vector of these two character values. I'd like a sample of size one taken from this sample space. So here is how you do it:
> sample(c("H","T"),1) [1] "H" > |
Let's try this: let's toss the coin three times, and if all results are heads, we'll print a special message. We will toss the coin three times using sample, by using 3 as the second argument. Here, after we select one value from the sample space, we'll have to put it back for the next selection; we will have to make sure we replace the values we choose so they will be there the next time. We will do this with an additional clause replace=TRUE in the call to sample. So here it is:
|
Exercise. Suppose that a person has a one in three chance of acquiring the hepatitis B infection from a given exposure to contaminated blood. Simulate the exposure of a single person under this assumption; print the message "Infected" if the person is infected, and "Not infected" otherwise.
Exercise. Again suppose that a person has a one in three chance of acquiring the hepatitis B infection from a given exposure to contaminated blood, and a one in thirty chance of acquiring the hepatitis C infection from the same exposure. Suppose that the two infections are independent of each other, so that the chance of getting both is the chance of getting hepatitis B times the chance of getting hepatitis C. Simulate the exposure of a single person to both infections, and print "Coinfection" if they get infected with both, and print "No coinfection" otherwise.
Exercise. Simulate four independent coin tosses again as we did earlier. If there is exactly one head (no less than one, no more than one), then print the message "exactly one head"; otherwise, print "not exactly one head".
We will learn more about if as we proceed.
Let's try to define the factorial function then, recursively. Remember that we can always compute factorials using prod; we could compute 7! according to prod(1:7) for instance. But recursively, we define N! by saying that N! equals 1 if N equals 1, and otherwise it equals N(N-1)!.
So let's try it. Remember that to define a function, we use the form function(arglist){body}. The last expression evaluated in the body of the function becomes the return value. It's going to start like this:
> factorial <- function(nn) { Compute the factorial in here. + } > |
Now what are we going to do? We will have to check if the argument nn is equal to one or not. Then the result is the last expression in the body. We call the factorial function itself (within the body), but we call it for a smaller value of the argument.
> factorial <- function(nn) { + if (nn==1) { + result <- 1 + } else { + result <- nn*factorial(nn-1) + } + result + } > factorial(5) [1] 120 > |
How does this actually work? When we give the function a 5 as the argument, what happens? The first function call determines that 5 is not 1, and takes the else branch. It decides to multiply 5 by the result of factorial(4). A new copy of the function is called, this time with argument 4. As before, the function will determine that 4 is not 1, and take the else branch. So then it will decide to multiply 4 by the result of factorial(3). Another copy of the function will be called, and it will decide to multiply 3 by factorial(2). Then yet another copy of the function will decide to multiply 2 by factorial(1). This last version simply returns a 1; there are no more function calls after that. That 1 is returned to get multiplied by 2, so that result got set to 2. The two gets returned to get multiplied by three producing a new result of 6. This continues until the last function call returns the value of 120.
One way to get a better handle on this is to add some print statements to the body of the function. In R, print is a function which prints its argument:
> print("this string") [1] "this string" > |
> factorial <- function(nn) { + print("factorial called with value:") + print(nn) { + if (nn==1) { + result <- 1 + } else { + result <- nn*factorial(nn-1) + } + print("factorial returning value:") + print(result) { + result + } > factorial(5) [1] "factorial called with value:" [1] 5 [1] "factorial called with value:" [1] 4 [1] "factorial called with value:" [1] 3 [1] "factorial called with value:" [1] 2 [1] "factorial called with value:" [1] 1 [1] "factorial returning value:" [1] 1 [1] "factorial returning value:" [1] 2 [1] "factorial returning value:" [1] 6 [1] "factorial returning value:" [1] 24 [1] "factorial returning value:" [1] 120 [1] 120 > |
This function turns out not to be very efficient; prod(1:5) is actually better. Not only that, our factorial function does not conform to good R style: it does not handle a vector of input gracefully. Rather than fix it, let's just leave it as the simple introductory example that it is, and make sure you understand the technique of achieving repetition by defining a function in terms of itself - calling itself on a simpler version of the problem. And that when you do so, you need to have a base case where things finally stop.
Let's do a few simple examples, just for practice.
How about sum? We already have a sum function built in, but let's write one anyway for practice. If we have a vector of length one, then the sum of the elements is just the only element itself. If the list is longer than one, we could add the first element to the sum of all the rest of the elements.
> my.sum <- function(a.vector) { + if (length(a.vector)==1) { + result <- a.vector + } else { + result <- a.vector + my.sum(a.vector[-1]) + } + result + } > my.sum(c(1,2,3,4)) [1] 10 > |
How about reversing the elements of a vector? R has a builtin function called reverse too. Let's write our own for practice. If we had a vector with only one element in it, it would already be reversed. If I have a longer vector, I could take the first element, and take the rest of the vector. If we were to have reversed the rest of the vector, we could just put the first element at the end and we'd be done:
> my.reverse <- function(a.vector) { + if (length(a.vector)==1) { + result <- a.vector + } else { + result <- c(my.reverse(a.vector[-1]),a.vector[1])) + } + result + } > my.reverse(c(1,2,3,4)) [1] 4 3 2 1 > |
I'd like to begin by introducing a simple example which we'll use several times. I'd like to invite you to consider a needle which is being reused from patient to patient for blood draws. As all of us in public health know, this is contrary to accepted standards, because of the possibility of transmitting serious infections. We're going to consider quite a simple example which will help us understand programming.
Suppose that the needle has an infinite lifetime, that is, it can be used over and over forever. Of course, real needles wear out, and would be thrown away after they became dull. Suppose that the probability that a person is infected is p, and that the infection status of successive individuals are independent.
So let's start with an uninfected needle, and imagine that it is used on a person. We're going to assume that whenever a needle is used on an infected person, that the needle becomes contaminated. So after one use on a random person, what is the chance the needle is contaminated? It is equal to the chance the person was infected, and we've decided that that chance is p. If p for instance equals 0.2, then the needle has a 20% chance of being infected.
What comes next in the life of our needle? We suppose the needle is decontaminated, or rinsed, or at least flushed, before reuse. If a needle is not contaminated with the particular disease we're modeling, then the decontamination does not change its status. If the needle is contaminated, then we assume that with probability q the needle is decontaminated.
So over time, the needle would fluctuate between contamination and non-contamination. If decontamination were perfect, then the needle would always be clean before the next use, no matter how high the contamination probability were to be. If decontamination is almost perfect, then the needle would almost always be clean before the next use. Now, suppose that no one were infected to start with; then, the needle would never become contaminated with the agent we're modeling, and it would not matter whether or not it was decontaminated as far as this particular agent is concerned.
So the long-term status of the needle will result from a balance between the contamination probability p and the decontamination probability q.
So let's think about it. Suppose X(i) is the chance the needle is contaminated before the i-th use. We know the needle is not contaminated when it's new; the chance of contamination at the beginning is zero: X(1)=0.
Now, suppose the chance of being infected at the i-th use is known; we're calling it X(i). What is the chance it is infected before the next use? Let's use some common sense. What is the chance the needle will become infected given that it is not infected now? We know that is p(1-q). Why? This is the chance it will be used on an infected person, but will not be decontaminated.
But there's more. What is the chance that the needle will be infected next time given that it is infected now? In our simple model, it does not matter whether or not it is used again on another infected person; we're not modeling the degree of contamination. So the needle will be infected provided the decontamination did not work, i.e. with probability 1-q.
So either the needle is infected now or it isn't. If it's infected now, then with probability 1-q it will be infected next time. If it isn't infected now, then with probability p(1-q) it will be next time. Remember that the chance the needle is infected now is called X(i), and so the chance it isn't infected now has to be 1-X(i). So: the chance it's infected now times the chance it will still be infected next time given it's infected now, plus the chance it's not infected now times the chance it will be infected next time given it's not infected now is X(i)(1-q)+(1-X(i))p(1-q), and this is the chance it will be infected next time, X(i+1).
So if I were writing this in R, and the current value of the contamination probability were called xx, the contamination probability were pp, and the decontamination probability were qq, then we compute the next value of the needle contamination probability by something like this:
> newx <- (1-qq)*(xx+(1-xx)*pp) > |
So if I know the contamination probability at one time, I can compute it for the next time. Since I know it at the beginning, I can compute it for the first reuse. Then, I can compute it for the second reuse from that, and so on. But this is different than the map concept, because we have to do the computations in order1.
Note that the formula to compute the next contamination probability does not actually depend on the use number i. In a real public health setting, this may or may not be a useful assumption.
Let's think about how to program this. I'd like to be able to plot the contamination probability over time. We have a very different repetition problem here; we have a starting value, then a function that transforms this to the next one, and then we apply the same function to that value to get the next one, and so on. Actually, this is rather similar to the fold concept, isn't it? Unfortunately we don't have an R built-in function to do the job for us the way sum folds the addition through a vector. We're going to have to learn a new technique.
So let's think about it. How long do we want this repetition to go on? Certainly not forever. Let's decide we want to see 20 uses. So starting with an uncontaminated needle, let's plot the probability of contamination for each of the first 20 uses.
What do we expect the answer to look like? I'd like a vector of length 100, and I want the i-th element to be the chance of contamination before the i-th use. So perhaps we should have a function that accepts the current contamination probability, the current time, and the stopping time, and returns us a vector of the sequentially computed contamination probabilities. This would be nice.
But how are we going to keep track of it? How can we build up this vector? We already know how to elongate a vector; we can just use the c() function. So if the answers so far were stored in a vector called answers, and the new value were called newx, then we could use c(answers,newx) to just append newx to the vector. We're going to have to start at the beginning, building up (somehow) a list of answers. We start with the current time set to 1, representing the state of the needle before the first use. Each time we add a new value to the running vector of answers, then we will move the current time up by one. When the current time equals the end time, we're done, and we could just return the list of answers we've built up.
So we might imagine something like this:
> needle <- function(answers,curtime,endtime,pp,qq) {
|
But of course this is not a function yet. We can't just tell the computer "Somehow compute what the answers should be." How can we compute the other answers?
We can just use the needles function itself - the very function we are defining! What we are going to do is call needles on a simpler problem. We'll compute one answer, add it to the list of answers we're building up, increase the current time, and just call needles again. It seems like cheating, but it will work.
> needle <- function(answers,curtime,endtime,pp,qq) {
|
And now let's try this function out. We'll first do 10 needle uses, with a contamination probability of 0.2, and a decontamination probability of 0.1. Then we'll plot the contamination probability for the first 20 uses.
> needle(0,1,10,0.2,0.1) [1] 0.0000000 0.1800000 0.3096000 0.4029120 0.4700966 0.5184696 0.5532981 [8] 0.5783746 0.5964297 0.6094294 > res <- needle(0,1,20,0.2,0.1) > plot(res,type="l") |
So this works. We've learned a new way to repeat a computation. Namely, we can ask a function to call itself on a simpler version of the problem, building up the answer as we go. When we reach the end, we simply return the values we've built up.
We could do many more examples of repetition using this method, called recursive function definitions. We can do any possible repetitive calculation in this way. But while it is often the most convenient solution to a problem, it is not always convenient and it is not always efficient. Moreover, R does not allow you to do a great deal of repetion this way; it does not like to nest function calls too deeply. So we will need to learn some alternative methods for doing repetition, and so next time we will learn about while and for loops.
Some languages use loops exclusively, and don't allow recursive function calls at all for repetition (such as FORTRAN 77). Other languages seem to rely almost entirely on recursive function calls and provide minimal or even no support for loops (such as ML and Scheme). R gives us the choice, and we'll learn more next time.