Home About us Mathematical Epidemiology Rweb EPITools Statistics Notes Web Design Search Contact us |
Last time we were interested in the logit function, a transformation of proportions or probabilities that arises frequently in epidemiology and biostatistics. We learned that the logit of p is defined to be the logarithm of the odds p/(1-p). We wanted to explore some properties of the logit numerically. We wanted to calculate the logit of some numbers, so we called the logit function over and over again.
We wished to repeat a calculation by applying a function to a collection of values, producing a corresponding collection of results. This is a fundamental way to achieve repetition in programming. For this course, we'll call this the map concept, inspired by the function mapcar in Common Lisp.
The idea of applying a function to a collection of values to produce the corresponding collection of results is just that - an idea. It is a solution to the basic programming problem of how to repeat a task. We can use it to solve the specific problem of how to compute the logit of many probabilities all at once. But there are many other examples. We may wish to repeat a risk analysis for many different values of selected input parameters in a sensitivity analysis. We may wish to repeat a statistical estimation step for many different bootstrap samples of the original data. Conceptually, we just need a function that we could apply to something, and a collection of things to apply the function to.
In the case of logit or sqrt or many (if not most, or even nearly all) functions we might want to use, the result of one computation does not depend on any other calculations. If I want the sqrt of the numbers in a collection consisting of 9, 100, and 49, we know we will obtain 3, 10, and 7. The square root of 9 is always 3, no matter what you just took the square root of, and no matter what you're about to take the square root of. Each compuation is independent of the others.
If you had a computer with several central processing units, you could imagine that you could give the first processor the 9 to take the square root of, the second one the 100, and the third one the 49. They could then work independently and at the same time. Now most of us are working on single CPU systems and this does not happen1 when we use R. Still, the fact that the computations are independent is conceptually important. Some languages, such as Fortran 90/95 and Sisal, have special constructions for this sort of problem. In principle, if you can promise the computer that the computations do not interact with each other, the machine is free to do the calculations in the manner most efficient for it.
The idea of applying a function independently to a collection of values to produce the corresponding collection of results (what we're calling the map concept may be implemented in many different ways. There are several different collections we will learn about, and there are several different ways that the repeated computation may be achieved. The map concept is a programming idiom, a solution to the problem of how to do certain computations many times. It is not a program, a function, or a construct specific to any particular language.
There are several ways to implement the map concept in R, and other languages provide different features. In R, we shall see that the use of vectorized functions is the most common, or at least the most natural, way of achieving the map concept.
The most fundamental collection (or type of plural noun, so to speak) in R is the vector. The vector in R has many capabilities that we will learn over time. Fundamentally, an R vector is a homogeneous collection of data values that is addressable by number. This illustrates a vector of seven elements:
|
|
|
|
|
|
|
In R, a vector is a homogeneous collection. We can have a numeric vector, or we can have a character string vector, or we can have a boolean vector. There are a few others as well. But we can't have a vector that is of mixed type. All the elements have to be of the same type. And by the way, functions are one type of data that R does not make vectors of.
It is also important to realize that the elements of a vector are addressable in sequence, and there are no gaps. If you have an element one, and a fourth element, then you also must have a second and a third element.
Enough talk. Let's see some actual vectors. What if I want a vector with a 5 in the first cell, a 28 in the second cell, and a -3.2 in the third? Or what if I want a "CA" in the first cell, an "NV" in the second, an "AZ" in the third, and an "OR" in the fourth? How can I make my own vectors to be anything I want them to be?
The function c() collects its arguments together and constructs a vector from them:
> new.vector <- c(5,28,-3.2) > new.vector [1] 5.0 28.0 -3.2 > is.vector(new.vector) [1] TRUE > is.numeric(new.vector) [1] TRUE > |
> systolic.bp <- c(130,128,141,121,108) |
But we can also collect character (string, text) data into vectors as well:
> states<-c("CA","NV","AZ","OR") > is.vector(states) [1] TRUE > is.numeric(states) [1] FALSE > is.character(states) [1] TRUE |
We can also have boolean (logical) vectors:
> bool.vector <- c(TRUE,FALSE,TRUE) > is.vector(bool.vector) [1] TRUE > is.numeric(bool.vector) [1] FALSE > is.character(bool.vector) [1] FALSE > is.logical(bool.vector) [1] TRUE |
Vectors can have only one type of data in them, in R. The function c() will try to construct a vector of one type out of its arguments. If all the arguments are numeric, then the vector is numeric. If all the arguments are logical/boolean, then the vector is logical/boolean. If all the arguments are character, then the vector is character. But if some arguments are numeric, and others character, the numeric arguments get converted to character, and you get a character vector. If some arguments are boolean and others numeric, the boolean TRUE becomes 1 and the FALSE becomes 0, and you get a numeric vector. If some arguments are boolean and others character, the boolean TRUE becomes "TRUE" and FALSE becomes "FALSE", and you get a character vector. Vectors have to have only one type of data, and the function c() will coerce values to other types to make this happen. Here are some examples:
> c(1,"aa") [1] "1" "aa" > c(4,TRUE) [1] 4 1 > c("aa",TRUE) [1] "aa" "TRUE" > c(4,"aa",TRUE) [1] "4" "aa" "TRUE" > |
Here is another useful function which creates a vector. In this case, we use the colon operator : to produce specific vectors, namely sequences of values:
> 1:10 [1] 1 2 3 4 5 6 7 8 9 10 > |
> 39:2 [1] 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 [26] 14 13 12 11 10 9 8 7 6 5 4 3 2 > |
Let's assign a vector to a variable:
> a.new.sequence <- 4:9 > a.new.sequence [1] 4 5 6 7 8 9 > length(a.new.sequence) [1] 6 > |
Let's look at the use of is.vector, a function that will tell us if something is a vector:
> a.new.sequence <- 4:8 > a.new.sequence [1] 4 5 6 7 8 > is.vector(a.new.sequence) [1] TRUE > is.vector(7) [1] TRUE > is.vector(TRUE) [1] TRUE > is.vector("CA") [1] TRUE > is.vector(sqrt) [1] FALSE > |
So wait a minute. If I just said 2+2 produces a vector of length one:
> answer <- 2+2 > length(answer) [1] 1 > is.vector(answer) [1] TRUE > |
> answer <- (1:10)+(4:13) > is.vector(answer) [1] TRUE > answer [1] 5 7 9 11 13 15 17 19 21 23 > |
The next thing we need to learn is how to reference individual elements of a vector. This is done with the bracket notation:
> systolic.bp <- c(130,128,141,121,108) > systolic.bp[1] [1] 130 > systolic.bp[4] [1] 121 > systolic.bp[8] [1] NA |
We can use the bracket notation to assign values as well:
> u <- 1:10 > u[2] <- 23 > u [1] 1 23 3 4 5 6 7 8 9 10 > |
What if we assign to an element that does not exist? Let's begin with a vector of length five, and for definiteness the first element will be a nine, the second an eight, and so forth. Then I am going to try to assign the value 100 to the sixth element.
> u <- 9:5 > u[6] <- 100 > u [1] 9 8 7 6 5 100 > |
What if we do the same example, but assign to element seven? We have a vector of length 5, and so there is no sixth element and no seventh. We're jumping past the sixth element and assigning to position 7:
> u <- 9:5 > u[7] <- 100 > u [1] 9 8 7 6 5 NA 100 > |
With character vectors, gaps like this are filled with empty strings:
> u <- c("D","P","H") > u[5] <- "F" > u [1] "D" "P" "H" "" "F" > |
What if you try to assign a character to an element in a numeric vector?
> u <- c(1,2,5) > u[5] <- "F" > u [1] "1" "2" "5" "" "F" > |
Question: explain the following behavior:
> u <- c(1,2,5) > u[5] <- 8 > u [1] 1 2 5 NA 8 > u[7] <- "F" [1] "1" "2" "5" "NA" "8" "" "F" > |
Question: how are undefined elements in boolean vectors filled as you expand beyond the end of the vector? Make up a simple boolean vector of length 3, and then assign to the fifth element. What is the fourth element of the result? Then assign the sixth element with a numeric value. What are the other elements coerced to? Go back and make up a new boolean vector of length three, and then assign element 4 to be a character string. Now what are the boolean values coerced to when the vector is changed to a character vector?
Let's begin by taking the square root of different numbers. We will place the values into a collection, and apply a function to each element in the collection, resulting in a corresponding collection of results. Recall that we call this idea the map concept. In R, this concept is frequently implemented with a vectorized function, i.e. a function which automatically operates on vectors. The function sqrt will take the square root of every number in a vector of numbers, and return a vector of the answers. Only once is sqrt called, though internally many square roots are computed.
> sqrt(c(1,2,3,4)) [1] 1.000000 1.414214 1.732051 2.000000 > sqrt(1:4) [1] 1.000000 1.414214 1.732051 2.000000 > |
> abs(c(-1,1.8,-0.5,2,-80.3,-1e-7)) [1] 1.00e+00 1.80e+00 5.00e-01 2.00e+00 8.03e+01 1.00e-07 > |
And the logarithm function is vectorized too:
> log(c(1,2.7182818,10)) [1] 0.000000 1.000000 2.302585 > vals <- c(1,2.7182818,10) > log(vals) [1] 0.000000 1.000000 2.302585 > |
It is of fundamental importance to understand that in the logarithm example above, the function was called with only a single argument. That argument was a vector produced by the function c; this vector is a compound object, a collection containing more than one value. But it is still a single object and it constitutes a single argument for the log function. You can see this more clearly in the second part of the example, where we assign the vector to the variable vals and call the log function with vals as the single argument.
We alluded earlier to the fact that the basic arithmetic functions were vectorized. Let's look again at some examples. We'll first define two example vectors, xx and yy, and then we'll look at arithmetic operations with them.
> xx <- c(8,5,9,10,2,-7,-2.2) > yy <- c(1,5,-6,0,-8,3,-1) > xx+yy [1] 9.0 10.0 3.0 10.0 -6.0 -4.0 -3.2 > xx-yy [1] 7.0 0.0 15.0 10.0 10.0 -10.0 -1.2 > xx*yy [1] 8.0 25.0 -54.0 0.0 -16.0 -21.0 2.2 > xx/yy [1] 8.000000 1.000000 -1.500000 Inf -0.250000 -2.333333 2.20000 > |
It is important to pay special attention to the behavior of * and /. When applied to two vectors of equal length, the * operator produces an elementwise (or Hadamard) product, not any other product you may have heard of (though we'll soon enough see how to compute other products).
Let's see what happens when you place a vector on one side of an arithmetic operator and a single number (a vector of length one) on the other side:
> xx <- 1:7 > 2*xx [1] 2 4 6 8 10 12 14 > xx/2 [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 > xx [1] 1 2 3 4 5 6 7 > |
We should also say that you can have the single number in the numerator and the long vector in the denominator:
> xx <- 1:7 > 4/xx [1] 4.0000000 2.0000000 1.3333333 1.0000000 0.8000000 0.6666667 0.5714286 > |
When the vectors are of different lengths, the arithmetic operators repeat the shorter one until they get a vector that is as long as the longer one, and then the elementwise operation is performed. This is what is happening in the examples we just did. It is interesting to experiment with the behavior of the system using different examples, but we will come back to this in more detail later.
Let's now look back at the addition and subtraction operators.
> xx <- 1:7 > xx+2 [1] 3 4 5 6 7 8 9 > xx-5 [1] -4 -3 -2 -1 0 1 2 > 10-xx [1] 9 8 7 6 5 4 3 > -1-xx [1] -2 -3 -4 -5 -6 -7 -8 > |
Now, let's look back at our definition of the logit function. Remember that we defined it in R as:
> logit<-function(pp){log(pp/(1-pp))} > |
Wouldn't it be nice to calculate the logit of a lot of values between 0 and 1 and plot them all? Let's take a moment to learn how to do this. This will introduce a new useful function, and it will give us a chance to learn how to produce a plot.
First, let us learn how to produce a sequence of evenly spaced values, collected together in a vector. We already know the sequence operator, so we could get ten evenly spaced values by using the colon operator and elementwise division:
> (1:9)/10 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 > |
> xx <- seq(0.01,0.99,by=0.01) > |
Once we have this vector of values to take the logit of, we can take the logit of them all at once because as we have seen, our logit function is vectorized. We are repeating a computation by applying a function to a vector of values and calculating a vector of corresponding logits in an instance of the map concept. The calculation is done with a single call to the vectorized function logit. We no longer need to use duplication to do the job of repetition:
> xx <- seq(0.01,0.99,by=0.01) > yy <- logit(xx) > |
Now, I want to show you how to do a simple line plot. We will use the plot function. The plot function has many options and can do a very great deal, and we're just going to scratch the surface today. We're going to give it three things: the vector xx to be plotted on the horizontal axis, the vector yy to be plotted on the vertical axis, and finally an instruction to produce a line plot (to join the points up with a line). Here it all is:
> xx <- seq(0.01,0.99,by=0.01) > yy <- logit(xx) > plot(xx,yy,type="l") > |
Now that we know about vectors, it's a good time to introduce some of R's features for statistical computation. We will begin by learning how to generate a collection of values from the normal distribution. Recall that the normal distribution describes or at least approximates a variety of types of observations, typically when the value is the result of many tiny random and independent effects. For instance, if someone fires a bullet at a target, there are lots of tiny air currents, muscle twitches or other vibrations in the person who aims it, irregularities in the bore of the gun the bullet must travel through, irregularities in the surface of the projectile that create differences in the airflow around it, and so on. In public health, the height of a person may be influenced by many genetic and environmental factors. The normal distribution has played a central role in biometrics and biostatistics since the work of the pioneers Galton, Pearson, Gossett, and Fisher and its applications are almost limitless. All of you have seen the normal distribution, sometimes called a Gaussian distribution, and occasionally called a "bell curve" (though there are lots of distributions that have a bell-shaped curve).
Recall that a normal distribution is characterized by a mean and a standard deviation. The mean is a measure of the central tendency of the observations, and the standard deviation a measure of the spread of the data or observations around the mean. For today, I want to show you how to simulate systolic blood pressure measurements from a population with a mean systolic blood pressure of 130 millimeters of mercury and a standard deviation of 5 millimeters of mercury. I wish to simulate 100 such individuals; in other words, I want a vector of 100 random observations taken from a normal distribution with mean 130 and standard deviation 5. The R function rnorm accomplishes this:
> bp <- rnorm(100, mean=130, sd=5) > length(bp) [1] 100 |
Then let's plot a histogram of these newly-generated simulated data points using the R function hist, and let's have 10 classes or bins in the histogram:
> bp <- rnorm(100, mean=130, sd=5) > hist(bp, nclass=10) > |
Next time we will move further with our study of vectors. We will learn about operations that work on whole vectors to produce numbers, such as mean, sum, median, and var. We will also learn about how the comparison operators such as < and == are vectorized. We will learn about the vector properties of the boolean operators such as & and |, and how these can be used to perform fundamental database operations in R vectors such as selecting data values that exceed a certain cutoff, or removing missing values from a data vector. Finally, we will see how R represents tables (multidimensional arrays).
1 The parallel processing revolution is one of those things that always seems to be right around the corner. But actually, parallel computation is done every day. See Numerical Recipes in Fortran 90: The Art of Parallel Scientific Computing by W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Cambridge University Press, 1999.
2 This fencepost example is actually widespread among programmers. Quick: if you do a computation starting with 5 and ending at 3709, how many computations do you do? The answer is 3709-5+1, or 3705. Sometimes people forget to add the one in at the end (so they would say 3704, which is worng). This is called a fencepost error.
3 In many languages, it is necessary to use a function that might be called map or something like map to automatically apply a particular function separately to each element of a collection to produce a collection of results. Such higher-order functions realize or implement the idea we're calling the map concept just as well as R's vectorized functions and operators (though more function calls may be needed). In fact, such techniques are useful from time to time in R, though vectorized functions are in a sense the most native idiom of R (and of course its predecessor language, S).