The Map Concept

Last time we were interested in the logit function, a transformation of proportions or probabilities that arises frequently in epidemiology and biostatistics. We learned that the logit of p is defined to be the logarithm of the odds p/(1-p). We wanted to explore some properties of the logit numerically. We wanted to calculate the logit of some numbers, so we called the logit function over and over again.

We wished to repeat a calculation by applying a function to a collection of values, producing a corresponding collection of results. This is a fundamental way to achieve repetition in programming. For this course, we'll call this the map concept, inspired by the function mapcar in Common Lisp.

The idea of applying a function to a collection of values to produce the corresponding collection of results is just that - an idea. It is a solution to the basic programming problem of how to repeat a task. We can use it to solve the specific problem of how to compute the logit of many probabilities all at once. But there are many other examples. We may wish to repeat a risk analysis for many different values of selected input parameters in a sensitivity analysis. We may wish to repeat a statistical estimation step for many different bootstrap samples of the original data. Conceptually, we just need a function that we could apply to something, and a collection of things to apply the function to.

In the case of logit or sqrt or many (if not most, or even nearly all) functions we might want to use, the result of one computation does not depend on any other calculations. If I want the sqrt of the numbers in a collection consisting of 9, 100, and 49, we know we will obtain 3, 10, and 7. The square root of 9 is always 3, no matter what you just took the square root of, and no matter what you're about to take the square root of. Each compuation is independent of the others.

If you had a computer with several central processing units, you could imagine that you could give the first processor the 9 to take the square root of, the second one the 100, and the third one the 49. They could then work independently and at the same time. Now most of us are working on single CPU systems and this does not happen¹ when we use R. Still, the fact that the computations are independent is conceptually important. Some languages, such as Fortran 90/95 and Sisal, have special constructions for this sort of problem. In principle, if you can promise the computer that the computations do not interact with each other, the machine is free to do the calculations in the manner most efficient for it.

The idea of applying a function independently to a collection of values to produce the corresponding collection of results (what we're calling the map concept may be implemented in many different ways. There are several different collections we will learn about, and there are several different ways that the repeated computation may be achieved. The map concept is a programming idiom, a solution to the problem of how to do certain computations many times. It is not a program, a function, or a construct specific to any particular language.

There are several ways to implement the map concept in R, and other languages provide different features. In R, we shall see that the use of vectorized functions is the most common, or at least the most natural, way of achieving the map concept.

Vectors

The most fundamental collection (or type of plural noun, so to speak) in R is the vector. The vector in R has many capabilities that we will learn over time. Fundamentally, an R vector is a homogeneous collection of data values that is addressable by number. This illustrates a vector of seven elements:

1



2



3



4



5



6



7

In R, a vector is a homogeneous collection. We can have a numeric vector, or we can have a character string vector, or we can have a boolean vector. There are a few others as well. But we can't have a vector that is of mixed type. All the elements have to be of the same type. And by the way, functions are one type of data that R does not make vectors of.

It is also important to realize that the elements of a vector are addressable in sequence, and there are no gaps. If you have an element one, and a fourth element, then you also must have a second and a third element.

Enough talk. Let's see some actual vectors. What if I want a vector with a 5 in the first cell, a 28 in the second cell, and a -3.2 in the third? Or what if I want a "CA" in the first cell, an "NV" in the second, an "AZ" in the third, and an "OR" in the fourth? How can I make my own vectors to be anything I want them to be?

The function c() collects its arguments together and constructs a vector from them:

> new.vector <- c(5,28,-3.2)
> new.vector
[1] 5.0 28.0 -3.2
> is.vector(new.vector)
[1] TRUE
> is.numeric(new.vector)
[1] TRUE
>

This constructed a new vector, which we called new.vector, out of the three arguments. Of course, we could have named the new vector anything at all, and normally we want to choose a name that conveys a meaning to a human reader.

> systolic.bp <- c(130,128,141,121,108)

Here, you see that the function c() can take any number of arguments; they are all collected together into a vector. Since in the previous two examples the arguments were numeric, c() constructed us a numeric vector.

But we can also collect character (string, text) data into vectors as well:

> states<-c("CA","NV","AZ","OR")
> is.vector(states)
[1] TRUE
> is.numeric(states)
[1] FALSE
> is.character(states)
[1] TRUE

So the function c() produced a vector, but the vector is not numeric. It is a character vector.

We can also have boolean (logical) vectors:

> bool.vector <- c(TRUE,FALSE,TRUE)
> is.vector(bool.vector)
[1] TRUE
> is.numeric(bool.vector)
[1] FALSE
> is.character(bool.vector)
[1] FALSE
> is.logical(bool.vector)
[1] TRUE

Vectors can have only one type of data in them, in R. The function c() will try to construct a vector of one type out of its arguments. If all the arguments are numeric, then the vector is numeric. If all the arguments are logical/boolean, then the vector is logical/boolean. If all the arguments are character, then the vector is character. But if some arguments are numeric, and others character, the numeric arguments get converted to character, and you get a character vector. If some arguments are boolean and others numeric, the boolean TRUE becomes 1 and the FALSE becomes 0, and you get a numeric vector. If some arguments are boolean and others character, the boolean TRUE becomes "TRUE" and FALSE becomes "FALSE", and you get a character vector. Vectors have to have only one type of data, and the function c() will coerce values to other types to make this happen. Here are some examples:

> c(1,"aa")
[1] "1" "aa"
> c(4,TRUE)
[1] 4 1
> c("aa",TRUE)
[1] "aa" "TRUE"
> c(4,"aa",TRUE)
[1] "4" "aa" "TRUE"
>

Here is another useful function which creates a vector. In this case, we use the colon operator : to produce specific vectors, namely sequences of values:

> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
>

The colon operator has produced the sequence of values starting at 1 and ending at 10. But we can start anywhere and end anywhere:

> 39:2
[1] 39 38 37 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15
[26] 14 13 12 11 10 9 8 7 6 5 4 3 2
>

When a vector is long, R breaks up the result over many lines. Since the first line starts with the first value, R displays a [1]; the second line begins with the 26th value, and R begins it with a [26] to orient you to what you are seeing. Now you know what the [1] you've been seeing from time to time means. We have seen that the colon operator produces a sequence from beginning to end (by 1), and that this sequence is represented by a vector. There are other ways to create a vector, but the colon is one simple function that returns a vector.

Let's assign a vector to a variable:

> a.new.sequence <- 4:9
> a.new.sequence
[1] 4 5 6 7 8 9
> length(a.new.sequence)
[1] 6
>

Here, we created a sequence using the colon operator, and saved it in the variable called a.new.sequence. We used a new function called length to ask R how long it is, i.e. how many elements it has in it. As you can see, the vector has six elements in it; the starting and ending values differ by 5, so you can think of it as a fence that is five units long, with each number being a fencepost. With five units or lengths of fence, you need six fenceposts, so the vector is of length six².

Let's look at the use of is.vector, a function that will tell us if something is a vector:

> a.new.sequence <- 4:8
> a.new.sequence
[1] 4 5 6 7 8
> is.vector(a.new.sequence)
[1] TRUE
> is.vector(7)
[1] TRUE
> is.vector(TRUE)
[1] TRUE
> is.vector("CA")
[1] TRUE
> is.vector(sqrt)
[1] FALSE
>

This is rather interesting. Notice that a.new.sequence produces a TRUE when is.vector is applied. The function is.vector produces TRUE when its argument is a vector. But so does 7: the single number 7 is in reality a vector with length one. The same is true for the boolean literal TRUE: it is interpreted as a boolean vector of length one. And similarly for the string literal "CA": it is also a vector of length one. These are numeric, boolean, and character string vectors (respectively) to be sure, but they are all vectors. And now you see why the [1] is printed so often: when you type 2+2, the answer is actually a vector, and the answer is printed like every other vector - with the starting point of each line of data indicated in brackets at the beginning. Even the result of is.vector, a TRUE or FALSE, is a boolean vector of length one and the [1] is duly printed.

But notice that is.vector returned FALSE for the sqrt function. Functions can't be put into vectors in R, and sqrt is an object which is NOT a vector of length one. No function in R is a vector.

So wait a minute. If I just said 2+2 produces a vector of length one:

> answer <- 2+2
> length(answer)
[1] 1
> is.vector(answer)
[1] TRUE
>

then the plus sign returns a vector as the answer. And if 4 is a vector, what about the 2's that we added together. Aren't they vectors? In fact, they are. The arithmetic operators in R are automatically designed to handle vectors of numbers and are said to be vectorized.

> answer <- (1:10)+(4:13)
> is.vector(answer)
[1] TRUE
> answer
[1] 5 7 9 11 13 15 17 19 21 23
>

The system cheerfully added the elements of each sequence elementwise. There is much more to be said about R's vectorized operators and functions, and as we shall see, possibly the most important way to realize the map concept (parallel repetition) in R is by the use of vectorized functions³.

The next thing we need to learn is how to reference individual elements of a vector. This is done with the bracket notation:

> systolic.bp <- c(130,128,141,121,108)
> systolic.bp[1]
[1] 130
> systolic.bp[4]
[1] 121
> systolic.bp[8]
[1] NA

Here, we type the name of the vector followed by an open bracket, the number of the element we wish to refer to, and then a close bracket. The system then returns the element of interest. In the example, we first asked for element one, which is 130. Then we asked for element 4, which is 121. (Note that R addresses elements by number starting with 1, rather than with offsets (starting with 0), like C does.) When we asked for element 8, we got NA, a missing value code standing for not available; we will say more about the missing value code later.

We can use the bracket notation to assign values as well:

> u <- 1:10
> u[2] <- 23
> u
[1] 1 23 3 4 5 6 7 8 9 10
>

What if we assign to an element that does not exist? Let's begin with a vector of length five, and for definiteness the first element will be a nine, the second an eight, and so forth. Then I am going to try to assign the value 100 to the sixth element.

> u <- 9:5
> u[6] <- 100
> u
[1] 9 8 7 6 5 100
>

There was no sixth element, but R created one for you and put the value 100 in it. Some languages don't allow this sort of thing, but R vectors are expandable in size. R will expand a vector if you ask it to assign to elements that have not yet been created.

What if we do the same example, but assign to element seven? We have a vector of length 5, and so there is no sixth element and no seventh. We're jumping past the sixth element and assigning to position 7:

> u <- 9:5
> u[7] <- 100
> u
[1] 9 8 7 6 5 NA 100
>

Here, the needed seventh position is created, and you can't have a seventh position without a sixth. There is nothing to put in the sixth position, which is simply filled with the missing value code NA. The seventh position (element 7) is filled with the numeric value 100, as requested.

With character vectors, gaps like this are filled with empty strings:

> u <- c("D","P","H")
> u[5] <- "F"
> u
[1] "D" "P" "H" "" "F"
>

What if you try to assign a character to an element in a numeric vector?

> u <- c(1,2,5)
> u[5] <- "F"
> u
[1] "1" "2" "5" "" "F"
>

Vectors can only contain one type of data, so the numeric values in the original data set were coerced to characters. Then, the character vector was expanded, undefined locations being filled with empty strings.

Question: explain the following behavior:

> u <- c(1,2,5)
> u[5] <- 8
> u
[1] 1 2 5 NA 8
> u[7] <- "F"
[1] "1" "2" "5" "NA" "8" "" "F"
>

Question: how are undefined elements in boolean vectors filled as you expand beyond the end of the vector? Make up a simple boolean vector of length 3, and then assign to the fifth element. What is the fourth element of the result? Then assign the sixth element with a numeric value. What are the other elements coerced to? Go back and make up a new boolean vector of length three, and then assign element 4 to be a character string. Now what are the boolean values coerced to when the vector is changed to a character vector?

Using vectorized functions and operators

Let's begin by taking the square root of different numbers. We will place the values into a collection, and apply a function to each element in the collection, resulting in a corresponding collection of results. Recall that we call this idea the map concept. In R, this concept is frequently implemented with a vectorized function, i.e. a function which automatically operates on vectors. The function sqrt will take the square root of every number in a vector of numbers, and return a vector of the answers. Only once is sqrt called, though internally many square roots are computed.

> sqrt(c(1,2,3,4))
[1] 1.000000 1.414214 1.732051 2.000000
> sqrt(1:4)
[1] 1.000000 1.414214 1.732051 2.000000
>

Other mathematical functions in R behave in a similar way:

> abs(c(-1,1.8,-0.5,2,-80.3,-1e-7))
[1] 1.00e+00 1.80e+00 5.00e-01 2.00e+00 8.03e+01 1.00e-07
>

Here, I had a collection of numeric values. The function abs was applied to the collection, and it took the absolute value of all of them and returned the collection of absolute values. In this case, the collection was an R vector, and the vectorized function abs returns a vector of corresponding absolute values.

And the logarithm function is vectorized too:

> log(c(1,2.7182818,10))
[1] 0.000000 1.000000 2.302585
> vals <- c(1,2.7182818,10)
> log(vals)
[1] 0.000000 1.000000 2.302585
>

Remember that when called with one argument, the log function computes the natural logarithm of its argument. When that single argument is a vector, then you get a vector of the corresponding logarithms. Only one single call to log occurs, and you see an example of what we're calling the map concept.

It is of fundamental importance to understand that in the logarithm example above, the function was called with only a single argument. That argument was a vector produced by the function c; this vector is a compound object, a collection containing more than one value. But it is still a single object and it constitutes a single argument for the log function. You can see this more clearly in the second part of the example, where we assign the vector to the variable vals and call the log function with vals as the single argument.

We alluded earlier to the fact that the basic arithmetic functions were vectorized. Let's look again at some examples. We'll first define two example vectors, xx and yy, and then we'll look at arithmetic operations with them.

> xx <- c(8,5,9,10,2,-7,-2.2)
> yy <- c(1,5,-6,0,-8,3,-1)
> xx+yy
[1] 9.0 10.0 3.0 10.0 -6.0 -4.0 -3.2
> xx-yy
[1] 7.0 0.0 15.0 10.0 10.0 -10.0 -1.2
> xx*yy
[1] 8.0 25.0 -54.0 0.0 -16.0 -21.0 2.2
> xx/yy
[1] 8.000000 1.000000 -1.500000 Inf -0.250000 -2.333333 2.20000
>

Each of these operates in an elementwise manner. When two vectors of the same length are added together by the plus operator, the first element of one vector is added to the first element of the other, producing the first element of the result. Then, the second element of one vector is added to the second element of the other, producing the second element of the result. This continues until the last element of the first vector is added to the last element of the second, producing the last element of the result. This is true vector addition as defined by the mathematicians. And note that each of the fundamental arithmetic operators works the same way.

It is important to pay special attention to the behavior of * and /. When applied to two vectors of equal length, the * operator produces an elementwise (or Hadamard) product, not any other product you may have heard of (though we'll soon enough see how to compute other products).

Let's see what happens when you place a vector on one side of an arithmetic operator and a single number (a vector of length one) on the other side:

> xx <- 1:7
> 2*xx
[1] 2 4 6 8 10 12 14
> xx/2
[1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5
> xx
[1] 1 2 3 4 5 6 7
>

These multiply or divide all the elements in a vector (in the example, the vector xx) by a constant (in the example, 2). This is ordinary scalar multiplication as the mathematicians define it. Of course, the xx itself is unchanged after the operation and this is the semantics you would expect. We took the xx and we took a constant, and we computed a new result from it (the product, or quotient). The new result was printed to the screen, though of course we could have assigned it to a variable and used it for something else.

We should also say that you can have the single number in the numerator and the long vector in the denominator:

> xx <- 1:7
> 4/xx
[1] 4.0000000 2.0000000 1.3333333 1.0000000 0.8000000 0.6666667 0.5714286
>

The system has taken each element in the vector xx, divided 4 by it, and returned the vector of these quotients.

When the vectors are of different lengths, the arithmetic operators repeat the shorter one until they get a vector that is as long as the longer one, and then the elementwise operation is performed. This is what is happening in the examples we just did. It is interesting to experiment with the behavior of the system using different examples, but we will come back to this in more detail later.

Let's now look back at the addition and subtraction operators.

> xx <- 1:7
> xx+2
[1] 3 4 5 6 7 8 9
> xx-5
[1] -4 -3 -2 -1 0 1 2
> 10-xx
[1] 9 8 7 6 5 4 3
> -1-xx
[1] -2 -3 -4 -5 -6 -7 -8
>

Notice that xx+2 added two to each element of xx and produced a vector of the corresponding sums. This is another example of what we're calling the map concept, as are all the examples we've seen so far.

Now, let's look back at our definition of the logit function. Remember that we defined it in R as:

> logit<-function(pp){log(pp/(1-pp))}
>

Suppose that the argument pp is not a single number, but instead a vector whose length is greater than one. Remember that a single number is just a vector of length one, and we know what this function will do. Now if pp is of length greater than one, what happens? First, the subtraction 1-pp is going to happen: the first element of pp is subtracted from one, producing the first element of a result. Then the second element of pp is subtracted from one, producing the second element of the result, and so forth, until a new vector 1-pp has been computed element by element. The next thing that is going to happen is the divison pp/(1-pp). When the division is to be done, we have already computed 1-pp. The division is going to happen elementwise and the result is going to be a vector of the same length as pp (and the same length as 1-pp). The first element of the new vector is going to be pp[1] divided by 1-pp[1], and so on. This produces the vector of odds from the vector of probabilities or proportions, pp. Finally, the logarithm function will receive this vector of odds as its single argument, and as you have seen, will compute the corresponding vector of (natural) logarithms. Since this vector is the last (in fact the only) expression evaluated in the body of the function, it is returned as the result of the function call. We have computed the logit of each element in a vector and returned the vector of logits; our logit function is vectorized (with no extra effort on our part) simply because we wrote it using R's vectorized arithmetic operators and mathematical functions.

Wouldn't it be nice to calculate the logit of a lot of values between 0 and 1 and plot them all? Let's take a moment to learn how to do this. This will introduce a new useful function, and it will give us a chance to learn how to produce a plot.

First, let us learn how to produce a sequence of evenly spaced values, collected together in a vector. We already know the sequence operator, so we could get ten evenly spaced values by using the colon operator and elementwise division:

> (1:9)/10
[1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
>

To make a nice graph, we might want to pick values closer together. We could start at 0.01 and go up on steps of 0.01 by computing (1:99)/100. For convenience, R provides the function seq to create such sequences. We will start at 0.01, go through 0.99, in steps of 0.01:

> xx <- seq(0.01,0.99,by=0.01)
>

The first argument of seq is the starting value, the second argument the ending value, and the by=0.01 instructs R to produce values that are separated by 0.01. We have seen this construction in function calls before, but we have not said much about it (we will soon). How long is the new vector xx that we just constructed using seq? You know that it must have 99 elements in it. We stayed away from 0 and 1 at the ends so we won't wind up dividing by zero or taking the logarithm of zero in the logit function.

Once we have this vector of values to take the logit of, we can take the logit of them all at once because as we have seen, our logit function is vectorized. We are repeating a computation by applying a function to a vector of values and calculating a vector of corresponding logits in an instance of the map concept. The calculation is done with a single call to the vectorized function logit. We no longer need to use duplication to do the job of repetition:

> xx <- seq(0.01,0.99,by=0.01)
> yy <- logit(xx)
>

Now, I want to show you how to do a simple line plot. We will use the plot function. The plot function has many options and can do a very great deal, and we're just going to scratch the surface today. We're going to give it three things: the vector xx to be plotted on the horizontal axis, the vector yy to be plotted on the vertical axis, and finally an instruction to produce a line plot (to join the points up with a line). Here it all is:

> xx <- seq(0.01,0.99,by=0.01)
> yy <- logit(xx)
> plot(xx,yy,type="l")
>

Here is what the plot looks like:

Notice that the system labeled the x and y axes with xx and yy, the names of the arguments of the function plot. As we will see later, you can label the axes any way you wish.

Random Numbers

Now that we know about vectors, it's a good time to introduce some of R's features for statistical computation. We will begin by learning how to generate a collection of values from the normal distribution. Recall that the normal distribution describes or at least approximates a variety of types of observations, typically when the value is the result of many tiny random and independent effects. For instance, if someone fires a bullet at a target, there are lots of tiny air currents, muscle twitches or other vibrations in the person who aims it, irregularities in the bore of the gun the bullet must travel through, irregularities in the surface of the projectile that create differences in the airflow around it, and so on. In public health, the height of a person may be influenced by many genetic and environmental factors. The normal distribution has played a central role in biometrics and biostatistics since the work of the pioneers Galton, Pearson, Gossett, and Fisher and its applications are almost limitless. All of you have seen the normal distribution, sometimes called a Gaussian distribution, and occasionally called a "bell curve" (though there are lots of distributions that have a bell-shaped curve).

Recall that a normal distribution is characterized by a mean and a standard deviation. The mean is a measure of the central tendency of the observations, and the standard deviation a measure of the spread of the data or observations around the mean. For today, I want to show you how to simulate systolic blood pressure measurements from a population with a mean systolic blood pressure of 130 millimeters of mercury and a standard deviation of 5 millimeters of mercury. I wish to simulate 100 such individuals; in other words, I want a vector of 100 random observations taken from a normal distribution with mean 130 and standard deviation 5. The R function rnorm accomplishes this:

> bp <- rnorm(100, mean=130, sd=5)
> length(bp)
[1] 100

Then let's plot a histogram of these newly-generated simulated data points using the R function hist, and let's have 10 classes or bins in the histogram:

> bp <- rnorm(100, mean=130, sd=5)
> hist(bp, nclass=10)
>

The result is shown below. Notice that R has labeled the histogram with the name bp, and this too can be changed once we see how.

These data, summarized in the histogram, are randomly generated, and if you ask R for a different set of random numbers, you will get a different histogram, as we will demonstrate in class.

Next time we will move further with our study of vectors. We will learn about operations that work on whole vectors to produce numbers, such as mean, sum, median, and var. We will also learn about how the comparison operators such as < and == are vectorized. We will learn about the vector properties of the boolean operators such as & and |, and how these can be used to perform fundamental database operations in R vectors such as selecting data values that exceed a certain cutoff, or removing missing values from a data vector. Finally, we will see how R represents tables (multidimensional arrays).

¹ The parallel processing revolution is one of those things that always seems to be right around the corner. But actually, parallel computation is done every day. See Numerical Recipes in Fortran 90: The Art of Parallel Scientific Computing by W. H. Press, S. A. Teukolsky, W. T. Vetterling and B. P. Flannery, Cambridge University Press, 1999.

² This fencepost example is actually widespread among programmers. Quick: if you do a computation starting with 5 and ending at 3709, how many computations do you do? The answer is 3709-5+1, or 3705. Sometimes people forget to add the one in at the end (so they would say 3704, which is worng). This is called a fencepost error.

³ In many languages, it is necessary to use a function that might be called map or something like map to automatically apply a particular function separately to each element of a collection to produce a collection of results. Such higher-order functions realize or implement the idea we're calling the map concept just as well as R's vectorized functions and operators (though more function calls may be needed). In fact, such techniques are useful from time to time in R, though vectorized functions are in a sense the most native idiom of R (and of course its predecessor language, S).