Last time we discussed the map concept, by which we meant the application of a function to each member of a collection of inputs, yielding a corresponding collection of results. In particular, we looked at a particular type of collection, namely, the numerically-ordered homogeneous collection known in R as a vector. We learned to construct vectors using the function c(), and we also learned that vectors containing sequences can be produced using the colon operator. We learned that the elements of a vector can be addressed by number using the square brackets, and we learned that the important arithmetic operations in R are vectorized.
Today we are going to learn more about R vectors. We will learn more about constructing vectors using c(), as well as some more services provided by the square brackets. We will learn how to select objects from a collection that meet some criterion, which we will call the filter concept. We will learn how R represents tables and data sets. Finally, we will learn about operations such as sum that work on whole vectors of data.
Last time we saw that the arithmetic operations are vectorized in R. The comparison operators are too:
> xx <- c(1.1,4,-3.4,-9)
> xx < 2
[1]  TRUE FALSE  TRUE  TRUE
All the comparison operators are vectorized:
> xx <- seq(1,3,by=0.5)
> xx != 2
[1]  TRUE  TRUE FALSE  TRUE  TRUE
The net result of a comparison like xx < 2 is to produce a boolean/logical vector of the same length as xx, each element of which is TRUE if the corresponding element of xx is less than 2, and FALSE otherwise. It is as though we had a function which returned TRUE if its argument were less than two and FALSE otherwise, and applied that function to the elements of xx, producing a collection of corresponding results. So just as the vectorized arithmetic operators could be said to implement the map concept, so the vectorized comparison operators implement the map concept as well.
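To make the connection with the map concept concrete, here is a small sketch that applies an explicit TRUE/FALSE function to each element using sapply(), and then checks that the result agrees with the vectorized comparison. The function name less.than.two is made up for this illustration only.

```r
# Hypothetical helper, written only for this illustration
less.than.two <- function(x) x < 2

xx <- c(1.1, 4, -3.4, -9)

# Map: apply the function to each element, one at a time
mapped <- sapply(xx, less.than.two)

# The vectorized comparison does the same work in one step
vectorized <- xx < 2

identical(mapped, vectorized)   # TRUE
```

In practice you would just write xx < 2; the longer version is only there to show what the short one is doing for you.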
Of course, the comparison operators work well on vectors of equal length too:
> xx <- seq(1,3,by=0.5)
> xx >= c(1,1,2,-1,-5)
[1] TRUE TRUE TRUE TRUE TRUE
The basic boolean/logical operators we talked about are vectorized too. Specifically, & (and), | (or), and ! (not), work on whole vectors and produce vectors of corresponding results.
So for example, we could test whether some values are between zero and one:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx > 0
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> xx < 1
[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
> xx > 0 & xx < 1
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
It does NOT work to type 0 < xx < 1 the way you do in mathematics. Do each comparison separately and use a boolean operator to combine the results. So 0 < xx & xx < 1 is OK. Of course, this is written slightly differently from the example I just did, but it is the same test. Remember that the opposite of < is >=, and the opposite of > is <=. Forgetting this is an occasional cause of logic errors in programs. Of course, we should never be relying on equality comparisons in the case of real numbers anyway.
Of course, we could have done the example in the previous table in a different way. Let's find values that are not between zero and one. We could either compute !(xx > 0 & xx < 1), or we could find elements that are either less than or equal to zero, or greater than or equal to one. In other words, not (p and q) is the same thing as (not p) or (not q). If someone is not coinfected with HIV and Hepatitis C, then either they aren't infected with HIV, or they're not infected with Hepatitis C, or both. This is one of the de Morgan rules in logic, and it comes in handy in simplifying logical expressions. Complicated logical expressions with lots of parentheses can get difficult to read, and this increases the chances for error. Simplify, simplify, simplify!¹ So let's look at the de Morgan rule in action:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx <= 0
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> xx >= 1
[1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> xx <=0 | xx >= 1
[1]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE
> !(xx <=0 | xx >= 1)
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
There is another de Morgan rule. It turns out that not (p or q) is the same as (not p) and (not q). So if your coffee is not warm or fresh (and according to the conventions of English, this means not (warm or fresh)), then you know (1) it is not warm, and also (2) it is not fresh. It is not warm and it is not fresh. As an exercise, make up a simple R example, similar to the table just above, to illustrate this de Morgan rule. Make sure you understand how the elementwise operation of the Boolean operators &, |, and ! works, and make sure you understand the use of grouping parentheses in Boolean expressions.
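One way to convince yourself of both rules is to try them on all four combinations of TRUE and FALSE. The sketch below does that with two short boolean vectors; the names p and q simply follow the notation in the text, and the last line is there to show why the grouping parentheses matter.

```r
# All four combinations of two logical values
p <- c(TRUE, TRUE, FALSE, FALSE)
q <- c(TRUE, FALSE, TRUE, FALSE)

# First rule: not (p and q) equals (not p) or (not q)
rule.one <- identical(!(p & q), !p | !q)
rule.one    # TRUE

# Second rule: not (p or q) equals (not p) and (not q)
rule.two <- identical(!(p | q), !p & !q)
rule.two    # TRUE

# Without parentheses, ! applies only to p: this is (not p) and q,
# which is not the same thing as not (p and q)
no.parens <- !p & q
no.parens   # FALSE FALSE TRUE FALSE
```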
Now that we've reviewed the map concept and talked more about vectorized operations, I'd like to move on and talk about the filter concept: selecting all elements from a collection that meet a criterion, and producing a collection of the selected results. In mathematics this is represented by subset notation (when the collections are sets). But here, we're looking not at abstract sets, but concrete collections represented on the computer. And the only collection we've looked at so far is the vector. So we will need to be able to apply a criterion to each element of a vector and produce a vector of results that meet the criterion. But I want to remind you that the filter concept or pattern is just an idea; it can be realized in different ways for different kinds of collections or criteria. It is something you use to help organize your thoughts and your designs. We will see first the filter idea applied to vectors, but it can be applied in many other ways as well.
Now applying a criterion to each element of a collection is just a map. We apply a function that returns TRUE or FALSE to each element of the collection and produce the collection of results. And you've seen that the vectorized comparison operators do this for us. We don't have to actually write our own boolean function and call it once for each element, because the comparison operators work on whole vectors of data and produce whole vectors of results. So with our comparison operators and our boolean operators, we should be able to represent any criterion we want. (I haven't shown you any string comparison functions, but you'll see them soon.)
But how do we select items from a vector? We already know how to select items based on their position in a vector. If we want the second element in the vector xx, we would use the number 2 inside square brackets: xx[2]. If we want the third element, then we would type xx[3], and so forth. But what we need here is something quite different. I want to pick an element from some vector if some condition is true.
It turns out that the square brackets will do this for us. We've already seen what happens when you use numbers between the square brackets (i.e. numeric subscripts). (Of course you remember too that when you write xx[3] to get the third element, the three is itself a numeric vector.) To select elements based on a TRUE/FALSE criterion, we will use a boolean vector between the square brackets. Each element of this boolean vector will be TRUE if we want the corresponding element of the first vector, and FALSE if we don't.
> yy <- c(-1,0,1,999)
> yy[c(FALSE,TRUE,TRUE,FALSE)]
[1] 0 1
We happened to just enter the subscript vector right on site, by constructing it from boolean literals on the spot using c() as you learned last time. We could have saved the boolean vector to a variable if we had wished:
> yy <- c(-1,0,1,999)
> choices <- c(FALSE,TRUE,TRUE,FALSE)
> yy[choices]
[1] 0 1
Remember one of the nice things about R is that you can usually use an expression that yields a result wherever you could use that result directly. What's really convenient is to just do a comparison right on the spot, right in the brackets, that yields the boolean vector you want. So let's go back to the example I was looking at a moment ago. We have a vector xx of numeric values, and suppose we'd like to select out all the elements between zero and one:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> choices <- xx > 0 & xx < 1
> choices
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
> xx[choices]
[1] 0.100 0.200 0.999
> xx[xx > 0 & xx < 1]
[1] 0.100 0.200 0.999
When you use a boolean vector as a subscript, it should be the same length as the vector you're subscripting because this is usually what makes sense. If the boolean vector is too short, it gets duplicated until it is long enough. So you can select the odd-numbered elements from xx by using xx[c(TRUE,FALSE)]; the boolean subscript is only of length two, so it gets duplicated until it is long enough. Then it is used as the subscript.
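Here is a short sketch of that duplication at work, using the same xx as before. The variable names odd.positions and even.positions are my own, and the last line checks the result against a numeric subscript built with seq().

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# The length-two subscript is duplicated: TRUE FALSE TRUE FALSE ...
odd.positions <- xx[c(TRUE, FALSE)]
odd.positions    # elements 1, 3, 5, 7, 9

# Reversing the pattern selects the even-numbered positions instead
even.positions <- xx[c(FALSE, TRUE)]
even.positions   # elements 2, 4, 6, 8

# The same selection written with explicit positions
identical(odd.positions, xx[seq(1, length(xx), by = 2)])   # TRUE
```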
The subscript can be produced by anything at all; I can select elements of a vector such as xx based on another vector. For instance:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> zz <- 0:8
> xx[zz < 4]
[1] -1.0 -0.5  0.0  0.1
So when working with vectors, we can realize the filter concept by using boolean subscripts on the vectors.
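Since the filter concept is just an idea, it may help to see it realized in more than one way. The sketch below compares the boolean subscript with base R's Filter() function, which takes a TRUE/FALSE function and a collection. The function name between.zero.and.one is invented for this illustration.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Filter by boolean subscript, as in the text
by.subscript <- xx[xx > 0 & xx < 1]

# The same filter using Filter(), which applies a TRUE/FALSE function;
# between.zero.and.one is a made-up name for this illustration
between.zero.and.one <- function(x) x > 0 & x < 1
by.function <- Filter(between.zero.and.one, xx)

identical(by.subscript, by.function)   # TRUE

# which() reports the positions that met the criterion
positions <- which(xx > 0 & xx < 1)
positions   # 4 5 6
```

For plain vectors the boolean subscript is usually the shorter and more natural choice.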
Last time we learned that we could construct arbitrary vectors using c():
> xx <- c(5,28,-3.2)
> xx
[1]  5.0 28.0 -3.2
But it is also possible to use c() to join vectors together, or add elements.
> xx <- c(5,28,-3.2)
> yy <- c(xx,50)
> yy
[1]  5.0 28.0 -3.2 50.0
> xx
[1]  5.0 28.0 -3.2
> zz <- c(8,2.2,-1)
> c(zz,xx)
[1]  8.0  2.2 -1.0  5.0 28.0 -3.2
> zz
[1]  8.0  2.2 -1.0
> xx
[1]  5.0 28.0 -3.2
There is nothing wrong with repeating a vector in a call to c() too. This repeats the vector:
> xx <- c(5,28,-3.2)
> yy <- c(xx, xx, xx)
> yy
[1]  5.0 28.0 -3.2  5.0 28.0 -3.2  5.0 28.0 -3.2
Repeating the elements of a vector is occasionally useful, so R comes with a built-in function for this. Most common is to simply repeat a single value, such as 0 or 1, but it can repeat any vector. The first argument is the vector you want repeated, and the second argument is the number of times you want it repeated. The function rep works for character and boolean values as well.
> ones <- rep(1,5)
> ones
[1] 1 1 1 1 1
> yy <- rep(c(2,5),3)
> yy
[1] 2 5 2 5 2 5
> zz <- rep("CA",3)
> zz
[1] "CA" "CA" "CA"
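A few more calls to rep() may be useful to try. This sketch checks that rep() gives the same result as repeating a vector by hand in c(), and shows the each argument, which repeats every element in turn rather than the vector as a whole. The variable names are chosen only for this example.

```r
xx <- c(5, 28, -3.2)

# rep() gives the same result as writing the vector out three times in c()
identical(rep(xx, 3), c(xx, xx, xx))   # TRUE

# Repeat the whole vector three times
whole <- rep(c(2, 5), times = 3)
whole     # 2 5 2 5 2 5

# Repeat each element three times before moving on to the next
by.each <- rep(c(2, 5), each = 3)
by.each   # 2 2 2 5 5 5

# rep() works on boolean values as well
flags <- rep(c(TRUE, FALSE), 2)
flags     # TRUE FALSE TRUE FALSE
```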
We've seen how central boolean vector subscripts can be. I'd like to return for a moment though to numeric subscripts, to look at some further features.
Recall that we use the notation xx[3] to select the third element of the vector xx. And as we reminded ourselves, the 3 is itself a perfectly good vector, of length one. It is reasonable to ask what would happen if we used a longer numeric vector as a subscript. Let's try it:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx[c(2,4,20,2)]
[1] -0.5  0.1   NA -0.5
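To spell out what a longer numeric subscript does, here is a small sketch with the same xx. It shows that positions may be repeated, that a position past the end of the vector gives the missing value NA rather than an error, and that the colon operator can supply a run of positions. The variable names are mine.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Positions may be repeated and given in any order
repeated <- xx[c(2, 4, 2)]
repeated       # -0.5  0.1 -0.5

# A position past the end of the vector yields NA rather than an error
past.end <- xx[20]
past.end       # NA

# The colon operator builds a run of positions
run <- xx[2:4]
run            # -0.5  0.0  0.1
```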
Another common operation is to just drop a few elements, by position. You already know how to select elements based on a boolean criterion. So we could exclude the fourth item like this:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx[(1:length(xx))!=4]
[1] -1.000 -0.500  0.000  0.200  0.999  1.000  1.500 10.000
But dropping a few elements by number is so useful that the R/S designers allow a convenient shortcut. It turns out that if you use a negative integer as a subscript, you can drop that element. There is nothing necessary about allowing this. But we're not doing anything else with negative subscripts, so this gives us a convenient way to do it. Some languages use negative subscripts to count from the other end of the vector, and other languages forbid the use of negative subscripts entirely. So here is the shortcut way to drop the fourth element:
> xx <- c(-1,-0.5,0,0.1,0.2,0.999,1.0,1.5,10)
> xx[-4]
[1] -1.000 -0.500  0.000  0.200  0.999  1.000  1.500 10.000
Because dropping elements by position and choosing elements by position are two different operations, it could be confusing to try to do both at the same time. R does not allow it. You cannot mix positive and negative subscripts without getting the error message Error: only 0's may mix with negative subscripts.
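The sketch below drops more than one element at a time by negating a whole vector of positions, and then shows the error you get from mixing signs. I use tryCatch() here only so that the example keeps running after the error; the string "error" it returns is my own placeholder, not R's message.

```r
xx <- c(-1, -0.5, 0, 0.1, 0.2, 0.999, 1.0, 1.5, 10)

# Drop several positions at once by negating a vector of positions
dropped <- xx[-c(2, 4)]
length(dropped)   # 7

# Mixing positive and negative subscripts is an error; tryCatch() captures
# it and returns a placeholder string instead of stopping
result <- tryCatch(xx[c(-1, 2)], error = function(e) "error")
result            # "error"
```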
Remember that if you drop elements or select elements, the position of the others will change. If you depend on numeric subscripts all the time, this could get you into trouble. If we count on the fact that the 10 is element 9 in the above vector xx, then we will be in trouble after we drop element 4, because now the 10 is in position 8. It would be nice to be able to tag items in a vector with identifiers that would stay with them even if other elements are dropped.
It is time to learn about another fundamental service provided by vectors in R. We have already seen how we can use numeric and boolean vectors as subscripts. But character strings can be used as subscripts too. Let's get started:
> xx <- c(2,5,3,1)
> xx["hiv"] <- 0
Since there was no item xx["hiv"] before the assignment, the assignment created a space for it. The next open space was element 5, so we created a fifth element. But this fifth element can be reached by name as well as by position:
> xx <- c(2,5,3,1)
> xx["hiv"] <- 0
> xx["hiv"]
hiv 
  0 
> xx[5]
hiv 
  0 
It is fundamental to realize that the string "hiv" is not a member of the vector xx in the previous example. The vector is numeric, and after the assignment it contains five elements, 2, 5, 3, 1, and 0. The first four of these elements can be accessed by numeric position only, because there are no names associated with them (or rather, their name is an empty string). The fifth element has a name as well as a position number.
The use of character strings allows R vectors to be used as an associative array or dictionary. A dictionary allows us to associate items with other items. R vectors allow you to associate numbers with strings, or strings with strings, or boolean values with strings.
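Here is a small sketch of the dictionary idea. It shows that a name stays with its element when another element is dropped, even though the position changes, and that strings can be associated with strings. The vector state.names and its contents are invented for this illustration.

```r
xx <- c(2, 5, 3, 1)
xx["hiv"] <- 0

# Dropping element 4 moves the last element from position 5 to position 4...
yy <- xx[-4]
names(yy)     # "" "" "" "hiv"

# ...but it can still be found under the same name
yy["hiv"]     # the value 0, labelled hiv

# Strings can be associated with strings too (made-up example data)
state.names <- c(CA = "California", NY = "New York")
state.names["NY"]
```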
You can directly access the vector of names by using the function names():
> xx <- c(2,5,3,1)
> xx["hiv"] <- 0
> names(xx)
[1] ""    ""    ""    ""    "hiv"
> names(xx) <- c("gender","risk group","hiv","drug use","hiv")
> xx
    gender risk group        hiv   drug use        hiv 
         2          5          3          1          0 
> xx["hiv"]
hiv 
  3 
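Notice that the name "hiv" appears twice in that example, and the name subscript returned only the first match. The sketch below, which rebuilds the same vector, shows one way to find every element with a given name and one way to check for repeated names. The variable names first.match and all.matches are mine.

```r
xx <- c(2, 5, 3, 1, 0)
names(xx) <- c("gender", "risk group", "hiv", "drug use", "hiv")

# A name subscript returns only the first element with that name
first.match <- xx["hiv"]
first.match    # the element in position 3

# A boolean subscript on names(xx) finds every element with that name
all.matches <- xx[names(xx) == "hiv"]
all.matches    # the elements in positions 3 and 5

# duplicated() flags repeated names so they can be caught early
any(duplicated(names(xx)))   # TRUE
```

If you intend to use a vector as a dictionary, it is probably wise to keep the names unique.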
When are names useful? Names like we've just seen are best used when the vector is being used to collect data values that have different meanings. For instance, we may have four numbers we mean to use in a risk computation. They are all numbers, and so they can be used in a vector. But one may mean the number of needle reuses, one may mean the prevalence, one may mean the transmission risk, and so forth. Semantically, the numbers are not interchangeable. So it is appropriate to use names. Here is how you can assign the names at the time you create the vector, using c():
> scenario <- c(needles=3681,reuses=70,prevalence=0.05,transmission.risk=0.058)
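Once the names are in place, the values can be looked up by meaning rather than by position. The sketch below only looks values up and rescales one of them; it is not the risk computation itself, and the percentage conversion is just an arbitrary bit of arithmetic to show a named value in use.

```r
scenario <- c(needles = 3681, reuses = 70, prevalence = 0.05,
              transmission.risk = 0.058)

# Look up a value by name rather than by position
scenario["prevalence"]

# Select several values by name at once
counts <- scenario[c("needles", "reuses")]
names(counts)    # "needles" "reuses"

# unname() strips the label when only the bare number is wanted
prevalence.percent <- unname(scenario["prevalence"]) * 100
prevalence.percent   # 5
```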
1 Thoreau, Walden