Last time we learned about R's list structure, which can be used to create a heterogeneous collection. We learned that we could create them using the constructor function list, and that we could access the elements by using either the double brackets [[]] or using the dollar sign $.
Let's just look at another example:
> first.list <- list(1,2,3)
> second.list <- list(4,5,6)
> third.list <- list("CA",c(4,5),list(4,"zz",TRUE))
> first.list
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> first.list + second.list
Error in first.list + second.list : non-numeric argument to binary operator
Here you see that you may assign to lists:
# Continued from above...
> first.list[[1]] <- c("aa","bb")
> first.list
[[1]]
[1] "aa" "bb"

[[2]]
[1] 2

[[3]]
[1] 3
We've discussed the map idea at length, namely, applying a function to a collection of values to produce a collection of corresponding results. Well, a list is another kind of collection, so how can we map a function over one of R's lists? The right way to do it is to use the function lapply, read "ell-apply":
# Continued from above...
> lapply(second.list,sqrt)
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.449490

# Continued from above...
> lapply(second.list,function(x){x^2})
[[1]]
[1] 16

[[2]]
[1] 25

[[3]]
[1] 36
Here is another example, in which we apply the length function to every member of the list third.list. The first member of third.list is a vector of length one, the second is a vector of length two, and the third is a list of length three, so the results are one, two, and three. The function length works for lists as well as vectors.
# Continued from above...
> lapply(third.list,length)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

> length(third.list)
[1] 3
Here is a wilder example:
> (function(f){f(2)})(sqrt)
[1] 1.414214
> lapply(list(sqrt,log,function(x){x^2}),function(f){f(2)})
[[1]]
[1] 1.414214

[[2]]
[1] 0.6931472

[[3]]
[1] 4
Of course, we could also work harder and do this sort of thing manually using a for loop. Let's create an empty list by calling list() with no arguments, and then add the elements one by one:
> a.list <- list(4,5,6)
> answer.list <- list()
> answer.list
list()
> for (ii in 1:length(a.list)) {
+   answer.list[[ii]] <- sqrt(a.list[[ii]])
+ }
> answer.list
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.449490
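As a rough check that the loop and lapply really do the same job, the sketch below wraps the loop in a small helper and compares the two results. The name sqrt.by.loop is hypothetical (it is not from the lecture), and the use of seq_along and vector("list", n) is a style assumption: seq_along gives an empty index for an empty list, whereas 1:length(x) would give c(1, 0).

```r
# Hypothetical helper (not from the lecture): the for-loop version of
# lapply(a.list, sqrt), written so that it also works for an empty list.
sqrt.by.loop <- function(a.list) {
  # Preallocate a list of the right length instead of growing it.
  answer.list <- vector("list", length(a.list))
  # seq_along(a.list) is empty for an empty list, so the body is skipped.
  for (ii in seq_along(a.list)) {
    answer.list[[ii]] <- sqrt(a.list[[ii]])
  }
  answer.list
}

a.list <- list(4, 5, 6)
loop.result <- sqrt.by.loop(a.list)
apply.result <- lapply(a.list, sqrt)

# Both approaches should produce the same list of square roots.
print(identical(loop.result, apply.result))
```

In everyday code lapply is usually the shorter and safer choice; the loop is shown only to make explicit what lapply does for you.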
Another useful function is called sapply. This is a little different, since sapply will return a vector of answers if it can. Let's do a few experiments to see how sapply works and how it differs from lapply:
> a.list <- list(4,5,6)
> sapply(a.list,sqrt)
[1] 2.000000 2.236068 2.449490
> lapply(a.list,sqrt)
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.449490

> a.list <- list(c("aa","bb"),c(TRUE,TRUE,FALSE),1:100)
> sapply(a.list,length)
[1]   2   3 100
> lapply(a.list,length)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 100
How about this? Let's get the dimensions of each table, out of a list of tables:
> a.list <- list(rbind(c(2,3),c(4,5)),rbind(c(1,1,2),c(3,4,5),c(3,4,7)))
> a.list
[[1]]
     [,1] [,2]
[1,]    2    3
[2,]    4    5

[[2]]
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    3    4    5
[3,]    3    4    7

# Here, a.list really is a list of tables.
> sapply(a.list,dim)
     [,1] [,2]
[1,]    2    3
[2,]    2    3
> lapply(a.list,dim)
[[1]]
[1] 2 2

[[2]]
[1] 3 3
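The phrase "if it can" deserves a small experiment. The sketch below (variable names are made up for illustration) suggests that sapply simplifies only when every result has the same length, and otherwise falls back to a list just like lapply. It also shows vapply, a stricter relative of sapply in base R that makes you state the expected type and length of each result.

```r
# Every result has length 2, so sapply can build a 2 x 2 matrix.
same.length <- sapply(list(1:3, 4:6), range)

# The results have lengths 1 and 6, so sapply returns a plain list.
diff.length <- sapply(list(1:3, 4:9), function(v) v[v > 2])

# vapply insists that each result is a single integer, or it stops.
a.list <- list(c("aa", "bb"), c(TRUE, TRUE, FALSE), 1:100)
lens <- vapply(a.list, length, integer(1))

print(is.matrix(same.length))
print(is.list(diff.length))
print(lens)
```

When the shape of the answer matters to later code, vapply may be the safer choice, since it never silently changes the kind of object it returns.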
The data frame is the workhorse data structure for statistical analysis in R (and S). While this class really emphasizes programming rather than statistical analysis, understanding the data frame is important if you are to make the most effective use of R. The manual page lists the data frame as the "fundamental data structure used by most of R's modeling software" (version 1.4.1, Linux).
Fortunately, the data frame is essentially like a table (array), which we have already seen, except that the data frame can contain elements of different types. Let's create a simple data set of states, populations (2001 Census estimate), and median household income (1999 estimate, US Census Bureau). We'll just do this for 5 states:
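> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> state.data
             abbrev population income
California       CA   34501130  47493
Nevada           NV    2106074  44581
Tennessee        TN    5740021  36360
Rhode Island     RI    1058920  42090
Alaska           AK     634892  51571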
# continued from above
> state.data$population
[1] 34501130  2106074  5740021  1058920   634892
> state.data$income
[1] 47493 44581 36360 42090 51571
> state.data["Alaska",]
       abbrev population income
Alaska     AK     634892  51571

# continued from above
> state.data[[2]]
[1] 34501130  2106074  5740021  1058920   634892

# continued from above
> state.data[2:3,]
          abbrev population income
Nevada        NV    2106074  44581
Tennessee     TN    5740021  36360

# continued from above
> state.data[state.data$population > 2000000,]
           abbrev population income
California     CA   34501130  47493
Nevada         NV    2106074  44581
Tennessee      TN    5740021  36360
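The row and column selections above can be combined. The following is a minimal sketch, reusing the state data from this section; the names high.income and by.pop are made up for illustration. The call to as.character is there because, depending on the version of R, a character column in a data frame may be stored as a factor.

```r
state <- c("California", "Nevada", "Tennessee", "Rhode Island", "Alaska")
abbrev <- c("CA", "NV", "TN", "RI", "AK")
population <- c(34501130, 2106074, 5740021, 1058920, 634892)
income <- c(47493, 44581, 36360, 42090, 51571)
state.data <- data.frame(abbrev, population, income, row.names = state)

# A logical row selector together with a column name:
# abbreviations of the states whose median income exceeds 45000.
high.income <- as.character(state.data[state.data$income > 45000, "abbrev"])

# order() returns the row positions sorted by population,
# so indexing with it reorders the rows from smallest to largest.
by.pop <- state.data[order(state.data$population), ]

print(high.income)
print(rownames(by.pop))
```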
You can also use the constructor data.frame to add variables to a data frame. Here, we will add a column of data representing the fraction of the population 65 and older in the year 2000:
> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> frac.over65 <- c(0.106,0.110,0.124,0.145,0.057)
> state.data.2 <- data.frame(state.data,frac.over65)
> state.data.2
             abbrev population income frac.over65
California       CA   34501130  47493       0.106
Nevada           NV    2106074  44581       0.110
Tennessee        TN    5740021  36360       0.124
Rhode Island     RI    1058920  42090       0.145
Alaska           AK     634892  51571       0.057
One simple way to drop a column from a data frame is to assign the column the value of NULL:
> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> frac.over65 <- c(0.106,0.110,0.124,0.145,0.057)
> state.data.2 <- data.frame(state.data,frac.over65)
> state.data.2$income <- NULL
> state.data.2
             abbrev population frac.over65
California       CA   34501130       0.106
Nevada           NV    2106074       0.110
Tennessee        TN    5740021       0.124
Rhode Island     RI    1058920       0.145
Alaska           AK     634892       0.057
> names(state.data.2)
[1] "abbrev"      "population"  "frac.over65"
NULL is a very special literal we have not discussed yet. The R language definition says that it is used "whenever there is a need to indicate or specify that an object is absent". To determine whether or not something is NULL, you may use the function is.null:
> #continuing from previous example:
> is.null(state.data)
[1] FALSE
> is.null(state.data$ufo)
[1] TRUE
> is.null(state.data$income)
[1] FALSE
> is.null(state.data.2$income)
[1] TRUE

Assigning NULL to an element of an ordinary vector, however, is an error:
> zz <- 1:10
> zz[4] <- NULL
Error in "[<-"(*tmp*, 4, value = NULL) : incompatible types
Execution halted
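Lists behave like data frames in this respect rather than like vectors: assigning NULL to a list component removes it. The sketch below uses made-up names and also shows one way to keep a slot while storing NULL in it, by assigning a one-element list that contains NULL.

```r
a.list <- list(aa = 1, bb = "two", cc = TRUE)

# Assigning NULL removes the component named bb entirely.
a.list$bb <- NULL
remaining <- names(a.list)

# Single brackets with list(NULL) keep the slot but store NULL in it.
a.list["aa"] <- list(NULL)

print(remaining)
print(length(a.list))
print(is.null(a.list[["aa"]]))
```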
To test whether something is a data frame or not, you may use the function is.data.frame:
# continued from above
> is.data.frame(state.data)
[1] TRUE
Normally, you read in a data frame from a file. The specific way this is done depends on whether you are on UNIX, Linux, Windows, or a Macintosh. The Windows GUI system has some special commands you will see in the lecture; here, I show you how to use the command line interface.
Suppose that you have the following information in a text file:
abbrev population income
CA     34501130   47493
NV     2106074    44581
TN     5740021    36360
RI     1058920    42090
AK     634892     51571
We will use the command read.table to read in the file:
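> states <- read.table("state.txt",header=TRUE)
> states
  abbrev population income
1     CA   34501130  47493
2     NV    2106074  44581
3     TN    5740021  36360
4     RI    1058920  42090
5     AK     634892  51571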
Another common format for data is the comma separated format. Suppose now we have the data as follows in a file called "state.csv":
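abbrev,population,income
CA,34501130,47493
NV,2106074,44581
TN,5740021,36360
RI,1058920,42090
AK,634892,51571

> states <- read.csv("state.csv",header=TRUE)
> states
  abbrev population income
1     CA   34501130  47493
2     NV    2106074  44581
3     TN    5740021  36360
4     RI    1058920  42090
5     AK     634892  51571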
These functions (read.table and read.csv) have considerably more features, some of which we will discuss later in this course.
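To try these functions without creating a file by hand, one option is to write the data to a temporary file first. The sketch below is only an illustration: the temporary path stands in for "state.csv", only two rows are used, and stringsAsFactors=FALSE is passed so that the abbreviations come back as character strings regardless of the version of R.

```r
# Write a small comma separated file to a temporary location.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("abbrev,population,income",
             "CA,34501130,47493",
             "NV,2106074,44581"),
           csv.path)

# read.csv assumes a comma separator and a header line.
states <- read.csv(csv.path, header = TRUE, stringsAsFactors = FALSE)

# read.table can read the same file if told that the separator is a comma.
states2 <- read.table(csv.path, header = TRUE, sep = ",",
                      stringsAsFactors = FALSE)

# Remove the temporary file now that the data are in memory.
unlink(csv.path)

print(states)
print(names(states2))
```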
It's time to tell the whole truth about subscripting vectors and lists. Remember vectors are homogeneous collections, and lists are not. Lists are recursive in that they can contain lists within themselves. We have learned the subscript operator [] (usually used for vectors), and [[]] (usually used for lists). The main difference is that [] can be used for subsetting operations (selecting more than one object from a collection), while [[]] can be used for hierarchical selection.
It turns out that you can, in fact, use [] for lists:
> a.list <- list(1:3,c("CA","AZ"),TRUE)
> a.list[1]
[[1]]
[1] 1 2 3

> a.list <- list(1:3,c("CA","AZ"),TRUE)
> a.list[c(3,1)]
[[1]]
[1] TRUE

[[2]]
[1] 1 2 3
And you can use [[]] for vectors as well. But [[]] needs to yield only a single element of a vector; a vector is not a hierarchical structure.
> zz <- 1:10
> zz[[2]]
[1] 2
> zz[[c(3,2)]]
Error: attempt to select more than one element
Execution halted
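For a list, by contrast, a vector inside [[]] is read as a path into the hierarchy. The sketch below reuses the nested list from the start of this section to illustrate the difference between subsetting with [] and hierarchical selection with [[]]; the variable names on the left are made up for illustration.

```r
third.list <- list("CA", c(4, 5), list(4, "zz", TRUE))

# [] subsets: the result is still a list, here of length one.
sub.list <- third.list[3]

# [[]] selects: the result is the third component itself.
inner <- third.list[[3]]

# A vector inside [[]] walks down the hierarchy:
# component 3 of third.list, then component 2 of that.
deep <- third.list[[c(3, 2)]]

print(length(sub.list))
print(length(inner))
print(deep)
```

Here third.list[[c(3,2)]] gives the same result as third.list[[3]][[2]].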
There are some other subtle differences:
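> zz <- c(aa=2,bb=3,uu=8)
> zz["aa"]
aa
 2
> zz[["bb"]]
[1] 3
> zz$uu
NULL
# Note: recent versions of R signal an error here instead,
# because $ is not valid for atomic vectors.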
The empty vector subscript is occasionally useful for setting all the elements of a vector to some value without destroying component names:
> zz <- c(aa=2,bb=3,uu=8)
> zz[]
aa bb uu
 2  3  8
> zz[] <- NA
> zz
aa bb uu
NA NA NA
We have already seen the use of the componentwise boolean operators & and | (and of course !). These operate on entire vectors of boolean values. Thus, we see
> c(TRUE,TRUE,FALSE,FALSE) & c(TRUE,FALSE,TRUE,FALSE)
[1]  TRUE FALSE FALSE FALSE
But for determining the logical flow of your program, we have seen that in an if statement, you must make a single decision (at least in a single-threaded environment such as the one we are using!).
Moreover, a very common pattern is to check a condition and then perform a computation only if that condition is met. For instance, suppose we have a function called inc that computes an incidence rate by dividing a number of cases by a number of person-years at risk, and that stops with an error if the person-years value is zero or negative. Computations can fail for any number of reasons; this one simply gives us a convenient example of a computation that fails for certain inputs. We wish to call inc only when the population is nonzero, so we check for that first, or at least we think that is what we are doing:
> inc <- function(cases,py) {
+   if (any(py<=0)) {
+     stop("no person-years at risk")
+   } else {
+     cases/py
+   }
+ }
> nn <- 0
> pop <- 0
> if (pop>0 & inc(nn,pop)>0.05) {
+   cat("incidence rate over 5%\n")
+ }
Error in inc(nn, pop) : no person-years at risk
Execution halted
What we really want is to evaluate the first argument, and only if it yields TRUE is it even worth trying to evaluate the second argument; if the first argument is FALSE, the logical result of and must be FALSE. Such an operator exists, and is called &&. If the first argument is FALSE, the result is FALSE and the second argument is never evaluated. If the first argument is TRUE, the second is evaluated, and the result of the && is TRUE if the second yields TRUE and FALSE otherwise. So this has quite different semantics from &, in that the second argument is not even evaluated unless it is needed. This gives you an easy way to avoid attempting computations that would fail, and you can also use it to avoid expensive computations. So in this example, we have
> inc <- function(cases,py) {
+   if (any(py<=0)) {
+     stop("no person-years at risk")
+   } else {
+     cases/py
+   }
+ }
> nn <- 0
> pop <- 0
> if (pop>0 && inc(nn,pop)>0.05) {
+   cat("incidence rate over 5%\n")
+ } else {
+   cat("incidence rate not over 5%\n")
+ }
incidence rate not over 5%
It is also worth noting that && produces a vector of length one, that is, a single boolean result. This operator is designed for use in determining control flow; it is for use in if and while expressions. The componentwise operator & is designed for data manipulation and boolean subscripting.
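As a small illustration of the data manipulation role of &, the sketch below combines two conditions element by element and uses the result as a boolean subscript. The figures are the state data from earlier in this section; the thresholds and the name keep are arbitrary choices made for the example.

```r
population <- c(CA = 34501130, NV = 2106074, TN = 5740021,
                RI = 1058920, AK = 634892)
income <- c(CA = 47493, NV = 44581, TN = 36360, RI = 42090, AK = 51571)

# & works componentwise: one TRUE or FALSE per state.
keep <- population > 1000000 & income > 40000

# The logical vector then selects the matching elements.
selected <- names(population)[keep]

print(keep)
print(selected)
```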
There is a short-circuit or operator too, called ||. Here, the first argument is evaluated. If it is TRUE, then the value of the expression is TRUE and the second argument is not even evaluated. If the first argument is FALSE, then the second argument is evaluated and its value is the value of the entire expression.
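One way to see the short-circuit behavior directly is to use a function that records each time it is called. The helper noisy and the variable calls below are hypothetical names invented for this sketch; noisy notes its label in calls and then returns its value.

```r
# Hypothetical helper: remembers that it was called, then returns value.
calls <- character(0)
noisy <- function(label, value) {
  calls <<- c(calls, label)
  value
}

# The first argument is TRUE, so || never evaluates the second.
result.or <- noisy("first", TRUE) || noisy("second", FALSE)
calls.after.or <- calls

# The first argument is FALSE, so && never evaluates the second.
calls <- character(0)
result.and <- noisy("first", FALSE) && noisy("second", TRUE)
calls.after.and <- calls

print(result.or)
print(calls.after.or)
print(result.and)
print(calls.after.and)
```

In both cases only the label "first" should be recorded, which shows that the second argument was skipped.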
Normally you separate statements by putting one on each line. You may place several statements on the same line if you separate them with semicolons, as you will see in the next example.
It's time to tell the whole truth about the braces {} too. The braces collect a sequence of statements together to form a compound statement. The value of the compound statement is the value of the last expression evaluated within the compound statement.
> zz <- {aa <- 8; 7}
> zz
[1] 7
> aa
[1] 8

> zz <- {aa <- 8
+ 7}
> zz
[1] 7
> aa
[1] 8
This is not normally what you do with a compound statement. You have already seen them in action:
> needle <- function(answers,curtime,endtime,pp,qq) {
+   if (curtime >= endtime) {
+     answers
+   } else {
+     xx <- answers[curtime]
+     newx <- (1-qq)*(xx+(1-xx)*pp)
+     needle(c(answers,newx),curtime+1,endtime,pp,qq)
+   }
+ }
Here is another example from before, where the compound statement is the braced body of the for loop:
> for (ii in c(3,9,4,7)) {
+   cat("The square of ",ii," is ", ii^2,".\n")
+ }
And finally of course we've seen compound statements in the while loop:
> ntrials <- 1
> while (sample(c("H","T"),1,replace=TRUE)=="H") {
+   ntrials <- ntrials+1
+ }
The switch function is occasionally helpful. It allows you to selectively evaluate one of several expressions depending on the first argument.
> x <- switch(2, "one","two","three")
> x
[1] "two"
That example does not quite exhibit all the power of switch. Only one of the expressions is even evaluated:
> x <- switch(2, {cat("doing first...\n");"one"},
+                {cat("doing second...\n");"two"},
+                {cat("doing third...\n");"three"})
doing second...
> x
[1] "two"
So switch allows a multiway branch based on a particular value. Of course this is usually achieved by a series of if/else branches.
You can also use names and character strings in a switch statement. Here is an example taken from the R language definition illustrating the use of the switch:
> center <- function(thedata,ctype) {
+   switch(ctype,
+          mean=mean(thedata),
+          median=median(thedata))
+ }
> center(c(1,1,4),"mean")
[1] 2
> center(c(1,1,4),"median")
[1] 1

The same function written as a series of if/else branches:
> center <- function(thedata,ctype) {
+   if (ctype=="mean") {
+     mean(thedata)
+   } else if (ctype=="median") {
+     median(thedata)
+   }
+ }
> center(c(1,1,4),"mean")
[1] 2
> center(c(1,1,4),"median")
[1] 1
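Neither version of center says what should happen when ctype matches none of the names; in both cases the function quietly returns NULL. The sketch below is one possible variant, not the version from the R language definition: it adds an unnamed final argument to switch, which acts as the default, and uses it to stop with an error. The name center.strict and the wording of the message are made up for this example.

```r
# Variant of center with a default branch (an assumption for illustration).
center.strict <- function(thedata, ctype) {
  switch(ctype,
         mean = mean(thedata),
         median = median(thedata),
         # An unnamed last argument is used when no name matches.
         stop("unknown center type: ", ctype))
}

mean.result <- center.strict(c(1, 1, 4), "mean")
median.result <- center.strict(c(1, 1, 4), "median")

# tryCatch turns the error into a value so the script can continue.
bad.result <- tryCatch(center.strict(c(1, 1, 4), "mode"),
                       error = function(e) "error caught")

print(mean.result)
print(median.result)
print(bad.result)
```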
Next lecture, we will cover a few more details about function calls, the ifelse function, and two useful variants of the assignment operator. Then we will do a few small examples and look at the needle reuse case study.