More on Lists

Last time we learned about R's list structure, which can be used to create a heterogeneous collection. We learned that we could create them using the constructor function list, and that we could access the elements by using either the double brackets [[]] or using the dollar sign $.

Let's just look at another example:

> first.list <- list(1,2,3)
> second.list <- list(4,5,6)
> third.list <- list("CA",c(4,5),list(4,"zz",TRUE))
> first.list
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3
> first.list + second.list
Error in first.list + second.list : non-numeric argument to binary operator
>

Here you see that while the arithmetic operator + is vectorized, it does not automatically work on lists.

Here you see that you may assign to lists:

# Continued from above...
> first.list[[1]] <- c("aa","bb")
> first.list
[[1]]
[1] "aa" "bb"

[[2]]
[1] 2

[[3]]
[1] 3
>

We've discussed the map idea at length, namely, applying a function to a collection of values to produce a collection of corresponding results. Well, a list is another kind of collection, so how can we map a function over one of R's lists? The right way to do it is to use the function lapply, read "ell-apply":

# Continued from above...
> lapply(second.list,sqrt)
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.44949
>

Here, we applied the sqrt function to every element of the list second.list and produced a list of the corresponding results. Here is another example:

# Continued from above...
> lapply(second.list,function(x){x^2})
[[1]]
[1] 16

[[2]]
[1] 25

[[3]]
[1] 36
>

Note carefully that we did not change second.list; we created a new list from it. This example also shows that you can pass an anonymous function to lapply.
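Returning to the addition that failed earlier: one way to add two lists element by element is to lapply over the positions rather than over the elements themselves. This is only a sketch; the name sum.list is made up for illustration, and the two input lists are redefined here so the example stands on its own.

```r
first.list <- list(1, 2, 3)
second.list <- list(4, 5, 6)

# seq_along(first.list) is the vector of positions 1, 2, 3.
# For each position ii, add the corresponding elements of the two lists.
sum.list <- lapply(seq_along(first.list), function(ii) {
  first.list[[ii]] + second.list[[ii]]
})
print(sum.list)
```

As before, neither input list is changed; the sums are collected in a new list.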

Here is another example, in which we apply the length function to every member of the list third.list. The first member of third.list is a character vector with one element, and thus has a length of one. The second member is a vector with two elements, and thus has a length of two. The third member is a list with three elements, and thus has a length of three. The function length works for lists as well as vectors.

# Continued from above...
> lapply(third.list,length)
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3
> length(third.list)
[1] 3
>

We applied length to third.list itself, and found that it had a length of three. The first element is an ordinary vector, as is the second element. The third element is itself a list, but it is still just one element of third.list, and third.list has only three elements in it.

Here is a wilder example:

> (function(f){f(2)})(sqrt)
[1] 1.414214
> lapply(list(sqrt,log,function(x){x^2}),function(f){f(2)})
[[1]]
[1] 1.414214

[[2]]
[1] 0.6931472

[[3]]
[1] 4
>

What happened? We have a function function(f){f(2)} that takes a single argument f and calls it with the argument 2. We demonstrate this first using sqrt, and get the square root of 2. So: in goes a function, out comes the result of applying that function to the number 2. We then lapply this function to a list of functions, getting the list of the results of evaluating each function at 2.

Of course, we could also work harder and do this sort of thing manually using a for loop. Let's create an empty list by calling list() with no arguments, and then add the elements one by one:

> a.list <- list(4,5,6)
> answer.list <- list()
> answer.list
list()
> for (ii in 1:length(a.list)) {
+   answer.list[[ii]] <- sqrt(a.list[[ii]])
+ }
> answer.list
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.44949
>

Notice that an empty list is printed as list().
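One caution about the loop above, offered as a side note rather than something the lecture relies on: the index 1:length(a.list) misbehaves when the input list is empty, because 1:0 counts down and gives the two values 1 and 0. The base R function seq_along avoids this. The name bad.index below is made up for illustration.

```r
# An empty input list: there is nothing to loop over.
a.list <- list()
answer.list <- list()

# 1:length(a.list) is 1:0, which is the vector c(1, 0), not an empty sequence.
bad.index <- 1:length(a.list)
print(bad.index)

# seq_along(a.list) is an empty integer vector, so the loop body never runs.
for (ii in seq_along(a.list)) {
  answer.list[[ii]] <- sqrt(a.list[[ii]])
}
print(length(answer.list))
```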

Another useful function is called sapply. This is a little different, since sapply will return a vector of answers if it can. Let's do a few experiments to see how sapply works and how it differs from lapply:

> a.list <- list(4,5,6)
> sapply(a.list,sqrt)
[1] 2.000000 2.236068 2.449490
> lapply(a.list,sqrt)
[[1]]
[1] 2

[[2]]
[1] 2.236068

[[3]]
[1] 2.44949
>

So here, sapply took a list of inputs (each of which happened to be a single number), and produced a vector of outputs. It produced a vector because each result was a number and you can have a vector of numbers. Here is another example:

> a.list <- list(c("aa","bb"),c(TRUE,TRUE,FALSE),1:100)
> sapply(a.list,length)
[1]   2   3 100
> lapply(a.list,length)
[[1]]
[1] 2

[[2]]
[1] 3

[[3]]
[1] 100
>

Notice here we used a list, each element of which was a vector. And the vectors had different lengths; the first was a character vector of length 2, the second a boolean/logical vector of length 3, and the last a numeric vector of length 100. We can apply the length function to each element of a.list and get a collection of the corresponding lengths. When we use sapply it is kind enough to produce a vector of results; when we use lapply we get a list of results. If sapply can't return a vector of results, it will return a list instead.
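To see the fallback just described, here is a small sketch in which the results cannot be packed into a vector because they have different lengths. The names lens and doubled are made up for illustration.

```r
a.list <- list(1:2, 1:3, 1:4)

# Each result is a single number, so sapply can build a vector.
lens <- sapply(a.list, length)
print(lens)

# Each result is a vector of a different length, so sapply
# falls back to returning a list, just as lapply would.
doubled <- sapply(a.list, function(x) { x * 2 })
print(is.list(doubled))
print(doubled)
```

Because the kind of result depends on the data, it is worth checking with is.list when the shape of the answer matters.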

How about this? Let's get the dimensions of each table, out of a list of tables:

> a.list <- list(rbind(c(2,3),c(4,5)),rbind(c(1,1,2),c(3,4,5),c(3,4,7)))
> a.list
[[1]]
     [,1] [,2]
[1,]    2    3
[2,]    4    5

[[2]]
     [,1] [,2] [,3]
[1,]    1    1    2
[2,]    3    4    5
[3,]    3    4    7
# Here, a.list really is a list of tables.
> sapply(a.list,dim)
     [,1] [,2]
[1,]    2    3
[2,]    2    3
> lapply(a.list,dim)
[[1]]
[1] 2 2

[[2]]
[1] 3 3
>

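Note that here sapply produced a table with one column per input table (the first column holds the dimensions of the first table, the second column those of the second), while lapply produced a list of the results.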

The Data Frame

The data frame is the workhorse data structure for statistical analysis in R (and S). While this class really emphasizes programming rather than statistical analysis, understanding the data frame is important if you are to make the most effective use of R. The manual page lists the data frame as the "fundamental data structure used by most of R's modeling software" (version 1.4.1, Linux).

Fortunately, the data frame is essentially like a table (array), which we have already seen, except that the data frame can contain elements of different types. Let's create a simple data set of states, populations (2001 Census estimate), and median household income (1999 estimate, US Census Bureau). We'll just do this for 5 states:

> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> state.data
             abbrev population income
California       CA   34501130  47493
Nevada           NV    2106074  44581
Tennessee        TN    5740021  36360
Rhode Island     RI    1058920  42090
Alaska           AK     634892  51571
>

Now, we can refer to variables (columns) by name using the $ notation, and we can refer to subjects (rows) by name as well:

# continued from above
> state.data$population
[1] 34501130  2106074  5740021  1058920   634892
> state.data$income
[1] 47493 44581 36360 42090 51571
> state.data["Alaska",]
       abbrev population income
Alaska     AK     634892  51571

We can index variables (columns) by number, just like a list too:

# continued from above
> state.data[[2]]
[1] 34501130  2106074  5740021  1058920   634892

And we can also select rows and columns like we do with an array:

# continued from above
> state.data[2:3,]
          abbrev population income
Nevada        NV    2106074  44581
Tennessee     TN    5740021  36360
> state.data[c(5,2,3),c("income","population")]
          income population
Alaska     51571     634892
Nevada     44581    2106074
Tennessee  36360    5740021

You can do database operations using boolean comparisons as well. Here, we select out only those states with more than 2 million people:

# continued from above
> state.data[state.data$population > 2000000,]
           abbrev population income
California     CA   34501130  47493
Nevada         NV    2106074  44581
Tennessee      TN    5740021  36360
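Conditions can be combined with the componentwise operator & to select on more than one variable at once. The sketch below rebuilds the same data frame so that it runs on its own; the income cutoff of 40000 and the name big.and.rich are made up for illustration.

```r
state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
abbrev <- c("CA","NV","TN","RI","AK")
population <- c(34501130,2106074,5740021,1058920,634892)
income <- c(47493,44581,36360,42090,51571)
state.data <- data.frame(abbrev,population,income,row.names=state)

# Keep only the rows where BOTH conditions hold:
# more than 2 million people and median income above 40000.
big.and.rich <- state.data[state.data$population > 2000000 &
                           state.data$income > 40000, ]
print(big.and.rich)
```

Tennessee passes the population test but not the income test, so only California and Nevada remain.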

You can also use the constructor data.frame to add variables to a data frame. Here, we will add a column of data representing the fraction of the population 65 and older in the year 2000:

> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> frac.over65 <- c(0.106,0.110,0.124,0.145,0.057)
> state.data.2 <- data.frame(state.data,frac.over65)
> state.data.2
             abbrev population income frac.over65
California       CA   34501130  47493       0.106
Nevada           NV    2106074  44581       0.110
Tennessee        TN    5740021  36360       0.124
Rhode Island     RI    1058920  42090       0.145
Alaska           AK     634892  51571       0.057

One simple way to drop a column from a data frame is to assign the column the value of NULL:

> state <- c("California","Nevada","Tennessee","Rhode Island","Alaska")
> abbrev <- c("CA","NV","TN","RI","AK")
> population <- c(34501130,2106074,5740021,1058920,634892)
> income <- c(47493,44581,36360,42090,51571)
> state.data <- data.frame(abbrev,population,income,row.names=state)
> frac.over65 <- c(0.106,0.110,0.124,0.145,0.057)
> state.data.2 <- data.frame(state.data,frac.over65)
> state.data.2$income <- NULL
> state.data.2
             abbrev population frac.over65
California       CA   34501130       0.106
Nevada           NV    2106074       0.110
Tennessee        TN    5740021       0.124
Rhode Island     RI    1058920       0.145
Alaska           AK     634892       0.057
> names(state.data.2)
[1] "abbrev"      "population"  "frac.over65"

Note that here we eliminated the income column by assigning state.data.2$income the special value of NULL.

NULL is a very special literal we have not discussed yet. The R language definition says that it is used "whenever there is a need to indicate or specify that an object is absent". To determine whether or not something is NULL, you may use the function is.null:

> #continuing from previous example:
> is.null(state.data)
[1] FALSE
> is.null(state.data$ufo)
[1] TRUE
> is.null(state.data$income)
[1] FALSE
> is.null(state.data.2$income)
[1] TRUE

Notice that state.data itself is not null; the object exists and has a definite value. But state.data$ufo is NULL since it does not exist; there is no such component. And state.data$income is not null since there is such a component, but state.data.2$income is null; the component does not exist, since we deleted it. So you may use is.null to test for the existence of a component of an object. And you may remove list or data frame components by assigning NULL to the component. Note that this does not work for vector elements:

> zz <- 1:10
> zz[4] <- NULL
Error in "[<-"(*tmp*, 4, value = NULL) : incompatible types
Execution halted
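If you do need to drop an element from a vector, one option is a negative subscript, which selects everything except the named positions. This is a sketch that assumes negative subscripts are acceptable here; the name zz.short is made up for illustration.

```r
zz <- 1:10

# A negative subscript means "all elements except this position".
# The result is a new vector; zz itself is unchanged.
zz.short <- zz[-4]
print(zz.short)
print(length(zz))
```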

To test whether something is a data frame or not, you may use the function is.data.frame:

# continued from above
> is.data.frame(state.data)
[1] TRUE

Normally, you read in a data frame from a file. The specific way this is done depends on whether you are on UNIX, Linux, Windows, or a Macintosh. The Windows GUI system has some special commands you will see in the lecture; here, I show you how to use the command line interface.

Suppose that you have the following information in a text file:

abbrev population income
CA 34501130 47493
NV 2106074 44581
TN 5740021 36360
RI 1058920 42090
AK 634892 51571

Suppose the name of this file is state.txt.

We will use the command read.table to read in the file:

> states <- read.table("state.txt",header=TRUE)
> states
  abbrev population income
1     CA   34501130  47493
2     NV    2106074  44581
3     TN    5740021  36360
4     RI    1058920  42090
5     AK     634892  51571

When the first line of the file contains the list of variable names (as it did in this example), you use the clause header=TRUE; otherwise, you use header=FALSE.

Another common format for data is the comma separated format. Suppose now we have the data as follows in a file called "state.csv":

abbrev,population,income
CA,34501130,47493
NV,2106074,44581
TN,5740021,36360
RI,1058920,42090
AK,634892,51571

To read in these data, we use the function read.csv:

> states <- read.csv("state.csv",header=TRUE)
> states
  abbrev population income
1     CA   34501130  47493
2     NV    2106074  44581
3     TN    5740021  36360
4     RI    1058920  42090
5     AK     634892  51571

The header clause has the same usage as for read.table.

These functions (read.table and read.csv) have considerably more features, some of which we will discuss later in this course.
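If you want to try read.csv without creating a file by hand, the following sketch writes the same five lines to a temporary file and reads them back. The use of tempfile and writeLines is just a convenience for the demonstration, and the name csv.file is made up.

```r
# tempfile() only makes up a file name; writeLines() creates the file.
csv.file <- tempfile(fileext = ".csv")
writeLines(c("abbrev,population,income",
             "CA,34501130,47493",
             "NV,2106074,44581",
             "TN,5740021,36360",
             "RI,1058920,42090",
             "AK,634892,51571"),
           csv.file)

# The first line holds the variable names, so header=TRUE.
states <- read.csv(csv.file, header=TRUE)
print(names(states))
print(nrow(states))

# Remove the temporary file when done.
unlink(csv.file)
```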

More on Subscripting

It's time to tell the whole truth about subscripting vectors and lists. Remember vectors are homogeneous collections, and lists are not. Lists are recursive in that they can contain lists within themselves. We have learned the subscript operator [] (usually used for vectors), and [[]] (usually used for lists). The main difference is that [] can be used for subsetting operations (selecting more than one object from a collection), while [[]] can be used for hierarchical selection.

It turns out that you can, in fact, use [] for lists:

> a.list <- list(1:3,c("CA","AZ"),TRUE)
> a.list[1]
[[1]]
[1] 1 2 3

And you can use the single brackets to do subset selection from lists too:

> a.list <- list(1:3,c("CA","AZ"),TRUE)
> a.list[c(3,1)]
[[1]]
[1] TRUE

[[2]]
[1] 1 2 3

And you can use [[]] for vectors as well. But [[]] needs to yield only a single element of a vector; a vector is not a hierarchical structure.

> zz <- 1:10
> zz[[2]]
[1] 2
> zz[[c(3,2)]]
Error: attempt to select more than one element
Execution halted
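To make the difference between subsetting and hierarchical selection concrete, here is a small sketch on a list that contains another list. The names one.bracket, two.bracket, and nested are made up for illustration.

```r
a.list <- list(1:3, c("CA","AZ"), list(4, "zz", TRUE))

# Single brackets subset: the result is still a list (of length one).
one.bracket <- a.list[2]
print(is.list(one.bracket))

# Double brackets select: the result is the element itself.
two.bracket <- a.list[[2]]
print(two.bracket)

# Double brackets can be chained to walk down the hierarchy:
# the second element of the third element of a.list.
nested <- a.list[[3]][[2]]
print(nested)
```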

There are some other subtle differences:

> zz <- c(aa=2,bb=3,uu=8)
> zz["aa"]
aa
 2
> zz[["bb"]]
[1] 3
> zz$uu
NULL

Here, we subscripted a vector using ["aa"], and found that the component name aa was inherited. But when we used the double brackets in [["bb"]], the component name bb was not inherited by the result. This behavior is occasionally useful. Finally, observe that the dollar sign does not work on vectors the way it does on lists and data frames: in the version of R used here it quietly returns NULL, and recent versions of R signal an error instead.

The empty vector subscript is occasionally useful for setting all the elements of a vector to some value without destroying component names:

> zz <- c(aa=2,bb=3,uu=8)
> zz[]
aa bb uu
 2  3  8
> zz[] <- NA
> zz
aa bb uu
NA NA NA

Here, we set all the elements in zz to NA but left the length of the vector unchanged, and did not affect the component names. If we had just typed zz <- NA, then we would have created a vector of length one whose single element was simply NA, not the same thing at all.

Control Operators

We have already seen the use of the componentwise boolean operators & and | (and of course !). These operate on entire vectors of boolean values. Thus, we see

> c(TRUE,TRUE,FALSE,FALSE) & c(TRUE,FALSE,TRUE,FALSE)
[1]  TRUE FALSE FALSE FALSE

for example.

But for determining the logical flow of your program, we have seen that in an if statement, you must make a single decision (at least in a single-threaded environment such as the one we are using!).

Moreover, a very common pattern is to check a condition and then perform a computation provided a condition is met. For instance, suppose we have a function to compute the incidence rate by dividing a number of cases by a number of person-years at risk. We will suppose that the function will produce an error if it gets a population value of zero; in our case, we will simply check for a value of zero and abort the computation. But remember that computations can fail for any number of reasons, so this provides us an example of a computation that fails for certain inputs. So we will check for those inputs, or we think that's what we are going to do. We have a function called inc that will fail if the denominator is zero. The idea is that we wish to only call it whenever the population is nonzero.

> inc <- function(cases,py) {
+   if (any(py<=0)) {
+     stop("no person-years at risk")
+   } else {
+     cases/py
+   }
+ }
> nn <- 0
> pop <- 0
> if (pop>0 & inc(nn,pop)>0.05) {
+   cat("incidence rate over 5%\n")
+ }
Error in inc(nn, pop) : no person-years at risk
Execution halted

But this does not work. We used the single-ampersand "and" operator &, and it evaluates both of its arguments. The first argument is FALSE, but R still evaluates the second argument and calls the function inc, and execution halts because that computation fails. This isn't what we wanted at all. We only want to try the computation when the conditions are right for it, and we are using the & to test for those conditions.

What we really want is to evaluate the first argument, and only if it yields TRUE is it even worth trying to evaluate the second argument; if the first argument is FALSE the logical result of and must be FALSE. Such an operator exists, and is called &&. This operator only evaluates the second argument if the first is TRUE. If the first argument is TRUE, the second is evaluated, and if the second yields TRUE the result of the && is TRUE and it is FALSE otherwise. So this has quite different semantics in that the second argument is not even evaluated unless it is needed. This way, you can prevent yourself from attempting computations that will fail in an easy way. You can also use this to avoid expensive computations. So in this example, we have

> inc <- function(cases,py) {
+   if (any(py<=0)) {
+     stop("no person-years at risk")
+   } else {
+     cases/py
+   }
+ }
> nn <- 0
> pop <- 0
> if (pop>0 && inc(nn,pop)>0.05) {
+   cat("incidence rate over 5%\n")
+ } else {
+   cat("incidence rate not over 5%\n")
+ }
incidence rate not over 5%

The computation succeeds because the troublesome computation inc is not even attempted if pop equals zero. An operator like && is often called a short-circuit boolean operator, and most languages provide one.

It is also worth noting that && produces a vector of length one, that is, a single boolean result. This operator is designed for use in determining control flow; it is for use in if and while expressions. The componentwise operator & is designed for data manipulation and boolean subscripting.

There is a short-circuit or operator too, called ||. Here, the first argument is evaluated. If it is TRUE, then the value of the expression is TRUE and the second argument is not even evaluated. If the first argument is FALSE, then the second argument is evaluated and its value is the value of the entire expression.
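Here is a small sketch of || used as a guard. The function is.missing is hypothetical, invented only for this illustration: it treats NULL, a zero-length vector, or a vector whose first element is NA as "missing". Because || stops at the first TRUE, x[1] is examined only once we know that x has at least one element.

```r
# Hypothetical helper: TRUE if x is NULL, empty, or starts with NA.
is.missing <- function(x) {
  # Evaluated left to right; later tests run only if the earlier ones are FALSE.
  is.null(x) || length(x) == 0 || is.na(x[1])
}

print(is.missing(NULL))
print(is.missing(numeric(0)))
print(is.missing(NA))
print(is.missing(3))
```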

The Semicolon

Normally you separate statements by putting one on each line. You may place several statements on the same line if you separate them with semicolons, as you will see in the next example.

Compound Statements

It's time to tell the whole truth about the braces {} too. The braces collect a sequence of statements together to form a compound statement. The value of the compound statement is the value of the last expression evaluated within the compound statement.

> zz <- {aa <- 8; 7}
> zz
[1] 7
> aa
[1] 8

Here, we had the compound statement {aa <- 8; 7} on the right hand side of an assignment statement. When it is evaluated, the value of 8 is assigned to aa, and then the value 7 is evaluated. The 7 is the last expression evaluated inside the compound statement, so the value of the entire compound statement is 7, and this value is assigned to zz after the compound statement is finished. We could have avoided the semicolon by using a line break, however:

> zz <- {aa <- 8
+ 7}
> zz
[1] 7
> aa
[1] 8

This is not normally what you do with a compound statement. You have already seen them in action:

> needle <- function(answers,curtime,endtime,pp,qq) {
+   if (curtime >= endtime) {
+     answers
+   } else {
+     xx <- answers[curtime]
+     newx <- (1-qq)*(xx+(1-xx)*pp)
+     needle(c(answers,newx),curtime+1,endtime,pp,qq)
+   }
+ }

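In the definition above, the entire body of the function, from the opening brace after function(...) to the final closing brace, is a compound statement. Moreover, there are two more compound statements inside it: the body of the if branch and the body of the else branch. They are marked with comments here:

> needle <- function(answers,curtime,endtime,pp,qq) {   # outer compound statement begins
+   if (curtime >= endtime) {                           # first inner compound statement
+     answers
+   } else {                                            # second inner compound statement
+     xx <- answers[curtime]
+     newx <- (1-qq)*(xx+(1-xx)*pp)
+     needle(c(answers,newx),curtime+1,endtime,pp,qq)
+   }
+ }                                                     # outer compound statement ends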

Here is another example from before; the compound statement is the braced body of the for loop:

> for (ii in c(3,9,4,7)) {
+   cat("The square of ",ii," is ", ii^2,".\n")
+ }

And finally of course we've seen compound statements in the while loop:

> ntrials <- 1
> while (sample(c("H","T"),1,replace=TRUE)=="H") {
+   ntrials <- ntrials+1
+ }

So now you know.

Switch

The switch function is occasionally helpful. It allows you to selectively evaluate one of several expressions depending on the first argument.

> x <- switch(2, "one","two","three")
> x
[1] "two"

Here, only the second expression in the argument sequence following the first argument is actually evaluated (because the first argument is 2), and it is returned as the value of switch.

That example does not quite exhibit all the power of switch. Only one of the expressions is even evaluated:

> x <- switch(2, {cat("doing first...\n");"one"},
+             {cat("doing second...\n");"two"},
+             {cat("doing third...\n");"three"})
doing second...
> x
[1] "two"

Here, we used compound statements inside the switch. Each compound statement contains a cat to print a message followed by the desired value. You can see that only the second compound statement is ever evaluated because the only thing that is printed is doing second....

So switch allows a multiway branch based on a particular value. Of course this is usually achieved by a series of if/else branches.

You can also use names and character strings in a switch statement. Here is an example taken from the R language definition illustrating the use of the switch:

> center <- function(thedata,ctype) {
+   switch(ctype,
+          mean=mean(thedata),
+          median=median(thedata))
+ }
> center(c(1,1,4),"mean")
[1] 2
> center(c(1,1,4),"median")
[1] 1

Of course, we could have achieved the same thing with if:

> center <- function(thedata,ctype) {
+   if (ctype=="mean") {
+     mean(thedata)
+   } else if (ctype=="median") {
+     median(thedata)
+   }
+ }
> center(c(1,1,4),"mean")
[1] 2
> center(c(1,1,4),"median")
[1] 1
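Both versions of center quietly return NULL when ctype is neither "mean" nor "median". With a character first argument, switch treats an unnamed final argument as the default case, so one possible variation (a sketch, not part of the example from the language definition) is to report the problem there:

```r
center <- function(thedata, ctype) {
  switch(ctype,
         mean = mean(thedata),
         median = median(thedata),
         # Unnamed last argument: evaluated only when no name matches ctype.
         stop("unknown ctype: ", ctype))
}

print(center(c(1,1,4), "mean"))
print(center(c(1,1,4), "median"))

# An unrecognized name now produces an error instead of NULL.
result <- try(center(c(1,1,4), "mode"), silent = TRUE)
print(inherits(result, "try-error"))
```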

Next lecture, we will cover a few more details about function calls, the ifelse function, and two useful variants of the assignment operator. Then we will do a few small examples and look at the needle reuse case study.
