Tables (Arrays)

Sometimes you have data that is most naturally represented as a table. For instance, I may want to run many scenarios for the risk analysis. It would be natural to have four columns labeled needles, reuses, prevalence, and transmission.risk, and to have each row be a particular choice of values. Or we may want to use a two-by-two table to represent the results of a risk factor study.

Because the original S language was designed to facilitate data analysis, there is considerable support for such data structures. The most important table structure is called the data frame, and it allows different data types for each column. Before we get to the data frame, however, I want to talk about numeric tables. Numeric tables are (under the hood) vectors like any other, but they can be manipulated by row and column.

Let's construct a simple table by stacking two rows on top of each other. Each row will start out as an ordinary vector, built by c(). But we will stack them by using rbind(), row bind:

> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
>

Now, we can refer to the first row:

> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[1,]
[1] 37 3
>
Note that the first row of the table is actually a vector.
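
One way to check this for yourself is sketched below. It uses the base R functions is.matrix and dim, which these notes have not introduced at this point, and the name first.row is chosen only for illustration:

```r
# Build the same two-by-two table used above.
two.by.two <- rbind(c(37, 3), c(2, 95))

# Select the first row.
first.row <- two.by.two[1, ]

# The whole table is a matrix, but a single row is a plain vector.
is.matrix(two.by.two)   # TRUE
is.matrix(first.row)    # FALSE
length(first.row)       # 2
dim(first.row)          # NULL: a plain vector has no dimensions
```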

We can select columns as well. For instance, to select the second column, we would use [,2] as the subscript. Notice that when R displays a table, it reminds you of how to refer to rows ([1,] for instance) and columns ([,1] for instance).

> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[,2]
[1] 3 95
>
The second column of the table is also a vector.

Let's look at single elements again in the table. For instance, let's look at the first row, second column:

> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[1,2]
[1] 3
>
Notice that we are using two separate values separated by a comma in the subscript.

We can stack columns together to create tables by using cbind. This binds columns together to create a table.

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
>
Here each vector becomes a column, and the columns are placed side by side.

You can also refer to individual elements of a table with a single subscript, though normally you don't need to do this.

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
> two.by.two[1]
[1] 37
> two.by.two[2]
[1] 2
> two.by.two[3]
[1] 3
> two.by.two[4]
[1] 95
>
Notice that the system orders the elements of a table by column. Again, you normally don't need to do this, but sometimes you type a period instead of a comma, so that you have used one subscript instead of two:

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two[1.2]
[1] 37
>
The value 1.2 is truncated to one, and then the first element of the table is selected. If you intended to type 1,2 (with the comma), then this is a mistake, but R will not report an error, so you should understand what is happening.

The rbind (and cbind) functions can be used to add rows (or columns) to tables, or to combine two tables together:

> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> table.2 <- cbind(c(3,2),c(9,8))
> rbind(table.1,c(4,5,6))
     [,1] [,2] [,3]
[1,]   37    3    5
[2,]    2    8    7
[3,]    4    5    6
> cbind(table.1,table.2)
     [,1] [,2] [,3] [,4] [,5]
[1,]   37    3    5    3    9
[2,]    2    8    7    2    8
>

The elements of a table can be assigned (changed) using the gets operator:

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
> two.by.two[1,2] <- -40
> two.by.two
     [,1] [,2]
[1,]   37  -40
[2,]    2   95
>

The gets operator can be used to assign or change entire rows of a table as well:

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
> two.by.two[1,] <- c(50,10)
> two.by.two
     [,1] [,2]
[1,]   50   10
[2,]    2   95
>
Here, we set the first row to be 50 10.

This works for columns as well. Let's take the same table and assign 50 10 to the second column instead:

> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
> two.by.two[,2] <- c(50,10)
> two.by.two
     [,1] [,2]
[1,]   37   50
[2,]    2   10
>

A table can be considered as a two-dimensional structure, an array of numbers. Or it can be considered a collection of vectors of equal length, as a collection of its rows or as a collection of its columns.

When you use gets to assign a value that is too short to a row or column of a table, the value gets duplicated to fill up the row or column. Let's take a matrix with two rows and three columns, and try to set the second row.

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> a.table[2,] <- 7
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    7    7    7
>
The second row is of length three, so the 7 (remember: a vector of length one) is duplicated out to length three, and then the assignment is performed.

We can select a subset of the rows or columns, in any order, if we wish:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> a.table[c(2,1),c(3,1)]
     [,1] [,2]
[1,]   11    4
[2,]    9    3
>

So far, we have seen only numeric tables. Tables of this kind must be of a single data type, but that type need not be numeric. In realistic data sets, the columns (variables) can be of different types, and we will soon see how to handle this in R. For now, here is a table that is boolean:

> a.table <- rbind(c(TRUE,FALSE,FALSE),c(TRUE,TRUE,FALSE))
> a.table
     [,1]  [,2]  [,3]
[1,] TRUE FALSE FALSE
[2,] TRUE  TRUE FALSE
>

Using tables

It is possible to perform map operations on a table. Remember that by map we mean applying a function to each element of a collection, yielding the collection of corresponding results. So far the only applications of this have used vectors as the collection. Tables, however, are more complex structures, and as we've stated, can be considered as the collection of their rows, or as the collection of their columns, or as the collection of all their elements. So in principle there are three different kinds of map operations you could do with a table.

Let's begin with the last one of these. I'm going to create a table of numbers and then take the square root of each element in the table, creating a table of the square roots:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> sqrt(a.table)
         [,1]     [,2]     [,3]
[1,] 1.732051 1.414214 3.000000
[2,] 2.000000 2.828427 3.316625
>
Here we applied a function (sqrt) to a table of numbers, and produced a table of the square roots. This is another example of the map idea in action. We have repeated a computation effortlessly thanks to the services provided by the R system.

Arithmetic operations can be performed as well, using tables:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> a.table + 4
     [,1] [,2] [,3]
[1,]    7    6   13
[2,]    8   12   15
>
This illustrates adding the same number to each element of a table, producing a table of values. For example, for some statistical applications you must add 0.5 to each element of a table before computing an odds ratio, and you could use this technique to do so.
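
As a rough sketch of that use, suppose the two-by-two table from earlier holds the four cell counts of a risk factor study, and take the odds ratio to be the product of the diagonal cells divided by the product of the off-diagonal cells. The meaning of the rows and columns, and the names adjusted and odds.ratio, are assumptions made only for illustration:

```r
# Cell counts of a two-by-two table (layout assumed for illustration).
two.by.two <- rbind(c(37, 3), c(2, 95))

# Add 0.5 to every cell; the single number is applied to each element.
adjusted <- two.by.two + 0.5

# Odds ratio: product of the diagonal over product of the off-diagonal.
odds.ratio <- (adjusted[1, 1] * adjusted[2, 2]) /
              (adjusted[1, 2] * adjusted[2, 1])
odds.ratio
```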

If you use a longer vector, it gets duplicated out until it has the same number of elements as the table does, and then the elements of the vector are added to the table (in column order).

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> a.table + c(1,100)
     [,1] [,2] [,3]
[1,]    4    3   10
[2,]  104  108  111
>
What has happened? We added 1 to everything in the first row, and 100 to everything in the second row. This is because the vector c(1,100) has two elements, and the table has six elements; therefore, c(1,100) becomes c(1,100,1,100,1,100) before the addition happens. Because the elements of the table are traversed columnwise, each 1 lands on a first-row element and each 100 lands on a second-row element.
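
To make the duplication visible, here is a small sketch that spells it out by hand with the base R function rep (not otherwise used in this section). The names expanded, short.sum, and long.sum are chosen only for illustration:

```r
a.table <- rbind(c(3, 2, 9), c(4, 8, 11))

# Spell out the duplication that R performs automatically:
# repeat c(1,100) until it has as many elements as the table.
expanded <- rep(c(1, 100), length.out = length(a.table))
expanded                      # 1 100 1 100 1 100

# Adding the expanded vector gives the same table as adding c(1,100).
short.sum <- a.table + c(1, 100)
long.sum  <- a.table + expanded
all(short.sum == long.sum)    # TRUE
```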

Adding two tables together is potentially useful as well. This can be done by simply using a plus sign between the two tables:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> b.table <- rbind(c(1,9,3),c(5,23,13))
> b.table
     [,1] [,2] [,3]
[1,]    1    9    3
[2,]    5   23   13
> a.table + b.table
     [,1] [,2] [,3]
[1,]    4   11   12
[2,]    9   31   24
>

The other arithmetic operators work the same way. For brevity we won't go through all those explicitly. Remember that using the multiply operator * on tables gives you elementwise multiplication (NOT matrix multiplication, which we'll see later).
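
For instance, here is a short sketch of elementwise multiplication with the same two tables. The name product is chosen only for illustration:

```r
a.table <- rbind(c(3, 2, 9), c(4, 8, 11))
b.table <- rbind(c(1, 9, 3), c(5, 23, 13))

# Each element of a.table is multiplied by the element of b.table
# in the same row and column.
product <- a.table * b.table
product

# Check one cell by hand: row 2, column 3 is 11 * 13.
product[2, 3] == a.table[2, 3] * b.table[2, 3]   # TRUE
```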

Now, you've seen how you can use the rules for how vectors are added to tables in a useful way, to add one number to the first row and something different to the second row. But the R system considers this quite different from placing a table on both sides of the plus sign (or any other operator). If you have a table on both sides of the plus sign, then they must be conformable: they must have the same number of rows and columns:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> b.table <- rbind(c(10,4),c(4,5))
> b.table
     [,1] [,2]
[1,]   10    4
[2,]    4    5
> a.table + b.table
Error in a.table + b.table : non-conformable arrays
>

We can perform boolean operations as well:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> a.table == 2
      [,1]  [,2]  [,3]
[1,] FALSE  TRUE FALSE
[2,] FALSE FALSE FALSE
>
The same rules apply for boolean operators and comparisons involving vectors and tables.

We've seen how to create new tables (arrays) with rbind and cbind. It's occasionally useful to be able to ask a table how many rows and columns it has:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> dim(a.table)
[1] 2 3
>
We use the function dim to tell us the dimensions of the table. We get a vector; the first element is the number of rows, and the second is the number of columns. You can also use nrow to find the number of rows and ncol to find the number of columns.
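
A brief sketch of nrow and ncol on the same table follows. The names n.rows and n.cols are chosen only for illustration:

```r
a.table <- rbind(c(3, 2, 9), c(4, 8, 11))

n.rows <- nrow(a.table)   # 2
n.cols <- ncol(a.table)   # 3

# These agree with the two elements of dim(a.table).
dim(a.table)[1] == n.rows   # TRUE
dim(a.table)[2] == n.cols   # TRUE
```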

The transpose function, t(), flips a table around, making the rows into columns and the columns into rows:

> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> t(a.table)
     [,1] [,2]
[1,]    3    4
[2,]    2    8
[3,]    9   11
> dim(t(a.table))
[1] 3 2
>
Notice that the transposed table has three rows and two columns, as we are informed by dim.

Another useful operation is to create a table of a specified number of rows and columns. Let's create a table with 40 rows and 50 columns:

> a.table <- matrix(0,nrow=40,ncol=50)
> dim(a.table)
[1] 40 50
> a.table[1,1]
[1] 0
>
We used the function matrix to create a new table. The table is too big to print, but we can verify that it has 40 rows and 50 columns, and we can look at a few elements. The first argument to matrix specifies what you want the elements to be; putting a single number there (a vector of length one) fills up the whole table with that value. In this case, we filled the table with zeros by using 0 as the first argument. We then specify the number of rows with the nrow= argument, and the number of columns with the ncol= argument.

You can also place a longer vector as the first argument to matrix. Then the function matrix will duplicate the vector until it is long enough, or truncate it if it is too long. As always, the vector elements are filled into the table in column order:

> a.table <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
> a.table
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
>
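
The example above supplies exactly six values. As a small sketch of the duplication case, here a vector of length two fills a table with six cells. The name short.fill is chosen only for illustration:

```r
# A vector of length two is repeated to fill all six cells.
short.fill <- matrix(c(1, 2), nrow = 2, ncol = 3)
short.fill

# Because filling goes down each column, every column is 1, 2:
# the first row is all 1s and the second row is all 2s.
short.fill[1, ]   # 1 1 1
short.fill[2, ]   # 2 2 2
```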

You can even use a table itself as the first argument to matrix; when this happens, the table is treated as a vector, as always, in column order. You can use this technique to reshape tables (so for instance, you could turn a 3 by 4 table into a 2 by 6 table). This can be error-prone, so it is good to carefully check the results of this sort of manipulation.
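
Here is a minimal sketch of such a reshaping, using a 3 by 4 table of the numbers 1 to 12 so that the movement of each element is easy to follow. The names wide and reshaped are chosen only for illustration:

```r
# A 3 by 4 table holding the numbers 1 to 12, filled in column order.
wide <- matrix(1:12, nrow = 3, ncol = 4)

# Reuse the table itself as the data for a 2 by 6 table.
reshaped <- matrix(wide, nrow = 2, ncol = 6)

dim(reshaped)     # 2 6
reshaped[, 1]     # 1 2: the first two elements in column order

# Careful checking pays off: the old rows are not preserved.
wide[1, ]         # 1 4 7 10
reshaped[1, ]     # 1 3 5 7 9 11
```

The elements keep their column order, but they land in different rows and columns, which is why this kind of manipulation deserves a check.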

So now we can create new tables, select rows and columns, perform repetition operations over the elements of a table, and transpose tables.

Array Reductions

Operations that reduce the dimension of a table are called reductions. An example is to add up all the numbers in a vector: we go from a one-dimensional vector (a linear sequence of numbers) down to a single number. (Of course in R single numbers are represented as vectors of length one.) Another example is to sum up each of the columns of a table: we go from a two-dimensional array of numbers down to a one-dimensional vector of the column sums. Notice that adding up all the columns of a table is in fact a map operation: we have a function that can sum up a column of numbers, we consider the table to be a collection of columns, and we apply the summing function to the collection of columns to produce the collection of sums.

R and its predecessor S are statistical packages as well as programming languages, so there are very many useful statistical functions which can operate on columns (or rows) of data, on vectors, and so forth.

Let's begin with some fundamental operations, such as adding up the elements of a vector using sum:

> a.vector <- c(2,4,6)
> sum(a.vector)
[1] 12
>

To add up numbers manually, you would add the first to the second, then add the third to the running total, and so forth. You would be repeating an addition operation, but with sum the repetition is done automatically.

Another operation is to multiply all the elements of a vector together, using prod:

> prod(1:8)
[1] 40320
>

Again, to do this manually, you would have to repeat the multiplication operation, but prod does the repetition automatically. By the way, multiplying the integers from 1 to N is called computing the factorial of N, and this product is usually denoted N! (N followed by an exclamation point).

Taking the maximum and minimum of a collection are useful too. We use max and min for this:

> max(c(3,9,1,-10))
[1] 9
> min(c(3,9,1,-10))
[1] -10
>

Again, to do this manually, you would have to repeat a comparison operation, but max and min do the repetition automatically.

There are useful reductions for boolean vectors too. In particular, any accepts a boolean vector, and returns TRUE if there is any TRUE element in the vector (at least one element is TRUE). The function all accepts a boolean vector, and returns TRUE only if all elements of the vector are TRUE:

> v1 <- c(TRUE,TRUE,TRUE)
> v2 <- c(FALSE,TRUE,TRUE)
> v3 <- c(FALSE,FALSE,FALSE)
> any(v1)
[1] TRUE
> any(v2)
[1] TRUE
> any(v3)
[1] FALSE
> all(v1)
[1] TRUE
> all(v2)
[1] FALSE
> all(v3)
[1] FALSE
>

To do this manually, you would have to repeat a boolean operation.

These operations, sum, prod, max, min, any, and all, are examples of a programming idea known as fold. In a fold, you have a function that takes two arguments, and a collection of values. You perform the function on the first two values, producing a result, then take that result and the third value and apply the function again, and so forth. Many repetitive operations can be expressed as folds, especially if you have more general or powerful functions.
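
Base R also provides a general fold function called Reduce, which these notes do not otherwise use. The sketch below shows how sum and prod can be expressed with it. The names fold.sum, fold.prod, and running are chosen only for illustration:

```r
a.vector <- c(2, 4, 6)

# Fold with addition: ((2 + 4) + 6), the same total that sum gives.
fold.sum <- Reduce(`+`, a.vector)
fold.sum == sum(a.vector)      # TRUE

# Fold with multiplication over 1 to 8, matching prod(1:8).
fold.prod <- Reduce(`*`, 1:8)
fold.prod == prod(1:8)         # TRUE

# accumulate = TRUE keeps the running results of the fold.
running <- Reduce(`+`, a.vector, accumulate = TRUE)
running                        # 2 6 12
```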

Let us see how to apply these functions to rows or columns of arrays. First, let's begin with sum, since they all work the same. There are three ways to apply sum to a table: you could add up everything in the whole table to get a single grand total, you could add up the columns, or you could add up the rows. We can add up everything in a table by just placing the table as the argument of sum:

> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> sum(table.1)
[1] 62
>

But to apply the sum operation to all the rows (say), we will need something different. We will use the apply function. The apply function requires an array, a margin (one if by rows and two if by columns), and a function to apply.

> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> apply(table.1,1,sum)
[1] 45 17
> apply(table.1,2,sum)
[1] 39 11 12
>

Notice that this has implemented a map operation. In the first case, we consider the table to be a collection of its two rows, and we apply the sum operation to each of them. In the second case, we consider the table to be a collection of its columns, and we apply sum to each of these. Finally, observe that the result of apply is a vector, not a table.

Any function can be used in apply. We could use prod, max, etc.
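
As a short sketch with the same table, here are max applied by rows and prod applied by columns. The names row.max and col.prod are chosen only for illustration:

```r
table.1 <- cbind(c(37, 2), c(3, 8), c(5, 7))

# Largest value in each row (margin 1).
row.max <- apply(table.1, 1, max)
row.max      # 37 8

# Product of the numbers in each column (margin 2).
col.prod <- apply(table.1, 2, prod)
col.prod     # 74 24 35
```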

Outer

Another useful idea is that represented by outer. This provides a nice way to do something to all possible pairs. To apply this, we need two vectors, and a function. If I have two vectors, say aa and bb, and a function (say "+"), I could add the first element of aa to the first element of bb, the first element of aa to the second element of bb, and so on, adding up all pairs and producing a table of results. Let's see how it works:

> aa <- c(1,3,5)
> bb <- c(2,4)
> tbl <- outer(aa,bb,FUN="+")
> tbl
     [,1] [,2]
[1,]    3    5
[2,]    5    7
[3,]    7    9
>
Notice that to use the plus sign, we must use it in quotes according to the syntax given. Also notice that since the vector aa came first, it corresponds to the rows, and the elements of bb correspond to the columns. As an exercise, find out how to create a simple multiplication table using outer.
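
The correspondence between the vectors and the rows and columns can be checked directly. This sketch stays with "+" so as not to give away the exercise. The name swapped is chosen only for illustration:

```r
aa <- c(1, 3, 5)
bb <- c(2, 4)
tbl <- outer(aa, bb, FUN = "+")

# One row per element of aa, one column per element of bb.
dim(tbl)                       # 3 2

# The entry in row i, column j combines aa[i] with bb[j].
tbl[3, 2] == aa[3] + bb[2]     # TRUE: 5 + 4 is 9

# Swapping the vectors swaps the roles of rows and columns.
swapped <- outer(bb, aa, FUN = "+")
all(swapped == t(tbl))         # TRUE
```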

As we have seen, the built-in support for array programming allows us to do quite a bit of repetition in a very natural and simple way. There are further useful features to support repetition which we will need, and which we will learn in the next lecture.