Sometimes you have data that is most naturally represented as a table. For instance, I may want to run many scenarios for the risk analysis. It would be natural to have four columns labeled needles, reuses, prevalence, and transmission.risk, and to have each row be a particular choice of values. Or we may want a two-by-two table holding the results of a risk factor study.
Because the original S language was designed to facilitate data analysis, there is considerable support for such data structures. The most important table structure is called the data frame, and it allows different data types for each column. Before we get to the data frame, however, I want to talk about numeric tables. Numeric tables are (under the hood) vectors like any other, but they can be manipulated by row and column.
Let's construct a simple table by stacking two rows on top of each other. Each row will start out as an ordinary vector, built by c(). But we will stack them by using rbind(), row bind:
> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
Now, we can refer to the first row:
> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[1,]
[1] 37  3
We can select columns as well. For instance, to select the second column, we would use [,2] as the subscript. Notice that when R displays a table, it reminds you of how to refer to rows ([1,] for instance) and columns ([,1] for instance).
> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[,2]
[1]  3 95
Let's look at single elements again in the table. For instance, let's look at the first row, second column:
> two.by.two <- rbind(c(37,3),c(2,95))
> two.by.two[1,2]
[1] 3
We can also build a table by placing columns side by side, using cbind(), column bind:
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
You can also refer to individual elements of a table with a single subscript, which counts through the elements in column order, though normally you don't need to do this.
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
> two.by.two[1]
[1] 37
> two.by.two[2]
[1] 2
> two.by.two[3]
[1] 3
> two.by.two[4]
[1] 95
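Be careful to type a comma, not a period, between the row and column subscripts. R truncates a fractional subscript such as 1.2 to 1, so the following quietly returns the first element rather than the first row, second column:
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two[1.2]
[1] 37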
The rbind (and cbind) functions can be used to add rows (or columns) to tables, or to combine two tables together:
> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> table.2 <- cbind(c(3,2),c(9,8))
> rbind(table.1,c(4,5,6))
     [,1] [,2] [,3]
[1,]   37    3    5
[2,]    2    8    7
[3,]    4    5    6
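Two whole tables can be combined in the same way, as long as their shapes agree. Here is a minimal sketch that places table.1 and table.2 side by side; the name combined is introduced only for this illustration:

```r
table.1 <- cbind(c(37,2), c(3,8), c(5,7))   # 2 rows, 3 columns
table.2 <- cbind(c(3,2), c(9,8))            # 2 rows, 2 columns

# Both tables have two rows, so cbind can place them side by side.
combined <- cbind(table.1, table.2)
print(combined)
print(dim(combined))   # number of rows, then number of columns
```

Stacking these two tables with rbind would fail instead, because they have different numbers of columns.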
The elements of a table can be assigned (changed) using the gets operator:
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
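As a small sketch of element assignment, we can overwrite the first row, second column of this table. The replacement value 50 is an arbitrary choice for illustration:

```r
two.by.two <- cbind(c(37,2), c(3,95))

# Replace the element in row 1, column 2 (originally 3) with 50.
two.by.two[1,2] <- 50
print(two.by.two)
```

Only the selected element changes; the rest of the table is left alone.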
The gets operator can be used to assign or change entire rows of a table as well:
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
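A sketch of row assignment, assuming we want to replace the first row with the values 50 and 10 (both the choice of row and the values are illustrative):

```r
two.by.two <- cbind(c(37,2), c(3,95))

# Leaving the column subscript empty selects the whole first row.
two.by.two[1,] <- c(50,10)
print(two.by.two)
```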
This works for columns as well. Let's take the same table and assign 50 10 to the second column instead:
> two.by.two <- cbind(c(37,2),c(3,95))
> two.by.two
     [,1] [,2]
[1,]   37    3
[2,]    2   95
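In code, the column assignment just described might look like this:

```r
two.by.two <- cbind(c(37,2), c(3,95))

# Leaving the row subscript empty selects the whole second column.
two.by.two[,2] <- c(50,10)
print(two.by.two)
```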
A table can be considered as a two-dimensional structure, an array of numbers. Or it can be considered a collection of vectors of equal length, as a collection of its rows or as a collection of its columns.
When you use gets to assign something that is too short to a row or column of a table, it gets duplicated to fill up that row or column. Let's take a matrix with two rows and three columns, and try to set the second row.
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
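A sketch of this duplication, assuming we assign the single number 0 to the second row (the value is chosen only for illustration):

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# The row has three elements but we supply only one value,
# so that value is repeated across the whole row.
a.table[2,] <- 0
print(a.table)
```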
We can select a subset of the rows or columns, in any order, if we wish:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
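For instance, here is a sketch that picks out the third and first columns, in that order, and then swaps the two rows. The particular subscripts and the names cols.3.1 and swapped are illustrative:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# A vector of subscripts selects several columns, in the order given.
cols.3.1 <- a.table[, c(3,1)]
print(cols.3.1)

# The same idea works for rows: this lists the second row first.
swapped <- a.table[c(2,1), ]
print(swapped)
```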
So far, we have seen only numeric tables. In realistic data sets, the columns (variables) can be of different types, and we will soon see how to handle that in R. The kind of table we have seen so far must hold a single data type, but that type need not be numeric. For instance, a table can be boolean:
> a.table <- rbind(c(TRUE,FALSE,FALSE),c(TRUE,TRUE,FALSE))
> a.table
     [,1]  [,2]  [,3]
[1,] TRUE FALSE FALSE
[2,] TRUE  TRUE FALSE
It is possible to perform map operations on a table. Remember that by map we mean applying a function to each element of a collection, yielding the collection of corresponding results. So far the only applications of this have used vectors as the collection. Tables, however, are more complex structures, and as we've stated, can be considered as the collection of their rows, or as the collection of their columns, or as the collection of all their elements. So in principle there are three different kinds of map operations you could do with a table.
Let's begin with the last one of these. I'm going to create a table of numbers and then take the square root of each element in the table, creating a table of the square roots:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
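A minimal sketch of this elementwise map, using the built-in sqrt function (the name roots is introduced here only for illustration):

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# sqrt is applied to every element; the result has the same shape.
roots <- sqrt(a.table)
print(roots)
print(dim(roots))
```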
Arithmetic operations can be performed as well, using tables:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
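As a sketch, here is a table combined with a single number. The numbers 1 and 2 and the names plus.one and doubled are arbitrary choices for illustration:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# A single number is combined with every element of the table.
plus.one <- a.table + 1
doubled  <- a.table * 2
print(plus.one)
print(doubled)
```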
If you use a longer vector, it gets duplicated out until it has the same number of elements as the table does, and then the elements of the vector are added to the table (in column order).
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
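A sketch with a vector of length two, using the illustrative values 10 and 20. Because the table is stored in column order and has two rows, the duplicated vector lines up so that 10 always meets the first row and 20 always meets the second:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# In column order the table is 3, 4, 2, 8, 9, 11.
# The vector is duplicated out to 10, 20, 10, 20, 10, 20.
shifted <- a.table + c(10,20)
print(shifted)
```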
Adding two tables together is potentially useful as well. This can be done by simply using a plus sign between the two tables:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> b.table
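A sketch of adding two tables of the same shape. The contents of b.table here are made up for illustration and are not the values from the original example:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))
b.table <- rbind(c(1,1,1), c(2,2,2))   # hypothetical second table, also 2 by 3

# Each element of a.table is added to the element of b.table
# in the same row and column.
total.table <- a.table + b.table
print(total.table)
```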
The other arithmetic operators work the same way. For brevity we won't go through all those explicitly. Remember that using the multiply operator * on tables gives you elementwise multiplication (NOT matrix multiplication, which we'll see later).
Now, you've seen how you can use the rules for how vectors are added to tables in a useful way, to add one number to the first row and something different to the second row. But the R system considers this quite different from placing a table on both sides of the plus sign (or any other operator). If you have a table on both sides of the plus sign, then they must be conformable: they must have the same number of rows and columns:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> b.table
> a.table + b.table
Error in a.table + b.table : non-conformable arrays
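To see this failure without stopping a script, one option is to wrap the addition in tryCatch. This is only a sketch: the two-by-two b.table is a made-up example of a table with the wrong shape, and tryCatch is not otherwise covered in these notes:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))   # 2 rows, 3 columns
b.table <- rbind(c(1,2), c(3,4))        # 2 rows, 2 columns (hypothetical)

# The shapes differ, so the addition signals an error.
# tryCatch hands back the error message instead of halting.
result <- tryCatch(a.table + b.table,
                   error = function(e) conditionMessage(e))
print(result)
```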
We can perform boolean operations as well:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
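As a sketch, comparing a table with a number gives a boolean table of the same shape, and boolean tables can be combined with & (and) and | (or). The thresholds 3 and 10 and the names big and in.range are arbitrary illustrations:

```r
a.table <- rbind(c(3,2,9), c(4,8,11))

# Each element is compared with 3, giving TRUE or FALSE per element.
big <- a.table > 3
print(big)

# Two boolean tables combined elementwise with "and".
in.range <- (a.table > 3) & (a.table < 10)
print(in.range)
```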
We've seen how to create new tables (arrays) with rbind and cbind. It's occasionally useful to be able to ask a table how many rows and columns it has:
> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> dim(a.table)
[1] 2 3
The transpose function, t(), flips a table around, making the rows into columns and the columns into rows:
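> a.table <- rbind(c(3,2,9),c(4,8,11))
> a.table
     [,1] [,2] [,3]
[1,]    3    2    9
[2,]    4    8   11
> t(a.table)
     [,1] [,2]
[1,]    3    4
[2,]    2    8
[3,]    9   11
> dim(t(a.table))
[1] 3 2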
Another useful operation is to create a table of a specified number of rows and columns. Let's create a table with 40 rows and 50 columns:
> a.table <- matrix(0,nrow=40,ncol=50)
> dim(a.table)
[1] 40 50
> a.table[1,1]
[1] 0
You can also place a longer vector as the first argument to matrix. Then the function matrix will duplicate the vector until it is long enough, or truncate it if it is too long. As always, the vector elements are filled into the table in column order:
> a.table <- matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
> a.table
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
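To see the duplication at work, here is a sketch that supplies only three values for a table with six cells. The values are illustrative:

```r
# Three values are repeated to fill six cells, in column order:
# 1, 2, 3, 1, 2, 3.
short.fill <- matrix(c(1,2,3), nrow=2, ncol=3)
print(short.fill)
```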
You can even use a table itself as the first argument to matrix; when this happens, the table is treated as a vector, as always, in column order. You can use this technique to reshape tables (so for instance, you could turn a 3 by 4 table into a 2 by 6 table). This can be error-prone, so it is good to carefully check the results of this sort of manipulation.
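A sketch of such a reshaping, turning a 3 by 4 table into a 2 by 6 table. The names three.by.four and two.by.six, and the values 1 through 12, are chosen only for illustration:

```r
three.by.four <- matrix(1:12, nrow=3, ncol=4)
print(three.by.four)

# The table is read off in column order (1, 2, ..., 12)
# and poured into the new shape, again in column order.
two.by.six <- matrix(three.by.four, nrow=2, ncol=6)
print(two.by.six)
```

Notice that the rows of the new table are not the rows of the old one, which is why the results deserve a careful check.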
So now we can create new tables, select rows and columns, perform repetition operations over the elements of a table, and transpose tables.
Operations that reduce the dimension of a table are called reductions. An example is to add up all the numbers in a vector: we go from a one-dimensional vector (a linear sequence of numbers) down to a single number. (Of course in R single numbers are represented as vectors of length one.) Another example is to sum up each column of a table: we go from a two-dimensional array of numbers down to a one-dimensional vector of the column sums. Notice that adding up all the columns of a table is in fact a map operation: we have a function that can sum up a column of numbers, we consider the table to be a collection of columns, and we apply the summing function to the collection of columns to produce the collection of sums.
R and its predecessor S are statistical packages as well as programming languages, so there are very many useful statistical functions which can operate on columns (or rows) of data, on vectors, and so forth.
Let's begin with some fundamental operations, such as adding up the elements of a vector using sum:
> a.vector <- c(2,4,6)
> sum(a.vector)
[1] 12
Another operation is to multiply all the elements of a vector together, using prod:
> prod(1:8)
[1] 40320
Taking the maximum and minimum of a collection are useful too. We use max and min for this:
> max(c(3,9,1,-10))
[1] 9
> min(c(3,9,1,-10))
[1] -10
There are useful reductions for boolean vectors too. In particular, any accepts a boolean vector, and returns TRUE if there is any TRUE element in the vector (at least one element is TRUE). The function all accepts a boolean vector, and returns TRUE only if all elements of the vector are TRUE:
> v1 <- c(TRUE,TRUE,TRUE)
> v2 <- c(FALSE,TRUE,TRUE)
> v3 <- c(FALSE,FALSE,FALSE)
> any(v1)
[1] TRUE
> any(v2)
[1] TRUE
> any(v3)
[1] FALSE
> all(v1)
[1] TRUE
> all(v2)
[1] FALSE
> all(v3)
[1] FALSE
These operations, sum, prod, max, min, any, and all, are examples of a programming idea known as fold. In a fold, you have a function that takes two arguments, and a collection of values. You apply the function to the first two values, producing a result, then take that result and the third value and apply the function again, and so forth. Many repetitive operations can be expressed as folds, especially if you have more general or powerful functions.
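To make the idea concrete, here is a sketch using Reduce, base R's general-purpose fold. Reduce is not otherwise used in these notes; it appears here only to show that the specialized functions above give the same answers as an explicit fold:

```r
# Reduce combines the first two values with the function, then
# combines that result with the third value, and so on.
total   <- Reduce(`+`, c(2,4,6))            # same answer as sum(c(2,4,6))
product <- Reduce(`*`, 1:8)                 # same answer as prod(1:8)
largest <- Reduce(max, c(3,9,1,-10))        # same answer as max(c(3,9,1,-10))
every   <- Reduce(`&`, c(FALSE,TRUE,TRUE))  # same answer as all(c(FALSE,TRUE,TRUE))

# accumulate=TRUE keeps each intermediate result of the fold.
running <- Reduce(`+`, c(2,4,6), accumulate=TRUE)

print(total)
print(product)
print(largest)
print(every)
print(running)
```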
Let us see how to apply these functions to rows or columns of arrays. First, let's begin with sum, since they all work the same. There are three ways to apply sum to a table: you could add up everything in the whole table to get a single grand total, you could add up the columns, or you could add up the rows. We can add up everything in a table by just placing the table as the argument of sum:
> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> sum(table.1)
[1] 62
But to apply the sum operation to all the rows (say), we will need something different. We will use the apply function. The apply function requires an array, a margin (one if by rows and two if by columns), and a function to apply.
> table.1 <- cbind(c(37,2),c(3,8),c(5,7))
> apply(table.1,1,sum)
[1] 45 17
> apply(table.1,2,sum)
[1] 39 11 12
Any function can be used in apply. We could use prod, max, etc.
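For example, here is a sketch that finds the largest value in each row and the product of each column of the same table (the names row.max and col.prod are introduced only for illustration):

```r
table.1 <- cbind(c(37,2), c(3,8), c(5,7))

# Margin 1 applies the function to each row.
row.max <- apply(table.1, 1, max)
print(row.max)

# Margin 2 applies the function to each column.
col.prod <- apply(table.1, 2, prod)
print(col.prod)
```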
Another useful idea is that represented by outer. This provides a nice way to do something to all possible pairs. To apply this, we need two vectors, and a function. If I have two vectors, say aa and bb, and a function (say "+"), I could add the first element of aa to the first element of bb, the first element of aa to the second element of bb, and so on, adding up all pairs and producing a table of results. Let's see how it works:
> aa <- c(1,3,5)
> bb <- c(2,4)
> tbl <- outer(aa,bb,FUN="+")
> tbl
     [,1] [,2]
[1,]    3    5
[2,]    5    7
[3,]    7    9
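Other two-argument functions can be used in the same way. As a sketch, swapping "+" for "*" gives the table of all pairwise products (the name products is introduced only for illustration):

```r
aa <- c(1,3,5)
bb <- c(2,4)

# Row i, column j holds aa[i] * bb[j].
products <- outer(aa, bb, FUN="*")
print(products)
print(dim(products))   # one row per element of aa, one column per element of bb
```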
As we have seen, the builtin support for array programming allows us to do quite a bit of repetition in a very natural and simple way. There are further useful features to support repetition which we will need, and will learn in the next lecture.