under construction; last updated September 2, 2003


Interacting with the Interpreter

We are working with the command line interface to the R interpreter.

Working with R in this way is much like using a desk calculator. You type an expression for R to evaluate. R reads the expression, evaluates it, and prints the result.

How you start up R depends on your computer. On many systems you may simply click or double click on an icon representing the program R.

When you finish an R session, you may quit by typing q(), the quit function. Depending on your system, you may be able to quit by using a menu interface.

When you quit an R session, R gives you an opportunity to save your session. If you decline to save the workspace, then the session is lost and R will not remember what you did next time you begin. If you agree to save the session, then R will start next time where you left off this time.

Exercise. Start the R interpreter and see that it is working properly. Then exit by using the quit function q().

Literals

The simplest type of expression you may enter for R to evaluate is simply a literal. A literal is an expression which is supposed to stand for itself, so to speak.

Numeric literals

Numeric literals represent numbers.

For example, you may type ordinary integers into the interpreter:

> 2
[1] 2
>

What is this? First you see the command prompt, the sign that looks like the greater-than symbol >. After the prompt comes what you type, in this case 2. (We show what we or you type in blue normally, or in red if we want to emphasize it.) On the next line you see the response [1] 2. The [1] indicates that the output R is giving you starts with the first result you wanted computed. Here, that's all there is. Then you see the actual result, 2. Finally, the transcript shows the next command prompt that R wrote, indicating that the computation is over and the interpreter awaits your next instruction.

This is certainly not a hard computation. You asked it to compute the value of 2, and it let you know that the result is just 2. Here, we simply typed the literal 2, which represented the number two.

Numbers you type in to R are normally converted to floating point representation. In a floating point representation, the significant figures are stored separately from the exponent. Only a finite number of significant figures are kept.

For instance, if you were computing with pencil and paper, you might approximate one third by 0.3333333. For many purposes this might be close enough to one third, close enough that the difference would not matter for your purpose; what is good enough for one purpose might not be good enough for another. But conceptually, 0.3333333 is NOT equal to one third. The difference is called roundoff error. We will discuss this more in subsequent lectures.

R tries to minimize roundoff error by keeping more digits internally than you will need for many applications. Internally, R tries to take advantage of double precision floating point representations of your numbers.
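
To make the point about extra internal digits concrete, here is a small sketch. It uses print() with a digits argument and the == comparison, neither of which has been introduced yet in this lecture, so treat it as a preview rather than something you need to know now.

```r
# By default R shows about seven significant digits.
1/3                      # prints 0.3333333

# Ask for more digits to see that more are kept internally.
print(1/3, digits = 15)  # prints 0.333333333333333

# The stored value is close to the pencil-and-paper approximation
# 0.3333333, but it is not the same number.
1/3 == 0.3333333         # FALSE
```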

Floating point literals may include the exponent. To represent 2.2 × 10⁸:

> 2.2e8
[1] 2.2e+08
>

Notice that R adds the sign of the exponent and prefixes the eight with a zero. It is OK if you add the plus sign:

> 2.2e+8
[1] 2.2e+08
>

If you are trying to represent a small number, you use a negative exponent. For instance, if you are trying to represent the approximate probability of winning the California lottery on one ticket, you would type

> 2.4e-8
[1] 2.4e-08
>

To enter 10⁹, you type

> 1e9
[1] 1e+09
> 1E9
[1] 1e+09
>

Notice: a capital E is not the same thing as a lower case e, but it's OK to use a capital E for exponent. You can use 1e9 or 1E9.
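
One way to convince yourself that these spellings all stand for the same number is to ask R to compare them. This is only a sketch; it borrows the == comparison operator, which is covered later in this lecture.

```r
# Each line asks whether two ways of writing a number agree.
2.2e8 == 220000000   # TRUE
2.2e+8 == 2.2e8      # TRUE: the plus sign is optional
1E9 == 1e9           # TRUE: capital E works too
```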

You must begin numerical literals with a digit. If you want 2 × 10³, you write 2e3. If you want 10³, think of 1 × 10³ and type 1e3. If you try to start with just e, R complains:

> e3
Error: Object "e3" not found
> e+03
Error: Object "e" not found
>

Without the starting digit, the interpreter does not understand you are talking about a number. The first time, it thought you were talking about something called "e3", and didn't know about anything with that name. The second time, it thought you were trying to add 3 to something called "e", and didn't know of anything called "e". (We'll talk about addition soon, and variable/object names soon too.) Start all numeric literals with a digit.

You cannot use decimal points after the e of a floating point literal. If you want to represent 10^8.3, this will not work:

> 1e8.3
Error: syntax error
>

The interpreter does not consider this to be a grammatical sentence.

Exercise. How do you represent one thousand in exponential notation? What is 1e3? What is 10e3?

Only a finite number of bits are used to represent the exponent of a floating point number. Floating point numbers that are too big are converted to an internal format representing "infinity". This is designed for convenience in certain kinds of computations. However, the appearance of such quantities usually signifies an error, and this is called overflow:

> 1e308
[1] 1e+308
> 1e309
[1] Inf
>

R has indicated to you that an overflow condition has occurred by the use of the special representation Inf. Later I will show you an example of how this could arise in practice.

Similarly, there is a smallest quantity your system is capable of representing. Numbers smaller than this get treated as zero; this condition is called underflow. This can happen in practice also, as we'll see later in the class.

> 1e-323
[1] 9.881313e-323
> 1e-324
[1] 0
>
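
Overflow and underflow usually come from arithmetic rather than from typing a huge literal. The following sketch shows both. The functions is.infinite() and the .Machine list are part of base R but have not been introduced in this lecture, and the multiplication and division operators are covered a little further on.

```r
# Multiplying a large but representable number pushes it past the limit.
1e308 * 10               # Inf
is.infinite(1e308 * 10)  # TRUE

# Dividing a tiny number by 100 drops it below the smallest
# representable positive value, so the result is zero.
1e-323 / 100             # 0
1e-323 / 100 == 0        # TRUE

# The largest finite double on this machine (about 1.797693e+308).
.Machine$double.xmax
```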

String literals

Not all your data are numeric. You may have text data, such as names, or state codes, for instance. You may represent text literals by using double quotes.

> "CA"
[1] "CA"
>

Nothing particularly exciting is happening yet, but we've seen that we can type in data between quotation marks and have the computer seem to accept it. This is called character data and we'll see much more of it as we move on. What you have just seen here is a character literal, a literal representation of a string formed by typing a double quote character, a sequence of other characters, and a close quote.

Here are some more:

"Galileo Galilei"

"Galilei, Galileo"

As you can see, it is OK to have things like commas or blank spaces in a character string.

In character strings, case counts. "A" and "a" do not represent the same string, because "A" and "a" are two different characters.

What if for some reason you would like to have a quotation mark in a string? If you just type a quotation mark, it ends the string. If (for instance) you wish to represent the following quotation¹, you will have to protect the quotation marks in the string by preceding them with a backslash character \:

> "\"The work of the philosophical policeman,\" replied the man in blue, \"is at once bolder and more subtle than that of the ordinary detective.\""

This is a string literal representing a string whose first character is a double quotation mark, whose second character is a capital T, whose third is a lower case h, and so forth. The backslash turns off the special meaning of the double quotation mark, allowing it to be interpreted as an ordinary character; we say that the backslash has escaped the quotation mark. We will see other examples of escape sequences.

The literal expression "" represents a blank string, a string of zero length that contains no characters. It comes in useful from time to time, as we'll see soon.

R distinguishes a numeric literal from a string containing characters representing digits. For instance, "2" is NOT the same as 2 to R. The first, "2" is a character string of length one; it is text; it is a string literal. The second, 2, is a numeric literal. The distinction is important: for instance, US ZIP codes are not, despite appearances, numeric. ZIP codes are strings of five digit characters; if you replace a ZIP code by a numerically equivalent representation, the US Post Office will not behave the same way. For instance, the ZIP code of Concord, Massachusetts² is 01742, but if you try to send a letter there with 1.742 ×10³ instead of a ZIP code, I'm not sure what will happen - but it will probably not involve your letter getting there. If ZIP codes were truly numeric, you could change to an equivalent numeric representation, add them, subtract them, etc. Yet you can't. ZIP codes don't behave like numbers; they can't be used like numbers - so in a way they must not be numbers.

The same thing may occur with other forms of text. If you are using codes such as 456789R00 to represent subjects, you don't want the text 123456E04 to be considered numeric data with the E representing the exponent!
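
Here is a short sketch of what goes wrong when such codes are treated as numbers. It uses nchar(), which counts the characters in a string, and as.numeric(), which converts text to a number; both are base R functions that this lecture has not introduced.

```r
# A ZIP code stored as text keeps all five characters.
nchar("01742")            # 5

# Converted to a number, the leading zero is gone.
as.numeric("01742")       # 1742

# A subject code that happens to look like exponent notation
# turns into a very different value when treated as a number.
as.numeric("123456E04")   # 1234560000
```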

The system hates to have an unmatched quotation mark at the end of the line:

> "CA
Error: syntax error
>

Always close your strings with a matching double quote.

You may also use single quotes as delimiters:

> 'CA'
[1] "CA"
> 'CA"'
[1] "CA\""
> "CA'"
[1] "CA'"
> 'CA"
Error: syntax error
> ''CA"
Error: syntax error
>

As far as the interpreter is concerned, 'CA' is considered to be equivalent to "CA" for this purpose. If you are writing a string with double quotes in it, you may use single quotes as the delimiter for the beginning and end. Then you may cheerfully put the double quote in as part of the string, knowing the system is waiting for another single quote to end the string; notice that when the system echoes the string back to you, it uses double quotes as the beginning and ending delimiter and escapes the internal double quote. In the third example, we used a single quote inside a string delimited by double quotes; the result echoed back to us shows the string delimited by double quotes and the single quote NOT escaped. The fourth example shows that a double quote does not match a single quote; the system thinks the string is unterminated and complains; a double quote can't end a string started with a single quote (and vice versa). The final example shows that two single quotes don't equal a double quote.
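
The equivalences described above can be checked directly. This sketch relies on the == operator (covered later in this lecture) and on nchar() and cat(), base R functions not otherwise used here: nchar() counts characters and cat() writes a string without delimiters or escapes.

```r
# Single and double quotes produce the same string.
'CA' == "CA"            # TRUE

# An escaped double quote and a double quote inside single quotes
# are the same one-character string.
"\"" == '"'             # TRUE
nchar("\"")             # 1

# The string "" has no characters at all.
nchar("")               # 0

# cat() shows the characters themselves: C, A, and a double quote.
cat('CA"', "\n")
```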

Boolean literals

The symbols TRUE and FALSE represent logical truth and logical falsehood. Right now there is not much we can do with them.

> TRUE
[1] TRUE
>

As usual, TRUE and "TRUE" aren't the same thing. The first is a special symbol representing logical truth; the second is a text string with four characters.

> True
Error: Object "True" not found
>

Case matters. The boolean literal for truth is all uppercase.

You can't have any spaces inside the literal:

> TR UE
Error: syntax error
>

R read TR as a complete name, then found UE where it expected the expression to end, and got confused.

Simple Operations

Arithmetic

Now it's time to calculate something. R understands + and - for addition and subtraction.

> 2+3
[1] 5
> 5-1
[1] 4
>

R uses * for multiplication and / for division.

> 2*7
[1] 14
> 5/2.5
[1] 2
>

Multiplication and division are of higher precedence than addition and subtraction. When they appear together in an expression, the multiplication or division happens first unless otherwise indicated.

> 2*7+4
[1] 18
> 2*(7+4)
[1] 22
>

In the second example, the addition is forced to happen first by the use of parentheses.

Multiplication and division have equal precedence and happen in left to right order, so 20/4*5 is computed as (20/4)*5:

> 20/4*5
[1] 25
>
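
To see which grouping R applies to 20/4*5, you can write out both possible groupings with explicit parentheses and compare the results:

```r
20/4*5      # 25
(20/4)*5    # 25: the division happens first, matching the line above
20/(4*5)    # 1: a different answer, so the order matters
```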

If you add numbers with very different sizes, there may not be enough significant figures to represent the sum correctly. The floating point representation used by R uses only a finite number of significant places.

> 1+1e-15
[1] 1
> 1+1e-16
[1] 1
>

These look the same, but see if you understand the next lines:

> 1+1e-15-1
[1] 1.110223e-15
> 1+1e-16-1
[1] 0
> 1-1+1e-16
[1] 1e-16
>

In the first one, the computer added 1 and 10^-15 and had barely enough precision to express the result. It then subtracted one from the result and got the answer shown. In the second case, there was not enough precision to represent 1+10^-16 as anything other than just 1; the final subtraction yields zero. In the third case, the 1-1 yields zero, and then 1e-16 is added to zero. It is not that 1e-16 is too small for the computer to represent; it is the difference in magnitude between 1 and 1e-16 that causes the problem. The number 1 has only a single significant figure, and the number 1e-16 also has only a single significant figure. But 1+1e-16 has 17 significant figures.
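
R can tell you how fine its resolution near 1 is. The sketch below uses .Machine$double.eps, a value built into base R that this lecture has not introduced, along with the == comparison covered later; it is meant only to illustrate the explanation above.

```r
# The gap between 1 and the next larger representable number.
.Machine$double.eps            # about 2.220446e-16

# Adding the full gap changes 1.
1 + .Machine$double.eps == 1   # FALSE

# 1e-16 is less than half the gap, so the sum rounds back to 1.
1 + 1e-16 == 1                 # TRUE

# 1e-15 is large enough to register, though not exactly.
1 + 1e-15 == 1                 # FALSE
```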

The behavior we see exhibited by R is a consequence of the choice made by the designers to rely on the floating point capabilities of the underlying hardware. In the following fragment from SAS³ (in fact on a Sun Microsystems Solaris workstation), you see the same behavior:

6? data test;
7? vv=1+1e-15-1;
8? run;
9? proc print;
10? run;

Obs    vv
  1    1.1102E-15
NOTE: There were 1 observations read from the data set WORK.TEST.

Observe that the result of the computation 1+1e-15-1 is the same as we saw in R.

Arbitrary precision arithmetic routines are possible, but are slower, and should not be used unless the extra precision is needed.

Here are a few examples of computation.

We observe 117 individuals at baseline who are free of disease. At the one year follow-up, we observe 12 cases of disease. Compute an estimate of the incidence rate.
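Solution. The 105 people who remained free of disease each contributed one full year, giving 105 person-years. The 12 people who became cases were at risk for only part of the year; we estimate that together they contributed on average 6 person-years (half a year each). The incidence rate may then be computed by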

> 12/(105+6)
[1] 0.1081081
>
Suppose that 51 needlestick injuries are observed where the needle was contaminated with hepatitis C positive blood⁴. Three infections occur. Estimate the probability of infection with hepatitis C following a needlestick in a similar population.
Solution.

> 3/51
[1] 0.05882353
>
Suppose 340 cases of tuberculosis are diagnosed in a year in a region with a population of 750000. Report this as the number of cases per 100000 per year.
Solution.

> 340*(100000/750000)
[1] 45.33333
>

To three significant figures, we report 45.3 cases per 100000 this year.
Suppose 3681 blood draws were done, but that seven of the needles were contaminated (at random). What is the probability that any particular draw was taken with a contaminated needle?
Solution.

> 7/3681
[1] 0.001901657
>

We report this to three significant figures as 1.90 × 10⁻³.
Suppose 280 million people are to be vaccinated against smallpox, and that the probability of death following smallpox vaccination is one in one million. What is the expected number of deaths?
Solution.

> 280e6*1e-6
[1] 280
>
Suppose 17 cases of disease are seen in 1481 exposures, and 15 cases are seen in 4792 controls. Compute the odds ratio.
Solution.

> (17*(4792-15))/(15*(1481-17))
[1] 3.698042
>
Suppose 147 cases of rectal gonorrhea in men are reported out of 1793 total male cases from some region. If there is reason to believe that 20% of gonorrhea cases among males occur among MSMs, then what fraction of MSM cases involved rectal gonorrhea?
Solution.

> 147/(0.2*1793)
[1] 0.4099275
>

For this region, our estimate would be 41.0% to three significant figures. Notice it is essential to use the parentheses in the denominator. (These numbers are not meant to be realistic.)
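
Some of the computations above are easier to check when the pieces are given names. The sketch below does this for the incidence rate and the odds ratio. It uses the assignment arrow <- to store a value under a name, which has not been introduced yet; the names themselves (at_risk, person_years, and so on) are made up for illustration, and the half person-year per case is the assumption behind the 6 person-years in the first solution.

```r
# Incidence rate: 117 people at risk, 12 cases over one year.
at_risk <- 117
cases   <- 12
# Assumption: each case contributes half a person-year on average.
person_years <- (at_risk - cases) * 1 + cases * 0.5   # 105 + 6
cases / person_years                                   # 0.1081081

# Odds ratio: 17 cases in 1481 exposures, 15 cases in 4792 controls.
odds_exposed   <- 17 / (1481 - 17)
odds_unexposed <- 15 / (4792 - 15)
odds_exposed / odds_unexposed                          # 3.698042

# Rounding a result to three significant figures with signif().
signif(340 * (100000 / 750000), 3)                     # 45.3
```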

Further operations

You can raise numbers to powers by using the ^ operator.

> 10^3
[1] 1000
> 10^-3
[1] 0.001
> 10^2.2
[1] 158.4893
[1] 158.4893
>

As you can see, the caret (^) operator is kind enough to handle negative numbers as well as arbitrary decimal numbers. So you see that 10^2.2 is approximately 158.4893. Incidentally, this means that the logarithm to the base 10 of 158.4893 is 2.2; logarithms are the inverse operation to taking powers.
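
A few more cases may help fix the idea. The function log10() used here is base R's shorthand for the logarithm to base 10; it has not been introduced in this lecture, so take this as a sketch.

```r
10^2.2            # 158.4893
log10(10^2.2)     # 2.2: the logarithm undoes the power

# Negative exponents give reciprocals.
2^-3              # 0.125
1 / 2^3           # 0.125

# An exponent of 0.5 is a square root.
49^0.5            # 7
```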

Another pair of operators that comes in handy from time to time is the "remainder after division" operator, %%, and the "integer division" operator, %/%. As you saw earlier, R will gladly divide 5 by 2, for instance, to give 2.5. But sometimes you are interested in knowing that 2 goes into 5 only twice, with a remainder of one. Perhaps you are analyzing pill count data and the data show that one person was given 30 pills and was supposed to take 4 a day. So they should have had enough for 7 days, with 2 left over.

> 5 %% 2
[1] 1
> 5 %/% 2
[1] 2
> 30 %% 4
[1] 2
> 30 %/% 4
[1] 7
>
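
The two operators fit together: the integer quotient times the divisor, plus the remainder, gives back the number you started with. Here is a sketch using a made-up pill count of 100 pills at 3 a day, with the == comparison (covered later in this lecture) used to check the identity.

```r
# 100 pills at 3 a day.
100 %/% 3                          # 33 full days
100 %% 3                           # 1 pill left over

# Quotient times divisor, plus remainder, recovers the starting count.
(100 %/% 3) * 3 + 100 %% 3 == 100  # TRUE
(30 %/% 4) * 4 + 30 %% 4 == 30     # TRUE
```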

Calling Functions

Logarithms and square roots come up fairly often in statistics and data analysis. For instance, the standard deviation is the square root of the variance, data that may vary widely over many orders of magnitude (such as viral load data in HIV patients) are frequently transformed by taking the logarithm, and so forth. A desk calculator has a button with a square root on it, but the computer keyboard does not. What do we do in the command line interface to R?

Suppose we want to compute the square root of 54. One way to do it would be to start by observing that 7*7=49 and 8*8=64, so the square root is between 7 and 8. So we are looking for a number A so that A*A=54. We could then ask whether A is above or below 7.5; using R to square 7.5 gives 56.25. This is also larger than 54, so we conclude that the square root is between 7 and 7.5. Let's try to cut the interval in half again: 7.25 squared is 52.5625, so we now know the square root is between 7.25 and 7.5. One more: 7.375 squared is 54.39062, too big. So we have bracketed the answer we're looking for between 7.25 and 7.375. This is a simple example of what is called the bisection method in numerical analysis.
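
The halving procedure just described can be handed over to R. The following is only a sketch: it uses assignment (<-), a for loop, and if/else, none of which have been covered yet, and the names target, lower, upper, and mid are chosen here for illustration. Twenty halvings of the interval from 7 to 8 is an arbitrary choice.

```r
target <- 54
lower  <- 7      # 7 * 7 = 49, too small
upper  <- 8      # 8 * 8 = 64, too big

for (step in 1:20) {
  mid <- (lower + upper) / 2      # midpoint of the current bracket
  if (mid * mid > target) {
    upper <- mid                  # midpoint is too big: move the top down
  } else {
    lower <- mid                  # midpoint is too small: move the bottom up
  }
}

# The square root of 54 lies between these two values.
lower
upper
```

The first three passes through the loop reproduce the steps above, narrowing the bracket to 7.25 and 7.375; each further pass halves it again.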

Of course for something as common as the square root, R has a built-in function to compute this. We don't have to do it ourselves by the bisection method or some more sophisticated numerical algorithm!

> sqrt(54)
[1] 7.348469
>

To call a function, we type its name, in this case sqrt. Then we type an open parenthesis, the information we want to give the function, followed by a close parenthesis. The information between the parentheses is called the argument of the function. You have also seen the quit function q, which does not compute anything - rather, it ends the R session. It takes no arguments, so it is called from the command line as q(). Sometimes I will write "the function q()", including the parentheses along with the name to emphasize that q is a function.

Here are some things to watch out for:

> s qrt(54)
Error: syntax error
>

You can't put a space in the middle of the name of the function.

> sqtr(54)
Error: couldn't find function "sqtr"
>

It knows sqrt, but not if you spell it worng.

> SQRT(54)
Error: couldn't find function "SQRT"
>

Case matters: uppercase letters are quite different from lowercase letters as far as R is concerned.

> sqrt[54]
Error in sqrt[54] : object is not subsettable
>

Don't use square brackets for a function call. The complaint that R made will make sense when we learn (next class) what square brackets are really for. Use matching parentheses for function calls.

> sqrt{54}
Error: syntax error
>

Curly braces don't work either. Use matching parentheses for function calls.

> sqrt(54]
Error: syntax error
> sqrt[54)
Error: syntax error
>

Use matching parentheses for function calls.

> sqrt(54(
+

What happened here? The computer gave you a plus sign for a command prompt, not a greater than sign. And it didn't say anything at all. R thinks you're not through speaking to it, because the last parenthesis is open. The plus sign is the continuation prompt. Sooner or later you'll type something like this by mistake that leads to the continuation prompt, keep hitting return, and keep getting the continuation prompt. If this happens, just type a semicolon to end the sentence. In this case, the sentence (or command) we gave R is malformed when we terminate it, so R complains, but we get a normal command prompt back. There's nothing bad about any of this; these things happen. We'll see the continuation prompt frequently; it's useful.

> sqrt("54")
Error in sqrt("54") : Non-numeric argument to mathematical function
>

This doesn't work either. A string "54" is not the same as a numeric 54.

> sqrt(54
+ )
[1] 7.348469
>

Here, we just forgot the close parenthesis. R thinks you are still talking to it, and it gives you the continuation prompt. In this case, we just add the close parenthesis at the continuation prompt. R understands what we wanted to do, and gives us the answer.

> sqrt(54,)
[1] 7.348469
>

R was able to figure this out, even though there was a comma added by mistake.

> sqrt (54)
[1] 7.348469
> sqrt( 54)
[1] 7.348469
> sqrt(54 )
[1] 7.348469
>

Spaces in between the function name and the parenthesis, or between the parentheses and the argument, are OK with R.

> sqrt(54,64)
Error: 2 arguments passed to "sqrt" which requires 1.
>

The comma is an argument separator. It turns out that some functions take more than one argument, and you must separate them with commas. But sqrt takes only one argument, and complains if you try to give it more than one argument at a time. Next class we will learn that the single argument can be a "plural noun" so to speak, so it can take many square roots at a time if you get the grammar right.

> sqrt(-54)
[1] NaN
Warning message:
NaNs produced in: sqrt(-54)
>

Here, the instruction was syntactically correct, but semantically incorrect. Our version of R is not set up to calculate the square root of a negative number; square roots of negative numbers are so-called imaginary numbers. In this case, we got a result which was printed NaN, short for "Not a Number", together with a warning message.
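
If you need to check for this situation, base R provides is.nan(). The sketch below also shows one way to get an imaginary result, by converting the argument with as.complex() first; these functions, and suppressWarnings(), are not part of this lecture, so take this as an aside and not as something the course relies on.

```r
# is.nan() tests for the special "Not a Number" value.
# suppressWarnings() hides the warning so only the result is shown.
is.nan(suppressWarnings(sqrt(-54)))   # TRUE

# Given a complex argument, sqrt() returns an imaginary result
# instead of NaN.
sqrt(as.complex(-54))                 # 0+7.348469i
```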

To reiterate, the correct form is

> sqrt(54)
[1] 7.348469
>

Other mathematical functions

Another useful function is abs, which takes the absolute value.

> abs(5)
[1] 5
> abs(-5)
[1] 5
>

Here is how to compute the logarithm of 100 to base 10:

> log(100, 10)
[1] 2
> log(158.4893, 10)
[1] 2.2
>

The logarithm function can take two arguments: the first is what you want the logarithm of, and the second is the base for the logarithms.

If you call the logarithm function with only one argument, the logarithms are computed with respect to the quantity e, the base of the natural logarithms, which is about 2.718.

> log(10)
[1] 2.302585
>

The inverse of the natural logarithms is called the exponential function, and is called exp in R:

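> exp(2.302585)
[1] 9.999999
>

The result falls just short of 10 only because 2.302585 is log(10) rounded to seven significant figures.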
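
Here are a few more calls that show how log and exp fit together. The shorthand functions log10() and log2() are part of base R but have not been introduced in this lecture.

```r
log(100, 10)      # 2
log10(100)        # 2, a shorthand for base 10
log2(8)           # 3, a shorthand for base 2

# exp() and log() undo each other (up to roundoff).
exp(log(10))      # 10
log(exp(1))       # 1
exp(1)            # 2.718282, the number e
```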

Several other handy functions round numbers. These are ceiling, which goes up to the next larger integer; floor, which goes down to the next smaller integer; round, which rounds off to the nearest integer (subject to a few details given in the manual); signif, which rounds a number to a specified number of significant figures; and trunc, which truncates the fractional part. We'll say more about them later.
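
As a quick preview, here is a sketch of each rounding function on a simple value. The second argument to signif and round gives the number of significant figures and decimal places respectively.

```r
ceiling(2.3)             # 3
floor(2.3)               # 2
round(2.7)               # 3
trunc(2.7)               # 2

# floor() and trunc() differ for negative numbers.
floor(-2.7)              # -3
trunc(-2.7)              # -2

# Three significant figures versus one decimal place.
signif(45.33333, 3)      # 45.3
round(45.33333, 1)       # 45.3
```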

R also has a full complement of various common mathematical functions such as sin, cos, and tan. These trigonometric functions come up from time to time, and their argument must be expressed in radians, not degrees. The inverse functions are available as well: asin, acos, and atan. The answer is returned from these in radians.
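
Since angles are often recorded in degrees, you will usually need to convert. The sketch below uses the built-in constant pi, which this lecture has not mentioned; multiplying by pi/180 turns degrees into radians, and multiplying by 180/pi goes the other way.

```r
# Convert 30 degrees to radians before taking the sine.
sin(30 * pi / 180)       # 0.5

# Convert the result of an inverse function from radians to degrees.
asin(0.5) * 180 / pi     # 30
atan(1) * 180 / pi       # 45
```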

Comparisons

We've seen operators and functions that take numbers and return other numbers. For instance, the plus sign takes two numbers and returns a third. The square root function accepts a number and returns another number. (In fact, the plus sign is simply a convenient notation for the addition function.)

But we're now going to look at some important operators which take two numbers, and return a boolean (TRUE or FALSE) value. Specifically, we'll look at the comparison operators, >, <, >=, <=, == (equal to), and != (not equal to). The use of these is shown below:

> 2<3
[1] TRUE
> 2>3
[1] FALSE
> 2>=2
[1] TRUE
> 2>2
[1] FALSE
> 2==2
[1] TRUE
> 2!=2
[1] FALSE
>

We can already do things that don't seem quite so trivial:

> sqrt(8000)<=89.5
[1] TRUE
>

And we can see a few surprises:

> 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7
[1] 2
> 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 == 2
[1] FALSE
>

What's this? Why can't I add 2/7 up seven times and get 2? What you are seeing is roundoff error in action. The result of the addition is very close to 2, so close that it prints as 2. But it is very slightly different due to roundoff error, so the computer does not find them the same: the == operator returns FALSE.

This is of fundamental importance in working with floating point data. Do not perform equality checks using ==. Sometimes it will work. But frequently it won't. When checking for the equality of real numbers, always compare them to within a suitably small tolerance:

> abs((2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7) - 2) < 1e-10
[1] TRUE
>
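
The same thing happens with much simpler sums. The sketch below shows a well-known case and also uses all.equal(), a base R function (not covered in this lecture) that compares two numbers to within a small tolerance; wrapping it in isTRUE() gives a plain TRUE or FALSE.

```r
# 0.1 + 0.2 is not stored as exactly 0.3.
0.1 + 0.2 == 0.3                      # FALSE
abs((0.1 + 0.2) - 0.3) < 1e-10        # TRUE

# all.equal() does the tolerance comparison for you.
isTRUE(all.equal(0.1 + 0.2, 0.3))     # TRUE
isTRUE(all.equal(2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7, 2))  # TRUE
```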

Exercise. Imagine that you have loaded some data into R (using methods I'll show you later). You print out the first three values, and they all appear to be 20. Yet when you ask R whether they are <= 20, it returns FALSE. Why?

Functions for character data

We've seen functions (like sqrt) and operators (like +) for working with numeric data. Now we'll look at one function for working with character data.

To join two strings together to produce a new string, we use the paste function. This function is quite general and useful, and we'll have to tell it to just join the strings with nothing in between them. This will require a special syntax which we'll learn more about later. For now, here is how we may join two strings⁵.

> paste("There are more things in heaven and earth, Horatio","Than are dreamt of in your philosophy.",sep="")

If you leave off the sep="", then R will by default separate the strings with a blank. By giving R the blank string, you are telling it to separate the strings with nothing at all - to just join them together.
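
Here is a short sketch of how the sep argument changes the result, using some made-up strings:

```r
paste("San", "Francisco")               # "San Francisco"
paste("San", "Francisco", sep = "")     # "SanFrancisco"
paste("CA", "94110", sep = "-")         # "CA-94110"
```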

Exercise. Find the mistake in the example of the Shakespeare quote.

Operators for boolean data

We saw the boolean literals TRUE and FALSE already. And we've seen how we can use the six comparison operators in order to compare numeric data and thereby yield boolean answers.

But what do we do with boolean values? We've all seen examples such as:

Exclude a subject from the study if they are over 70 or if they are currently using metformin.
Include subjects only if they are between the ages of 18 and 35 and reside in a zip code beginning with 94 or 95 and report using intravenous drugs at least once
Find a book one of whose authors is "Fenner" and whose title contains the word "Eradication"
Find a web page containing either the word "anthrax" or "anthracis", the word "human", but not the word "music" or "metal"

R has operators which allow us to manipulate logical values in these ways. Today we will learn the operators & (and), | (or), and ! (not).

The use of these is illustrated here:

> 4 < sqrt(17)& 2<3
[1] TRUE
> 4 < sqrt(17)& 2>3
[1] FALSE
> 4 < sqrt(17)| 2>3
[1] TRUE
> 4 > sqrt(17)| 2>3
[1] FALSE
> !(2 > 3)
[1] TRUE
>

In the last example we placed parentheses around the comparison to make it explicit that the comparison happens first. In R the comparison operators have higher precedence than the not operator !, so the comparison would be evaluated first even without the parentheses. The operator ! in turn has higher precedence than &, which has higher precedence than |. Use parentheses when in doubt.
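
To connect these operators with the kind of study criteria listed earlier, here is a sketch for one hypothetical subject. The names age and uses_metformin and their values are made up for illustration, and the assignment arrow <- has not been introduced yet.

```r
# A hypothetical subject.
age <- 72
uses_metformin <- FALSE

# Exclude a subject if they are over 70 or currently using metformin.
age > 70 | uses_metformin          # TRUE: this subject is excluded

# Is the subject between the ages of 18 and 35?
age >= 18 & age <= 35              # FALSE

# & is applied before |, so these two lines give different answers.
TRUE | FALSE & FALSE               # TRUE
(TRUE | FALSE) & FALSE             # FALSE
```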

Functions for examining data

Another useful class of functions are those which examine data objects in R. We'll look at three of these today: is.character, is.numeric, and is.logical.

> is.numeric(2)
[1] TRUE
> is.numeric("2")
[1] FALSE
> is.character(2)
[1] FALSE
> is.character("2")
[1] TRUE
> is.logical((2 < 3))
[1] TRUE
>

Here, it's obvious whether a literal is numeric or not. But we will need these functions later when we deal with non-literal data, such as a data set we read in from a file. The function is.logical tests whether a value is boolean or not; experiment with it.
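
Here are a few experiments of the kind suggested above, as a starting point:

```r
is.logical(TRUE)       # TRUE
is.logical("TRUE")     # FALSE: this is text
is.numeric(TRUE)       # FALSE: a logical value is not numeric
is.numeric(1e9)        # TRUE
is.character("")       # TRUE: a string with no characters is still text
is.numeric(sqrt(54))   # TRUE: the result of a numeric function
```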

Proceed to Functions, Vectors, and Repetition.

¹ The Man who was Thursday, by G. K. Chesterton.

² Near where the battle of Lexington and Concord occurred in 1775.

³ For our course in R programming, you are not expected to understand SAS code. The SAS code, however, demonstrates something about R - namely, that some behaviors of R reflect the behavior of the underlying hardware, and other software systems that use the hardware in the same way show the same behavior. When such examples from other languages are used, we will always code the transcript in blue (as we did this time).

⁴ Hamid SS, Farooqui B, Rizvi Q, Sultana T, Siddiqui AA. Risk of transmission and features of hepatitis C after needlestick injuries. Infection control and hospital epidemiology 20:63-64, 1999.

⁵ Hamlet