We are working with the command line interface to the R interpreter.
Working with R in this way is much like using a desk calculator. You type an expression for R to evaluate. R reads the expression, evaluates it, and prints the result.
How you start up R depends on your computer. On many systems you may simply click or double click on an icon representing the program R.
When you finish an R session, you may quit by typing q(), the quit function. Depending on your system, you may be able to quit by using a menu interface.
When you quit an R session, R gives you an opportunity to save your session. If you decline to save the workspace, then the session is lost and R will not remember what you did next time you begin. If you agree to save the session, then R will start next time where you left off this time.
Exercise. Start the R interpreter and see that it is working properly. Then exit by using the quit function q().
The simplest type of expression you may enter for R to evaluate is simply a literal. A literal is an expression which is supposed to stand for itself, so to speak.
Numeric literals represent numbers.
For example, you may type ordinary integers into the interpreter:
> 2 [1] 2 > 
This is certainly not a hard computation. You asked it to compute the value of 2, and it let you know that the result is just 2. Here, we simply typed the literal 2, which represented the number two.
Numbers you type in to R are normally converted to floating point representation. In a floating point representation, the significant figures are stored separately from the exponent. Only a finite number of significant figures are kept.
For instance, if you were computing with pencil and paper, you might approximate one third by 0.3333333. For many purposes this might be close enough to one third, close enough that the difference would not matter for your purpose; what is good enough for one purpose might not be good enough for another. But conceptually, 0.3333333 is NOT equal to one third. The difference is called roundoff error. We will discuss this more in subsequent lectures.
R tries to minimize roundoff error by keeping more digits internally than you will need for many applications. Internally, R tries to take advantage of double precision floating point representations of your numbers.
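To make the idea of roundoff error concrete, here is a small sketch comparing the pencil-and-paper approximation with what R stores for one third. The names `approx_third`, `third`, and `roundoff` are just ones chosen for illustration, and the arrow `<-` (which stores a value under a name) is something these notes have not introduced yet.

```r
# Seven-digit pencil-and-paper approximation of one third.
approx_third <- 0.3333333

# R's own double precision value of one third.
third <- 1 / 3

# The gap between the two is the roundoff error of the approximation.
roundoff <- third - approx_third
print(roundoff)

# R normally prints about 7 significant digits; asking for more shows
# that many more digits are kept internally.
print(third, digits = 17)
```

The gap is tiny, but it is not zero, which is the whole point.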
Floating point literals may include the exponent. To represent 2.2 × 10^{8}:
> 2.2e8 [1] 2.2e+08 > 
Notice that R adds the sign of the exponent and prefixes the eight with a zero. It is OK if you add the plus sign:
> 2.2e+8 [1] 2.2e+08 > 
If you are trying to represent a small number, you use a negative exponent. For instance, if you are trying to represent the approximate probability of winning the California lottery on one ticket, you would type
> 2.4e-8 [1] 2.4e-08 > 
To enter 10^{9}, you type
> 1e9 [1] 1e+09 > 1E9 [1] 1e+09 > 
You must begin numerical literals with a digit. If you want 2 × 10^{3}, you write 2e3. If you want 10^{3}, think of 1 × 10^{3} and type 1e3. If you try to start with just e, R complains:
> e3 Error: Object "e3" not found > e+03 Error: Object "e" not found > 
You cannot use decimal points after the e of a floating point literal. If you want to represent 10^{8.3}, this will not work:
> 1e8.3 Error: syntax error > 
Exercise. How do you represent one thousand in exponential notation? What is 1e3? What is 10e3?
Only a finite number of bits are used to represent the exponent of a floating point number. Floating point numbers that are too big are converted to an internal format representing "infinity". This is designed for convenience in certain kinds of computations. However, the appearance of such quantities usually signifies an error, and this is called overflow:
> 1e308 [1] 1e+308 > 1e309 [1] Inf > 
Similarly, there is a smallest quantity your system is capable of representing. Numbers smaller than this get treated as zero; this condition is called underflow. This can happen in practice also, as we'll see later in the class.
> 1e-323 [1] 9.881313e-324 > 1e-324 [1] 0 > 
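If you would like to see where these limits lie on your own machine, the following sketch may help. It assumes the usual double precision hardware; `.Machine` is a built-in R list describing the numerical characteristics of the machine, and the names `largest`, `too_big`, and `tiny` are chosen only for this example.

```r
# Largest finite double precision number on this machine.
largest <- .Machine$double.xmax
print(largest)

# Going past it overflows to Inf.
too_big <- largest * 10
print(is.infinite(too_big))

# A literal smaller than the smallest representable positive number
# underflows to zero.
tiny <- 1e-324
print(tiny == 0)
```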
Not all your data are numeric. You may have text data, such as names, or state codes, for instance. You may represent text literals by using double quotes.
> "CA" [1] "CA" > 
Here are some more:
"Galileo Galilei" 
"Galilei, Galileo" 
In character strings, case counts. "A" and "a" do not represent the same string, because "A" and "a" are two different characters.
What if for some reason you would like to have a quotation mark in a string? If you just type a quotation mark, it ends the string. If (for instance) you wish to represent the following quotation^{1}, you will have to protect the quotation marks in the string by preceding them with a backslash character \:
> "\"The work of the philosophical policeman,\" replied the man in blue, \"is at once bolder and more subtle than that of the ordinary detective.\"" 
The literal expression "" represents a blank string, a string of zero length that contains no characters. It comes in useful from time to time, as we'll see soon.
R distinguishes a numeric literal from a string containing characters representing digits. For instance, "2" is NOT the same as 2 to R. The first, "2", is a character string of length one; it is text; it is a string literal. The second, 2, is a numeric literal. The distinction is important: for instance, US ZIP codes are not, despite appearances, numeric. ZIP codes are strings of five digit characters; if you replace a ZIP code by a numerically equivalent representation, the US Post Office will not behave the same way. For instance, the ZIP code of Concord, Massachusetts^{2} is 01742, but if you try to send a letter there with 1.742 × 10^{3} instead of a ZIP code, I'm not sure what will happen, but it will probably not involve your letter getting there. If ZIP codes were truly numeric, you could change to an equivalent numeric representation, add them, subtract them, etc. Yet you can't. ZIP codes don't behave like numbers; they can't be used like numbers, so in a way they must not be numbers.
The same thing may occur with other forms of text. If you are using codes such as 456789R00 to represent subjects, you don't want the text 123456E04 to be considered numeric data with the E representing the exponent!
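Here is a short sketch of what goes wrong when such codes are treated as numbers. It uses `as.numeric`, `as.character`, and `nchar`, three built-in functions that these notes have not covered yet; the variable names are made up for the example.

```r
# A ZIP code kept as text keeps all five characters.
zip_text <- "01742"
print(nchar(zip_text))               # 5

# Converting it to a number silently drops the leading zero.
zip_number <- as.numeric(zip_text)
print(zip_number)                    # 1742
print(as.character(zip_number))      # "1742", no longer a valid ZIP code

# A subject code containing an E is read as exponential notation.
code_number <- as.numeric("123456E04")
print(code_number == 1234560000)     # TRUE
```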
The system hates to have an unmatched quotation mark at the end of the line:
> "CA Error: syntax error > 
You may also use single quotes as delimiters:
> 'CA' [1] "CA" > 'CA"' [1] "CA\"" > "CA'" [1] "CA'" > 'CA" Error: syntax error > ''CA" Error: syntax error > 
The symbols TRUE and FALSE represent logical truth and logical falsehood. Right now there is not much we can do with them.
> TRUE [1] TRUE > 
> True Error: Object "True" not found > 
> TR UE Error: syntax error > 
Now it's time to calculate something. R understands + and - for addition and subtraction.
> 2+3 [1] 5 > 5-1 [1] 4 > 
R uses * for multiplication and / for division.
> 2*7 [1] 14 > 5/2.5 [1] 2 > 
Multiplication and division are of higher precedence than addition and subtraction. When they appear together in an expression, the multiplication or division happens first unless otherwise indicated.
> 2*7+4 [1] 18 > 2*(7+4) [1] 22 > 
Multiplication and division happen in left to right order, so 20/4*5 means (20/4)*5, not 20/(4*5):
> 20/4*5 [1] 25 > 
If you add numbers with very different sizes, there may not be enough significant figures to represent the sum correctly. The floating point representation used by R uses only a finite number of significant places.
> 1+1e-15 [1] 1 > 1+1e-16 [1] 1 > 
> 1+1e-15-1 [1] 1.110223e-15 > 1+1e-16-1 [1] 0 > 1-1+1e-16 [1] 1e-16 > 
The behavior we see exhibited by R is a consequence of the choice made by the designers to rely on the floating point capabilities of the underlying hardware. In the following fragment from SAS^{3} (in fact on a Sun Microsystems Solaris workstation), you see the same behavior:
6? data test; 7? vv=1+1e-15-1; 8? run; 9? proc print; 10? run; 
Obs vv

Arbitrary precision arithmetic routines are possible, but are slower, and should not be used unless the extra precision is needed.
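The point at which a small number gets lost when added to 1 can be looked up directly. The sketch below assumes ordinary double precision hardware and uses the built-in value `.Machine$double.eps`; the names `eps`, `kept`, and `lost` are chosen for illustration only.

```r
# Spacing between 1 and the next larger double precision number.
eps <- .Machine$double.eps
print(eps)

# Adding eps to 1 changes the result, and subtracting 1 recovers eps.
kept <- (1 + eps) - 1
print(kept == eps)

# Adding half of eps is too small to register: the sum rounds back to 1.
lost <- (1 + eps / 2) - 1
print(lost == 0)
```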
Here are a few examples of computation.
> 12/(105+6) [1] 0.1081081 > 
> 3/51 [1] 0.05882353 > 
> 340*(100000/750000) [1] 45.33333 > 
> 7/3681 [1] 0.001901657 > 
> 280e6*1e-6 [1] 280 > 
> (17*(4792-15))/(15*(1481-17)) [1] 3.698042 > 
> 147/(0.2*1793) [1] 0.4099275 > 
You can raise numbers to powers by using the ^ operator.
> 10^3 [1] 1000 > 10^-3 [1] 0.001 > 10^2.2 [1] 158.4893 > 
Another pair of operators that comes in handy from time to time is the "remainder after division" operator, %%, and the "integer division" operator, %/%. As you saw earlier, R will gladly divide 5 by 2, for instance, to give 2.5. But sometimes you are interested in knowing that 2 goes into 5 only twice, with a remainder of one. Perhaps you are analyzing pill count data and the data show that one person was given 30 pills and was supposed to take 4 a day. So they should have had enough for 7 days, with 2 left over.
> 5 %% 2 [1] 1 > 5 %/% 2 [1] 2 > 30 %% 4 [1] 2 > 30 %/% 4 [1] 7 > 
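The two operators fit together: the number of whole days times the daily dose, plus the leftover pills, gives back the original count. A small sketch of the pill example, with variable names invented for the purpose:

```r
pills   <- 30   # pills dispensed
per_day <- 4    # pills to be taken each day

full_days <- pills %/% per_day   # 7 whole days of pills
left_over <- pills %% per_day    # 2 pills remaining

# Putting the pieces back together recovers the original count.
print(full_days * per_day + left_over == pills)   # TRUE
```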
Logarithms and square roots come up fairly often in statistics and data analysis. For instance, the standard deviation is the square root of the variance, data that may vary widely over many orders of magnitude (such as viral load data in HIV patients) are frequently transformed by taking the logarithm, and so forth. A desk calculator has a button with a square root on it, but the computer keyboard does not. What do we do in the command line interface to R?
Suppose we want to compute the square root of 54. One way to do it would be to start by observing that 7*7=49 and 8*8=64, so the square root is between 7 and 8. So we are looking for a number A so that A*A=54. We could then ask whether A is above or below 7.5; using R to square 7.5 gives 56.25. This is also larger than 54, so we conclude that the square root is between 7 and 7.5. Let's try to cut the interval in half again: 7.25 squared is 52.5625, so we now know the square root is between 7.25 and 7.5. One more: 7.375 squared is 54.39062, too big. So we have bracketed the answer we're looking for between 7.25 and 7.375. This is a simple example of what is called the bisection method in numerical analysis.
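For the curious, the halving procedure just described can be written out as a short R function. This is only a sketch: it uses assignment, loops, and function definitions, none of which these notes have covered yet, and the names `bisect_sqrt`, `lower`, `upper`, and `tol` are hypothetical ones chosen for this example.

```r
# Bisection search for the square root of a positive number `target`,
# starting from a bracket [lower, upper] known to contain the root.
bisect_sqrt <- function(target, lower, upper, tol = 1e-6) {
  # Keep halving the bracket until it is narrower than the tolerance.
  while (upper - lower > tol) {
    mid <- (lower + upper) / 2
    if (mid * mid > target) {
      upper <- mid   # midpoint squared is too big: root is in the lower half
    } else {
      lower <- mid   # midpoint squared is not too big: root is in the upper half
    }
  }
  # Report the middle of the final, very narrow bracket.
  (lower + upper) / 2
}

# Start from the bracket found above: 7*7 = 49 and 8*8 = 64.
root <- bisect_sqrt(54, 7, 8)
print(root)
```

Each pass through the loop halves the width of the bracket, so about twenty passes are enough to pin the answer down to within one millionth.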
Of course for something as common as the square root, R has a built-in function to compute this. We don't have to do it ourselves by the bisection method or some more sophisticated numerical algorithm!
> sqrt(54) [1] 7.348469 > 
To call a function, we type its name, in this case sqrt. Then we type an open parenthesis, the information we want to give the function, followed by a close parenthesis. The information between the parentheses is called the argument of the function. You have also seen the quit function q, which does not compute anything; rather, it ends the R session. It takes no arguments, so it is called from the command line as q(). Sometimes I will write "the function q()", including the parentheses along with the name to emphasize that q is a function.
Here are some things to watch out for:
> s qrt(54) Error: syntax error > 
> sqtr(54) Error: couldn't find function "sqtr" > 
> SQRT(54) Error: couldn't find function "SQRT" > 
> sqrt[54] Error in sqrt[54] : object is not subsettable > 
> sqrt{54} Error: syntax error > 
> sqrt(54] Error: syntax error > sqrt[54) Error: syntax error > 
> sqrt(54( + 
> sqrt("54") Error in sqrt("54") : Non-numeric argument to mathematical function > 
> sqrt(54 + ) [1] 7.348469 > 
> sqrt(54,) [1] 7.348469 > 
> sqrt (54) [1] 7.348469 > sqrt( 54) [1] 7.348469 > sqrt(54 ) [1] 7.348469 > 
> sqrt(54,64) Error: 2 arguments passed to "sqrt" which requires 1. > 
> sqrt(-54) [1] NaN Warning message: NaNs produced in: sqrt(-54) > 
To reiterate, the correct form is
> sqrt(54) [1] 7.348469 > 
Another useful function is abs, which takes the absolute value.
> abs(5) [1] 5 > abs(-5) [1] 5 > 
Here is how to compute the logarithm of 100 to base 10:
> log(100, 10) [1] 2 > log(158.4893, 10) [1] 2.2 > 
If you call the logarithm function with only one argument, the logarithms are computed with respect to the quantity e, the base of the natural logarithms, which is about 2.718.
> log(10) [1] 2.302585 > 
The inverse of the natural logarithms is called the exponential function, and is called exp in R:
> exp(2.302585) [1] 10 > 
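A few more calls may help tie the logarithm and exponential functions together. This is a sketch using only built-in functions; `log10` is a shortcut for base 10 logarithms that has not been mentioned above, and the variable names are made up for the example.

```r
# Two ways of asking for a base 10 logarithm.
base10_a <- log(100, 10)
base10_b <- log10(100)
print(c(base10_a, base10_b))

# Any base works: 2 raised to the power 3 is 8.
base2 <- log(8, 2)
print(base2)

# exp undoes the natural logarithm (up to roundoff).
round_trip <- exp(log(10))
print(round_trip)

# exp(1) is the number e itself, about 2.718.
print(exp(1))
```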
Other handy utility functions are used for rounding numbers. These are ceiling, which goes up to the next largest integer; floor, which goes down to the next smallest integer; round, which rounds off to the nearest integer (subject to a few details given in the manual); signif, which can round a number to a specified number of significant figures; and trunc, which truncates the fractional part. We'll say more about them later.
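As a preview, here is a sketch of these rounding functions applied to a few sample values. The inputs and variable names are chosen purely for illustration; note in particular how floor and trunc differ for negative numbers.

```r
up      <- ceiling(2.3)       # 3: up to the next integer
down    <- floor(2.7)         # 2: down to the next integer
nearest <- round(2.7)         # 3: nearest integer
places  <- round(3.14159, 2)  # 3.14: second argument gives decimal places
figures <- signif(123456, 2)  # 120000: two significant figures

# For negative numbers, floor goes down while trunc drops the fraction.
neg_floor <- floor(-2.7)      # -3
neg_trunc <- trunc(-2.7)      # -2

print(c(up, down, nearest, places, figures, neg_floor, neg_trunc))
```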
R also has a full complement of various common mathematical functions such as sin, cos, and tan. These trigonometric functions come up from time to time, and their argument must be expressed in radians, not degrees. The inverse functions are available as well: asin, acos, and atan. The answer is returned from these in radians.
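Since most of us think in degrees, a conversion step is usually needed. The sketch below uses the built-in constant `pi`, which has not been introduced above; the other names are invented for the example.

```r
# Convert 30 degrees to radians before calling sin.
degrees <- 30
radians <- degrees * pi / 180

sine_30 <- sin(radians)
print(sine_30)              # about 0.5

# The inverse function answers in radians, so convert back to degrees.
angle_back <- asin(0.5) * 180 / pi
print(angle_back)           # about 30
```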
We've seen operators and functions that take numbers and return other numbers. For instance, the plus sign takes two numbers and returns a third. The square root function accepts a number and returns another number. (In fact, the plus sign is simply a convenient notation for the addition function.)
But we're now going to look at some important operators which take two numbers, and return a boolean (TRUE or FALSE) value. Specifically, we'll look at the comparison operators, >, <, >=, <=, == (equal to), and != (not equal to). The use of these is shown below:
> 2<3 [1] TRUE > 2>3 [1] FALSE > 2>=2 [1] TRUE > 2>2 [1] FALSE > 2==2 [1] TRUE > 2!=2 [1] FALSE > 
We can already do things that don't seem quite so trivial:
> sqrt(8000)<=89.5 [1] TRUE > 
And we can see a few surprises:
> 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 [1] 2 > 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 == 2 [1] FALSE > 
This is of fundamental importance in working with floating point data. Do not perform equality checks using ==. Sometimes it will work. But frequently it won't. When checking for the equality of real numbers, always compare them to within a suitably small tolerance:
> abs((2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7) - 2) < 1e-10 [1] TRUE > 
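R also ships with a helper for this kind of comparison. The following sketch uses the built-in functions `all.equal` and `isTRUE`, neither of which has been covered in these notes; `all.equal` compares two numbers to within a small default tolerance, and the names `total` and `close_enough` are chosen for illustration.

```r
# The sum that prints as 2 but is not exactly 2.
total <- 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7 + 2/7

# Comparison by hand, to within a small tolerance.
print(abs(total - 2) < 1e-10)

# The same idea using R's built-in tolerance comparison.
close_enough <- isTRUE(all.equal(total, 2))
print(close_enough)
```

Wrapping the call in isTRUE is the usual idiom, because all.equal returns a description of the difference rather than FALSE when the values are not close.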
Exercise. Imagine that you have loaded some data into R (using methods I'll show you later). You print out the first three values, and they all appear to be 20. Yet when you ask R whether they are <= 20, it returns FALSE. Why?
We've seen functions (like sqrt) and operators (like +) for working with numeric data. Now we'll look at one function for working with character data.
To join two strings together to produce a new string, we use the paste function. This function is quite general and useful, and we'll have to tell it to just join the strings with nothing in between them. This will require a special syntax which we'll learn more about later. For now, here is how we may join two strings^{5}.
> paste("There are more things in heaven and earth, Horatio","Than are dreamt of in your philosophy.",sep="") 
Exercise. Find the mistake in the example of the Shakespeare quote.
We saw the boolean literals TRUE and FALSE already. And we've seen how we can use the six comparison operators in order to compare numeric data and thereby yield boolean answers.
But what do we do with boolean values? We've all seen examples such as:
R has operators which allow us to manipulate logical values in these ways. Today we will learn the operators & (and), | (or), and ! (not).
The use of these is illustrated here:
> 4 < sqrt(17)& 2<3 [1] TRUE > 4 < sqrt(17)& 2>3 [1] FALSE > 4 < sqrt(17)| 2>3 [1] TRUE > 4 > sqrt(17)| 2>3 [1] FALSE > !(2 > 3) [1] TRUE > 
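It can be easier to see what is going on if the two comparisons are stored first and then combined. This is a sketch using the and operator `&`, the or operator `|`, and the not operator `!`; the names `a`, `b`, and the others are hypothetical, and the parentheses around each `!` expression are there because `!` binds less tightly than the comparison operators.

```r
a <- 4 < sqrt(17)   # TRUE, since sqrt(17) is a little more than 4
b <- 2 > 3          # FALSE

both   <- a & b     # FALSE: "and" needs both to be TRUE
either <- a | b     # TRUE: "or" needs at least one to be TRUE
not_b  <- !b        # TRUE: "not" flips the value

# "Not (a and b)" is the same as "(not a) or (not b)".
de_morgan <- (!(a & b)) == ((!a) | (!b))

print(c(both, either, not_b, de_morgan))
```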
Another useful class of functions are those which examine data objects in R. We'll look at three of these today: is.character, is.numeric, and is.logical.
> is.numeric(2) [1] TRUE > is.numeric("2") [1] FALSE > is.character(2) [1] FALSE > is.character("2") [1] TRUE > is.logical((2 < 3)) [1] TRUE > 
^{1} The Man who was Thursday, by G. K. Chesterton.
^{2} Near where the battle of Lexington and Concord occurred in 1775.
^{3} For our course in R programming, you are not expected to understand SAS code. The SAS code, however, demonstrates something about R: namely, that some behaviors of R reflect the behavior of the underlying hardware, and other software systems that use the hardware in the same way show the same behavior. When such examples from other languages are used, we will always code the transcript in blue (as we did this time).
^{4} Hamid SS, Farooqui B, Rizvi Q, Sultana T, Siddiqui AA. Risk of transmission and features of hepatitis C after needlestick injuries. Infection control and hospital epidemiology 20:63-64, 1999.
^{5} Hamlet