Biostatistics Final Examination, Fall 2001.

Travis Porco, Instructor

Administrative Details

This examination is due at 6:00 PM on December 19, 2001. There will be no exceptions.
You may FAX the examination to me at 415-431-7029; you may email the examination as a *.pdf, *.html, or *.dvi file (no MS Word attachments please); you may mail the examination to me at PO Box 4392, Berkeley CA 94704; you may deliver the examination to me at 25 Van Ness Avenue, Suite 710, San Francisco. If you choose to mail the examination, it is your responsibility to make sure it gets to me by 6:00 PM on December 19.
Please use precise, legible, clear, and organized language. Present your ideas in a complete and professional manner. Type your work; use a spell-checker; use complete sentences and organized paragraphs.
Present your work in your own language. Two identical papers will both receive failing grades, without exception.

Scope

Question 1

Explain how inferential statistics is designed to help us avoid misleading conclusions based on chance patterns which might arise in our data. Give an example.

Question 2

Explain why biostatistics is important in both public health practice and public health advocacy. Why is good science important to public health? You might wish to discuss such issues as the ability of advocacy to help the right questions get asked, and to implement public health recommendations based on good science. Give specific examples from your work or from the class.

Definitions

Question 3

Give precise, complete, and unambiguous definitions and explanations of the following ideas, with examples if necessary. Choose three out of the four.

Sample mean
Sample variance
Relative frequency
Population mean

Question 4

Give precise, complete, and unambiguous definitions and explanations of the following ideas, with examples if necessary. Choose three out of the four.

Null hypothesis
Type I Error
Type II Error
Independence of events

Question 5

Summarize your knowledge about the following:

Histogram
Normal distribution
Measure of spread

Question 6

Summarize your knowledge about the following. Give a definition, significance, and applications; explain how the concepts relate to other concepts in the class. Choose four out of five.

Binomial distribution
Random sample
Sample Covariance
Sample Correlation Coefficient
Confidence interval

Probability

Question 7

Someone says, "I do not believe that smoking causes cancer, since my grandmother smoked 4 packs of cigarettes every day and lived to be 93." Give a precise, complete response to this, free of mathematical jargon, and yet answering this objection. You might want to consider such issues as the probabilistic nature of our present understanding of epidemiologic risk factors, and the difficulty of predicting individual outcomes from these probabilities.

Question 8

Someone says, "My husband and I are both sickle cell carriers. Our doctor says that we should expect one in four of our children to have sickle cell disease, and we have had three healthy children. We wish to have a fourth child, but we believe that the doctor's remarks indicate that the next child is certain to have sickle cell disease." Give a precise, complete, and unambiguous response to this, free of mathematical jargon. You might want to consider such issues as the independence of events or the relative frequency definition of probability; what did the doctor really mean by "expect one in four to have sickle cell disease"?

Question 9

Choose three out of five of the following. Summarize your knowledge about:

The chi-square test of independence in a two-by-two table
The Z test of whether an observed proportion equals a known value
The two-sample T-test
One-way analysis of variance
Linear regression

In particular, explain in words and symbols what the null hypothesis is in each case. What is the test statistic? What is the distribution of the test statistic given the null hypothesis? Give an example when you might use each of these methods.

Question 10

Suppose you are given the following data to analyze:

X	Y
50	115.8
51	115.5
52	110.9
53	107.6
54	116.2
55	109.3
56	110.5
57	107.6
58	116.0
59	113.6
60	112.3
61	119.5
62	130.0
63	122.0
64	115.2
65	133.2
66	132.3
67	142.6
68	132.7
69	135.4
70	132.8

First, enter these data into Rweb, by proceeding to the Rweb window and entering
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
and then
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5, 107.6,116.0,113.6,112.3,119.5,130.0,122.0,115.2,133.2, 132.3,142.6,132.7,135.4,132.8)
Type both of these into the Rweb window. Then copy them to the clipboard of your machine and save them. Then hit the submit button and see what happens. You must enter these data in exactly the form you see; do not type a period when you need a comma, or a comma when you need a period; do not put a space between the symbols < and -; do not omit a parenthesis. If there is a mistake or a problem, the machine will respond with an error message and "Execution halted". If this happens, use the back button and return to the entry window; if your text is still there, correct it and try to submit again; if your text has vanished, paste it back in, fix it, save it again, and hit submit again.
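As an optional sanity check before going further, the sketch below counts the entries in each vector. It assumes nothing beyond base R; the check itself is a suggestion and is not required by the examination. The table above has 21 rows, so both counts should be 21.

```r
# Data exactly as given in the examination
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5,107.6,116.0,113.6,112.3,
        119.5,130.0,122.0,115.2,133.2,132.3,142.6,132.7,135.4,132.8)

# Each vector should have one entry per row of the data table
print(length(xx))
print(length(yy))

# Stops with an error if a value was dropped from either vector
stopifnot(length(xx) == length(yy))
```

If the two counts differ, a value was probably skipped or a comma was mistyped as a period.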

Feel free to ask me any computer questions.

Once you have been able to enter the data without an error, add a plot command to the end of your statements:
plot(xx,yy)
Your Rweb entry box should contain the command starting with xx, the command starting with yy, and then the plot command. Hit submit. This plot command will produce a scatterplot; print it and interpret it. What does it mean? Which variable is the predictor; which is the response? How would you describe the relationship?

Now, we are going to add some more commands to do the analysis using Rweb. Use the back button to go back to the entry window. If your text is there, you will add more to it. If your text is not there, paste it back in from the clipboard and then add some more commands to it.

To analyze the data in Rweb, we will use the built-in command for linear modeling as follows:
cor(xx,yy)
our.model <- lm(yy ~ xx)
summary.lm(our.model)
What do these incantations do? The first one calculates the sample correlation coefficient of the data. If you mistakenly left out one of the xx values or one of the yy's, so that they don't match, you'll get an error. Then, the lm command itself tries to model yy (your response) in terms of the predictor xx. If you decided to try to predict xx from yy instead, you would write lm(xx ~ yy). The lm command does the work, and we save its result in a variable called our.model. The third command, summary.lm, prints out a summary of the analysis.

Remember that in simple linear regression, we try to find the straight line that best fits the data. We get both a slope and an intercept, which are called coefficients. Find these on the output. The coefficient for the slope is labeled with the name of the predictor it applies to (in this case, xx). Does the value look reasonable, based on your scatterplot? (Remember: slope is the rise over the run.)
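For reference, the usual least-squares formulas for the best-fitting line are sketched below. The notation is an assumption made here and does not come from the examination text: x_i and y_i are the paired observations, bars denote sample means, and the hatted betas are the estimated intercept and slope.

```latex
\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
```

The slope formula is the "rise over the run" idea averaged over all the data points, and the intercept formula says the fitted line passes through the point of means.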

Find the standard error of the estimated slope from the computer output. Verify that the t value given in the third column equals the estimated slope divided by the standard error.
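One possible way to make this check in Rweb itself, rather than by hand, is sketched below. It assumes the data and model from the earlier steps; the variable names coef.table, slope, se, and t.reported are illustrative and are not part of the examination.

```r
# Data and model as in the earlier steps
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5,107.6,116.0,113.6,112.3,
        119.5,130.0,122.0,115.2,133.2,132.3,142.6,132.7,135.4,132.8)
our.model <- lm(yy ~ xx)

# coef() applied to the summary returns the coefficient table as a matrix
# with one row per coefficient: "(Intercept)" and "xx"
coef.table <- coef(summary.lm(our.model))

slope      <- coef.table["xx", "Estimate"]    # estimated slope
se         <- coef.table["xx", "Std. Error"]  # its standard error
t.reported <- coef.table["xx", "t value"]     # t value printed in the summary

# The two printed numbers should agree
print(slope / se)
print(t.reported)
```

Doing the division by hand from the printed output is still worthwhile, since the printout is rounded and small differences in the last digit are expected.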

Verify that the Multiple R-Squared given on the printout equals the square of the correlation coefficient.
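The identity being checked can be written out as below. The notation is assumed here for illustration: y-hat is the fitted value from the regression line, y-bar is the sample mean of the response, and r is the sample correlation coefficient. The equality with the squared correlation holds for simple linear regression with one predictor and an intercept.

```latex
\[
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = r^2
\]
```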

Now, find the p-value for the test of the null hypothesis that the regression slope is zero. What is your conclusion?

This is the end of the examination.