Biostatistics Final Examination, Fall 2001.
Travis Porco, Instructor
Administrative Details
- This examination is due at 6:00 PM on December 19, 2001.
There will be no exceptions.
You may FAX the examination to me at 415-431-7029; you may
email the examination as a *.pdf, *.html, or *.dvi file (no
MS Word attachments please); you may mail the examination to
me at PO Box 4392, Berkeley CA 94704; you may deliver the
examination to me at 25 Van Ness Avenue, Suite 710, San
Francisco. If you choose to mail the examination, it is
your responsibility to make sure it gets to me by 6:00 PM
on December 19.
- Please use precise, legible, clear, and organized language.
Present your ideas in a complete and professional manner.
Type your work; use a spell-checker; use complete sentences and
organized paragraphs.
- Present your work in your own language. Two identical papers will
both receive failing grades, without exception.
Scope
Question 1
Explain how inferential statistics is designed to help us avoid
misleading conclusions based on chance patterns which might arise
in our data. Give an example.
Question 2
Explain why biostatistics is important in both public health
practice and public health advocacy. Why is good science
important to public health? You might wish to discuss such issues as
the ability of advocacy to help the right questions get asked, and
to implement public health recommendations based on good science.
Give specific examples from your work or from the class.
Definitions
Question 3
Give precise, complete, and unambiguous definitions and explanations
of the following ideas, with examples if necessary. Choose three out
of the four.
- Sample mean
- Sample variance
- Relative frequency
- Population mean
Question 4
Give precise, complete, and unambiguous definitions and explanations
of the following ideas, with examples if necessary. Choose three out
of the four.
- Null hypothesis
- Type I Error
- Type II Error
- Independence of events
Question 5
Summarize your knowledge about the following:
- Histogram
- Normal distribution
- Measure of spread
Question 6
Summarize your knowledge about the following. Give a definition,
significance, and applications; explain how the concepts relate
to other concepts in the class. Choose four out of five.
- Binomial distribution
- Random sample
- Sample Covariance
- Sample Correlation Coefficient
- Confidence interval
Probability
Question 7
Someone says, "I do not believe that smoking causes cancer,
since my grandmother smoked 4 packs of cigarettes every day and
lived to be 93." Give a precise, complete response to this,
free of mathematical jargon, and yet answering this objection. You
might want to consider such issues as the probabilistic nature of
our present understanding of epidemiologic risk factors, and the
difficulty of predicting individual outcomes from these probabilities.
Question 8
Someone says, "My husband and I are both sickle cell carriers.
Our doctor says that we should expect one in four of our children
to have sickle cell disease, and we have had three healthy children.
We wish to have a fourth child, but we believe that the doctor's
remarks indicate that the next child is certain to have sickle cell
disease." Give a precise, complete, and unambiguous response
to this, free of mathematical jargon. You might want to consider
such issues as the independence of events or the relative frequency
definition of probability; what did the doctor really mean by
"expect one in four to have sickle cell disease"?
Question 9
Choose three out of five of the following. Summarize your knowledge
about:
- The chi-square test of independence in a two-by-two table.
- The Z test of whether an observed proportion equals a known value
- The two-sample T-test
- One-way analysis of variance
- Linear regression
In particular, explain in words and symbols what the null hypothesis
is in each case. What is the test statistic? What is the distribution
of the test statistic given the null hypothesis? Give an example when
you might use each of these methods.
Question 10
Suppose you are given the following data to analyze:
X | Y |
50 | 115.8 |
51 | 115.5 |
52 | 110.9 |
53 | 107.6 |
54 | 116.2 |
55 | 109.3 |
56 | 110.5 |
57 | 107.6 |
58 | 116.0 |
59 | 113.6 |
60 | 112.3 |
61 | 119.5 |
62 | 130.0 |
63 | 122.0 |
64 | 115.2 |
65 | 133.2 |
66 | 132.3 |
67 | 142.6 |
68 | 132.7 |
69 | 135.4 |
70 | 132.8 |
First, enter these data into Rweb, by proceeding to the Rweb window
and entering
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
and then
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5,
107.6,116.0,113.6,112.3,119.5,130.0,122.0,115.2,133.2,
132.3,142.6,132.7,135.4,132.8)
Type both of these into the Rweb window. Then copy them to the
clipboard of your machine, and save them. Then hit the submit button,
and see what happens. You must enter these data in exactly the
form you see; do not type a period when you need a comma, or a
comma when you need a period; do not put a space between the
symbols < and -; do not omit a parenthesis. If there is a mistake
or a problem, the machine will respond with an error message and
"Execution halted". If this happens, use the back button and
return to the entry window; if your text is still there, correct
it and try to submit again; if your text has vanished, paste it back in and
fix it; save it again and hit submit again.
Feel free to ask me any computer questions.
Once you have been able to enter the data without an error,
add a plot command to the end of your statements:
plot(xx,yy)
Your Rweb entry box should contain the command starting with xx,
the command starting with yy, and then the plot command. Hit submit.
This plot command will produce a scatterplot; print it and
interpret it. What does it mean? Which variable is the predictor;
which is the response? How would you describe the relationship?
Now, we are going to add some more commands to do the analysis
using Rweb. Use the back button to go back to the entry window.
If your text is there, you will add more to it. If your text is
not there, paste it back in from the clipboard and then add some
more commands to it.
To analyze the data in Rweb, we will use the built-in command for
linear modeling as follows:
cor(xx,yy)
our.model <- lm(yy ~ xx)
summary.lm(our.model)
What do these incantations do? The first one calculates the
sample correlation coefficient of the data. If you mistakenly
left out one of the xx values or one of the yy's, so that they
don't match, you'll get an error.
Then, the lm command itself
tries to model yy (your response) in terms of the predictor xx.
If you decided to try to predict xx from yy instead, you would
write lm(xx ~ yy). The lm command does the work, and we save it
to a variable called our.model. The third command, called summary.lm,
prints out a summary of the analysis.
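As a rough sketch, the analysis commands described above might be collected into one Rweb entry like the following. The stopifnot line and the variable name r are additions for illustration, not part of the required commands. summary() is the generic function that calls summary.lm for a model produced by lm, so either spelling should give the same printout.

```r
# Predictor (xx) and response (yy), copied from the data table above.
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5,
        107.6,116.0,113.6,112.3,119.5,130.0,122.0,115.2,133.2,
        132.3,142.6,132.7,135.4,132.8)

# Illustrative guard (an addition): stop early if a value was dropped
# from either vector, so that the pairs no longer match.
stopifnot(length(xx) == length(yy))

r <- cor(xx, yy)           # sample correlation coefficient
our.model <- lm(yy ~ xx)   # least-squares fit of the response yy on the predictor xx

print(r)
print(summary(our.model))  # dispatches to summary.lm for an lm fit
```

If the guard fails, R reports an error and halts, just as it does for a typing mistake.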
Remember that in simple linear regression, we try to find the
straight line that best fits the data. We get both a slope and
an intercept, which are called coefficients. Find these on the
output. The coefficient for the slope is labeled with the name of
the predictor it applies to (in this case, xx). Does the value
look reasonable, based on your scatterplot? (Remember: slope is
the rise over the run.)
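For reference, "best fits" here means least squares, which is the criterion lm uses. The formulas below are a sketch in notation that is assumed for illustration and does not appear elsewhere in this examination: b_0 is the intercept, b_1 the slope, and x-bar and y-bar the sample means of the predictor and the response.

```latex
\hat{y} = b_0 + b_1 x, \qquad
b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
           {\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The slope is the sample covariance of the two variables divided by the sample variance of the predictor, which is one way to check the computer output by hand.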
Find the standard error of the estimated slope from the computer
output. Verify that the t value given in the third column equals
the estimated slope divided by the standard error.
Verify that the Multiple R-Squared given on the printout equals
the square of the correlation coefficient.
Now, find the p-value for the test of whether the regression slope
differs from zero. What is your conclusion?
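If you would like the machine to help with the verifications above, one possible approach is sketched below. The names fit.summary, coefs, slope, slope.se, t.check, r2.check, and slope.p are chosen here for illustration only. The sketch assumes the coefficient table returned by coef(summary(...)) has a row named xx and columns named as on the printout. You must still read the values from the output and state your own conclusion.

```r
# Data and model, repeated so that this entry runs on its own.
xx <- c(50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70)
yy <- c(115.8,115.5,110.9,107.6,116.2,109.3,110.5,
        107.6,116.0,113.6,112.3,119.5,130.0,122.0,115.2,133.2,
        132.3,142.6,132.7,135.4,132.8)
our.model <- lm(yy ~ xx)

fit.summary <- summary(our.model)
coefs <- coef(fit.summary)  # columns: Estimate, Std. Error, t value, Pr(>|t|)

slope    <- coefs["xx", "Estimate"]
slope.se <- coefs["xx", "Std. Error"]

# Does the t value equal the estimated slope divided by its standard error?
t.check <- isTRUE(all.equal(slope / slope.se, coefs["xx", "t value"]))

# Does R-squared equal the square of the sample correlation coefficient?
r2.check <- isTRUE(all.equal(cor(xx, yy)^2, fit.summary$r.squared))

# p-value for the slope, taken from the last column of the table.
slope.p <- coefs["xx", "Pr(>|t|)"]

print(c(t.check = t.check, r2.check = r2.check))
print(slope.p)
```

all.equal compares numbers up to a small tolerance, which avoids false mismatches caused by rounding.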
This is the end of the examination.