In this lecture, I'll discuss some ways to use R to analyze data from the San Francisco Young Men's Health Study1. We'll demonstrate a few simple programming techniques, and use a modified version of the real data (i.e. fictitious data!) for confidentiality. Some of the results in this web page (using the real data, of course) are to appear shortly2 and the background literature, actual scientific results, and discussion are to be found in that reference.
The question of how infectious HIV is turns out not to be a simple one. Here, we'll focus on one approach, that of per-partnership infectivity3. By infectivity, we mean the probability that an individual is infected during a partnership, given that the partner is infected.
Now, what constitutes a partnership? We are going to restrict our attention to partnerships in which the subject undergoes receptive anal intercourse one or more times. We'll further distinguish between partnerships in which the person reports that condoms were always used (protected partnerships), and partnerships for which condoms were not always used (unprotected partnerships).
What we wanted to know was whether there was any evidence that the infectivity had declined after the widespread use of highly active antiretroviral therapy began in 1996. Of course, the causal inference is entirely circumstantial, in that we have no control group and no assessment of the exposure of partners of men in the study to HAART. And moreover, new subjects weren't recruited each year into the study, leaving open the possibility of bias due to aging and differential loss to follow-up.
The San Francisco Young Men's Health Study itself is a multistage probability sample (which began in 1992) of single men aged 18-29 who resided in 21 census tracts with the highest cumulative AIDS incidence1. It included 428 homosexual men at baseline and then enrolled an additional 622 referral subjects at the next follow-up time. At each approximately yearly study visit, subjects were tested for HIV antibodies and were asked, with respect to the previous 12 months, for their (1) number of sex partners, (2) number of sex partners under 30 years of age, (3) number of receptive anal intercourse partners, and (4) number of receptive anal intercourse partners with whom condoms were always used. The last four study visits used consistent questions about behavior and provided two periods at risk both before (4/94 to 9/95; and 9/95 to 11/96) and after (11/96 to 9/97; and 9/97 to 3/99) the introduction of HAART in San Francisco and were used in the analysis.
One problem we will discuss is that the subjects were asked about their sexual practices in the last 12 months, but more or less than 12 months elapsed since the last HIV test in many cases. Another problem is that as much as 3 months may sometimes elapse between infection and seroconversion (though median seroconversion times are much less than 3 months).
We first want to talk a bit about data formats and importing. R can import data in both fixed and delimited formats. Here, we have four data sets, one for each of the four study periods we decided to analyze. There are many ways to represent these in the computer; we chose simply to keep the four data sets as four separate data frames. It would also have been possible to merge them into a single data frame with additional variables.
Here is what the data file looked like (remember, these are false data points):
3 0 NA 0 NA NA "500555a01" 0.95
3 0 1 1 0 0 "500555s01" 1.5
3 0 3 4 2 2 "199691d01" 1.3
It's important to realize that we included in the analysis only subjects who were known to be HIV negative at the beginning of each period at risk.
Several things should be noted about these data. First, each record (all the variables from one subject) is located on a single line, and the values are separated by spaces; this is called "whitespace separated". Second, missing values have already been coded as NA, which as you know is the missing value code that R and S use. Different statistics programs use different codes, and moreover, survey centers may use still other codes (often several in the same data set); for instance, a value of 9 may denote a missing value for some variables, a dot (.) may denote a missing value for others, and perhaps 999 for others. And sometimes you see several different missing value codes depending on exactly what happened, to distinguish various kinds of nonresponse. So handling missing values gracefully, and more generally cleaning the data, is not at all a trivial matter or one to be taken for granted. Before the data were loaded into R, all of these were recoded to NA, though we could easily have done the recoding in R. Third, we chose to use no header (i.e., there is no line at the top containing variable names).
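As an aside, if a raw file had used (say) 9, a dot, and 999 as missing-value codes, the recoding could have been done at import time with the na.strings argument of read.table. This is just an illustrative sketch; the codes and the file name are hypothetical:

# hypothetical example: treat "NA", 9, ".", and 999 as missing on import
thedata <- read.table("rawdata.txt", header=FALSE,
                      na.strings=c("NA", "9", ".", "999"))

Note that na.strings applies to every column, so a code that means "missing" for one variable but is a legitimate value for another would still have to be recoded after import.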
Because there were 4 data sets, we chose to write a simple function to read the data in, and then apply it to four files:
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=FALSE);
+   vars <- c("studywave","hivstatus","nyoung","ntot","nrai","nrai.prot",
+             "id","frac.yrs.since.test")
+   dimnames(thedata) <- list(thedata[,7],vars)
+   thedata
+ }
Let's spend some time on this. Suppose that a little data set containing only the lines I showed you is saved in a file called td1 (test data 1). Let's read it in:
> # continuing above example
> testdata <- getdata("td1")
> testdata
          studywave hivstatus nyoung ntot nrai nrai.prot        id
500555a01         3         0     NA    0   NA        NA 500555a01
500555s01         3         0      1    1    0         0 500555s01
199691d01         3         0      3    4    2         2 199691d01
          frac.yrs.since.test
500555a01                0.95
500555s01                1.50
199691d01                1.30
> #continuing...
> is.data.frame(testdata)
[1] TRUE
> dimnames(testdata)
[[1]]
[1] "500555a01" "500555s01" "199691d01"

[[2]]
[1] "studywave"           "hivstatus"           "nyoung"
[4] "ntot"                "nrai"                "nrai.prot"
[7] "id"                  "frac.yrs.since.test"
> #continuing...
> dimnames(testdata) <- list(c("person1","person2","person3"),
+    c("var1","var2","var3","var4","var5","var6","var7","var8"))
> testdata
        var1 var2 var3 var4 var5 var6      var7 var8
person1    3    0   NA    0   NA   NA 500555a01 0.95
person2    3    0    1    1    0    0 500555s01 1.50
person3    3    0    3    4    2    2 199691d01 1.30
> #continuing...
> dimnames(testdata)[[1]] <- c("aa","bb","cc")
> testdata
   var1 var2 var3 var4 var5 var6      var7 var8
aa    3    0   NA    0   NA   NA 500555a01 0.95
bb    3    0    1    1    0    0 500555s01 1.50
cc    3    0    3    4    2    2 199691d01 1.30
Next semester we will learn how to make functions of our own that can appear on the left hand side of assignment.
Let's now return to where we were:
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=FALSE);
+   vars <- c("studywave","hivstatus","nyoung","ntot","nrai","nrai.prot",
+             "id","frac.yrs.since.test")
+   dimnames(thedata) <- list(thedata[,7],vars)
+   thedata
+ }
> testdata <- getdata("td1")
Normally, though, it is better to have the variable names associated with the data set itself as a header:
studywave hivstatus nyoung ntot nrai nrai.prot id frac.yrs.since.test
3 0 NA 0 NA NA "500555a01" 0.95
3 0 1 1 0 0 "500555s01" 1.5
3 0 3 4 2 2 "199691d01" 1.3
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=TRUE);
+   dimnames(thedata)[[1]] <- thedata[,"id"]
+   thedata
+ }
> testdata <- getdata("td1")
Of course, as always, "to err is human, but to really foul things up requires a computer" (as the saying goes). Here are a few bloopers that might happen at this stage:
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=TRUE);
+   dimnames(thedata)[[1]] <- list(thedata[,"id"])
+   # Error--this list component should be a vector of row names,
+   # not a list!
+   thedata
+ }
> testdata <- getdata("td1")
Error in "dimnames<-.data.frame"(*tmp*, value = list(thedata[, "id"])) :
        invalid dimnames given for data frame
How about this?
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=TRUE);
+   dimnames(thedata)[[2]] <- thedata[,"id"]
+   # Error--we're assigning to the columns, not the rows, and the
+   # length is wrong
+   thedata
+ }
> testdata <- getdata("td1")
Error in "dimnames<-.data.frame"(*tmp*, value = list(thedata[, "id"])) :
        invalid dimnames given for data frame
Here is another mistake you'll see:
> getdata <- function(fname) {
+   thedata <- read.table(fname,header=FALSE);
+   dimnames(thedata)[[1]] <- thedata[,"id"]
+   # Error--we said no header, so the header will be treated as
+   # data.
+   thedata
+ }
> testdata <- getdata("td1")
Error in "dimnames<-.data.frame"(*tmp*, value = thedata[, "id"]) :
        invalid dimnames given for data frame
OK, so we have a better version of getdata, but it still isn't good enough. Now we let the data file take care of the variable names, but we still need to be sure that the variable names in the data are what the program expects. After all, what if these data are being prepared for you by a data manager who gets the variable names wrong, or who uses a capital letter when you want lowercase, or something else? We'd better check and make sure we have everything we want. Suppose the data manager sent us the data set in a slightly different format, with an extra variable Index at the beginning, and with the variable studywave spelled as Studywave:
Index Studywave hivstatus nyoung ntot nrai nrai.prot id frac.yrs.since.test
1 3 0 NA 0 NA NA "500555a01" 0.95
2 3 0 1 1 0 0 "500555s01" 1.5
3 3 0 3 4 2 2 "199691d01" 1.3
> getdata <- function(fname,required=c()) {
+   thedata <- read.table(fname,header=TRUE);
+   dimnames(thedata)[[1]] <- thedata[,"id"]
+   nreq <- length(required)
+   if (nreq>0) {
+     colnames <- dimnames(thedata)[[2]]
+     for (ii in required) {
+       if (!(any(colnames==ii))) {
+         stop(paste("name",ii,"missing in data set"))
+       }
+     }
+   }
+   thedata
+ }
> testdata <- getdata("td1",c("studywave","hivstatus","nyoung","ntot","nrai",
+                             "nrai.prot","id","frac.yrs.since.test"))
Error in getdata("td1", c("studywave", "hivstatus", "nyoung", "ntot",  :
        name studywave missing in data set
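As an aside, the same check can be written a bit more compactly with setdiff. This is just an alternative sketch, not the version used in the analysis:

getdata <- function(fname, required=character(0)) {
  thedata <- read.table(fname, header=TRUE)
  dimnames(thedata)[[1]] <- thedata[,"id"]
  # any required variable names that do not appear among the columns
  missing.vars <- setdiff(required, dimnames(thedata)[[2]])
  if (length(missing.vars) > 0) {
    stop(paste("names missing in data set:",
               paste(missing.vars, collapse=", ")))
  }
  thedata
}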
One other thing that was done was to remove all records that were inconsistent, in the sense of having negative numbers of partners, missing numbers of partners, or more protected partnerships than total partnerships. There are many ways to do this; I chose to write a simple function:
> inconsistent.1 <- function(datum) {
+   sa<-datum["nrai"]
+   sap<-datum["nrai.prot"]
+   tot<-datum["ntot"]
+   sy<-datum["nyoung"]
+   ans <- (is.na(sa) | is.na(sap) | sa<0 | sap<0 | sa<sap)
+   ans
+ }
> keep.all <- function(datum0) {
+   ns<-dim(datum0)[1]
+   boolout <- rep(FALSE,ns)
+   for (ii in 1:ns){
+     boolout[ii] <- as.vector( !(inconsistent.1(datum0[ii,])) )
+   }
+   boolout
+ }
data3to4final <- data3to4[keep.all(data3to4),]
dimnames(data3to4final)[[1]] <- data3to4final[,"id"]
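For what it's worth, the same filtering can be done without an explicit loop. Here is a vectorized sketch, using the column names assigned by getdata:

# rows are kept when no partner count is missing, none is negative,
# and there are no more protected than total RAI partnerships
keep.rows <- with(data3to4,
                  !(is.na(nrai) | is.na(nrai.prot) |
                    nrai < 0 | nrai.prot < 0 | nrai.prot > nrai))
data3to4final <- data3to4[keep.rows, ]
dimnames(data3to4final)[[1]] <- data3to4final[,"id"]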
We wanted to use the fraction of partners under 30 to see if age mixing could be related to whatever findings we made. Unfortunately the data weren't collected at the last study period (which was actually the baseline for another study). Also, for people who did not report the ages of their partners, we wanted to use the overall fraction for that study period to impute it (what other imputation comes to mind?) So we wanted to be able to calculate the fraction of partners in each study period under 30. Here is one (admittedly pedestrian) way to do it:
> gen.overall.youngfrac <- function(dataset) {
+   totptrs <- 0
+   totyptrs <- 0
+   for (ii in 1:(dim(dataset)[1])) {
+     tot<-dataset[ii,]$ntot
+     sy<-dataset[ii,]$nyoung
+     if (is.na(tot) | is.na(sy)) {tot<-0;sy<-0}
+     totptrs <- totptrs+tot
+     totyptrs <- totyptrs+sy
+   }
+   totyptrs/totptrs
+ }
eprev3to4 <- gen.overall.youngfrac(data3to4final)
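The same fraction can also be computed without a loop. A minimal sketch that, like the loop above, drops subjects with a missing count from both the numerator and the denominator:

gen.overall.youngfrac.2 <- function(dataset) {
  ok <- !(is.na(dataset$ntot) | is.na(dataset$nyoung))  # complete reports only
  sum(dataset$nyoung[ok]) / sum(dataset$ntot[ok])
}
eprev3to4 <- gen.overall.youngfrac.2(data3to4final)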
It is common to get comma-separated versions of data, so your data set might instead have looked like this:
studywave,hivstatus,nyoung,ntot,nrai,nrai.prot,id,frac.yrs.since.test
3,0,NA,0,NA,NA,500555a01,0.95
3,0,1,1,0,0,500555s01,1.5
3,0,3,4,2,2,199691d01,1.3
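A file like this can be read with read.csv (or, equivalently, read.table with sep=","); presumably getdata would then be modified along these lines. A sketch that keeps the name-checking idea from above:

getdata <- function(fname, required=c()) {
  thedata <- read.csv(fname, header=TRUE)   # comma-separated, with a header
  dimnames(thedata)[[1]] <- thedata[,"id"]
  for (ii in required) {
    if (!any(dimnames(thedata)[[2]] == ii)) {
      stop(paste("name", ii, "missing in data set"))
    }
  }
  thedata
}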
By the way, for some reason, it's not uncommon to get CSV files that might look something like this:
studywave,hivstatus,nyoung,ntot,nrai,nrai.prot,id,frac.yrs.since.test,,,,,,,,,, ,
3,0,NA,0,NA,NA,500555a01,0.95
3,0,1,1,0,0,500555s01,1.5
3,0,3,4,2,2,199691d01,1.3
> testdata <- getdata("td1")
> testdata
          studywave hivstatus nyoung ntot nrai nrai.prot        id
500555a01         3         0     NA    0   NA        NA 500555a01
500555s01         3         0      1    1    0         0 500555s01
199691d01         3         0      3    4    2         2 199691d01
          frac.yrs.since.test  X  X  X  X  X  X  X  X  X  X
500555a01                0.95 NA NA NA NA NA NA NA NA NA NA
500555s01                1.50 NA NA NA NA NA NA NA NA NA NA
199691d01                1.30 NA NA NA NA NA NA NA NA NA NA
           X
500555a01 NA
500555s01 NA
199691d01 NA
Or the stray commas may turn up on a data line rather than in the header:

studywave,hivstatus,nyoung,ntot,nrai,nrai.prot,id,frac.yrs.since.test
3,0,NA,0,NA,NA,500555a01,0.95,,,,
3,0,1,1,0,0,500555s01,1.5
3,0,3,4,2,2,199691d01,1.3
> testdata <- getdata("td1")
Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
        more columns than column names
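There are various ways to cope with such stray commas. For the first kind of file, you can read it and then drop the junk columns; for the second kind, where the read fails outright, one option is to skip the header line and supply the column names yourself, allowing for the extra empty fields. Both of these are only sketches, again assuming the file is called td1:

# (1) read, then drop columns that contain nothing but NA
testdata <- read.csv("td1", header=TRUE)
testdata <- testdata[, !apply(is.na(testdata), 2, all)]

# (2) skip the malformed line and supply names, padding for the extra fields
vars <- c("studywave","hivstatus","nyoung","ntot","nrai","nrai.prot",
          "id","frac.yrs.since.test")
testdata <- read.csv("td1", header=FALSE, skip=1,
                     col.names=c(vars, paste("junk", 1:4, sep="")))[, vars]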
Other problems come up from time to time as well. For instance, suppose you had this file:

"Name","Diagnosis","Date"
'Doe, John',"HCV","11/11/1999"
'Roe, Jane',"HCV","11/12/1999"
'Zygill',"Pteromic Plague","11/15/1999"
> z <- read.csv("td2",header=TRUE)
> z
                    Name       Diagnosis       Date
'Doe               John'             HCV 11/11/1999
'Roe               Jane'             HCV 11/12/1999
'Zygill' Pteromic Plague      11/15/1999
> z[,"Diagnosis"]
[1] HCV        HCV        11/15/1999
Levels: 11/15/1999 HCV
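What happened here is that read.csv treats only double quotes as quoting characters by default, so the comma inside 'Doe, John' splits the field; and because the data rows then have one more field than the header has names, read.csv quietly uses the first field as row names. If single quotes really are meant as quotes, you can say so with the quote argument (a sketch):

z <- read.csv("td2", header=TRUE, quote="\"'")  # accept both " and ' as quote characters
z[,"Diagnosis"]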
There is one more thing we need to at least briefly mention: the fixed-format data file. It is quite common in the real world to transfer data in fixed format, where each variable occupies a fixed range of character positions (columns) on each line. Such formats are common when dealing with very large data sets. Suppose you have a data file with four variables; let's call them A, B, C, and D to keep this short. Variable A is in column 1, B in column 2, C in column 3, and D in columns 4-5. For instance, suppose the file td3 contains:
23A89
24Z98
 3F 2
> thedata <- read.fwf("td3",widths=c(1,1,1,2))
> thedata
  V1 V2 V3 V4
1  2  3  A 89
2  2  4  Z 98
3 NA  3  F  2
> dimnames(thedata)
[[1]]
[1] "1" "2" "3"

[[2]]
[1] "V1" "V2" "V3" "V4"
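read.fwf passes extra arguments through to read.table, so you can supply the variable names directly with col.names. A small sketch, using the names A-D from above:

thedata <- read.fwf("td3", widths=c(1,1,1,2),
                    col.names=c("A","B","C","D"))
thedata$D   # the two-character field, read as a numeric variable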
So there are many things that can go wrong in data importation in any program, and this can happen to you as an R user. (The oddest that happened to me was when text identifiers of the form 100200E09 for instance were misread as floating point data, becoming 1.002e+14.) With care and knowledge of the program, and defensive programming, you can eliminate many such errors.
Since we're considering the infectivity β to be the same for each person and for each partnership, we could begin to compute the infection probability as follows. For the simple model, let's neglect the ages of the reported partnerships for now. Each subject reported, for the last 12 months, the number of protected partnerships and the number of unprotected partnerships (remember that the subject reported RAI (receptive anal intercourse) for these partnerships). Let's call the number of unprotected partnerships n and the number of protected partnerships m.

Now, to begin with, suppose that the last test was conducted not quite a year ago. If it was conducted 9 months ago, then 0.75 of a year has elapsed since the test; if it was conducted 11 months ago, then 11/12 = 0.917 of a year has elapsed since the test. So let φ be the fraction of a year that has elapsed; thus, for now, we're assuming that φ is less than 1.
Now, in the model to follow we're going to neglect the few to several weeks that elapse between infection and seroconversion. We believe this interval to be small compared to one year; it will introduce some misclassification into the model.
Assuming the partners to be spread over the year at random, we could assume that a given reported partnership has a chance φ of having happened since the test. So what is the chance of seroconverting? For one reported partnership, the chance of seroconverting due to it is the chance it happened since the last test, times the chance the partner is infected, times the infectivity. If we denote the chance the partner is infected by p, then the chance of infection for that reported partner is φβp, and the chance of not being infected for that partnership is 1 - φβp. Then the chance of not being infected for all the unprotected partnerships (of which there are n) is (1 - φβp)^n.
Now, what to do about the m protected partnerships? Let's say that the infectivity is smaller if the person reports always using condoms. So we'll multiply down the infectivity by a factor θ which is less than or equal to 1. If θ is 0.05, then you only have 5% of the per-partnership chance of becoming infected from a partnership with an infected person, and so on. If we use the same simple assumptions as above, then we get the chance of not being infected for all the protected partnerships as (1 - φβθp)^m. So the overall chance of not being infected is (1 - φβp)^n × (1 - φβθp)^m, and the chance of being infected is 1 - (1 - φβp)^n × (1 - φβθp)^m.
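As a quick numerical illustration in R (all of the values below are made up for the example; they are not estimates):

beta  <- 0.10   # per-partnership infectivity (illustrative value only)
theta <- 0.25   # condom protection factor (illustrative value only)
p     <- 0.23   # chance a given partner is infected (illustrative value only)
phi   <- 0.75   # fraction of a year elapsed since the last test
n <- 3          # unprotected RAI partnerships reported
m <- 2          # protected RAI partnerships reported

escape <- (1 - phi*beta*p)^n * (1 - phi*beta*theta*p)^m
1 - escape      # chance of seroconverting since the last test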
What do we do if more than one year has elapsed since the test? For instance, suppose it's been 18 months since the test; then the actual risk period is 6 months (0.5 yr) longer than the 12 months covered by the questions. We call this period of excess risk τ. If the person says they had (say) 6 partners in the 12 months, you might think that the person was having a partner every 2 months, so that there were really an extra 3 partners since the test, and we are going to have to come up with some sort of simple adjustment for this. Now, as it happens, just this simplest sort of imputation worked quite well, but we actually did a slightly more involved imputation. We modeled the chance of not being infected during the 12 months just the same way as above (with φ = 1). Then we modeled the chance of escaping infection in the remaining period (the time beyond 12 months, since the test) as exp(-τβpn) for the unprotected partnerships and exp(-τβθpm) for the protected partnerships. This was based on assuming that the observed rate of (self-reported) partnerships in 12 months is our best estimate of the rate at which partnerships occur in the remaining risk period, and that given this rate, the partnerships occurred according to a Poisson process (each small interval of time had the same small chance of a partnership occurring in it). (We tried other, more complex imputation schemes and found that they made essentially no difference.)
So based on this simple model, we could compute the likelihood function of the data. Because (in part) of the small numbers involved in such probabilities, we computed the logarithm of the likelihood.
It is important to realize that in this model, prevalence and infectivity are completely nonidentifiable. In the paper, we discuss the issues involved in trying to assume reasonable bounds on the possible prevalence change over the study period and what this means in terms of statistical inference on the value of the infectivity.
So let's construct a log-likelihood function. Into it will go a data set and the parameters, and out will come the log-likelihood. The parameters we're going to solve for (the infectivities for each of the four study periods, and the condom protection value theta (θ)) will be located in the argument inpars. The first four elements of the vector inpars are the four period-specific infectivities, and the fifth element is theta. The second argument dataset is the data set we're looking at, and the fourth argument eprev is a vector of study-period-specific fractions of partners under 30. That leaves the third argument, which we called inprev.z: a matrix of period- and age-group-specific prevalences of HIV infection, as shown.
> # several potential age-group/period specific prevalences
> # row is age group, column is period (3 is the first period of
> # our analysis)
> inprev.1 <- rbind(c(0.23,0.23,0.23,0.23,0.23,0.23),
+                   c(0.23,0.23,0.23,0.23,0.23,0.23))
> inprev.2 <- rbind(c(0.23,0.23,0.23,0.22,0.21,0.20),
+                   c(0.23,0.23,0.23,0.22,0.21,0.20))
> simlik.1<-function(inpars,dataset,inprev.z,eprev) {
+   wave <- as.integer(dataset[1,"studywave"])
+   beta <- inpars[wave-2] ;
+   theta <- inpars[5] ;
+   phi<-pmin(1,dataset[,"frac.yrs.since.test"])
+   tau<-pmax(1,as.vector(dataset[,"frac.yrs.since.test"]))-1
+   sy <- as.vector(dataset[,"nyoung"])
+   tot <- as.vector(dataset[,"ntot"])
+   yfrac <- sy/tot
+   yfrac[is.na(sy) | is.na(tot)] <- eprev[wave]
+   prev <- inprev.z[1,wave]*yfrac + inprev.z[2,wave]*(1-yfrac)
+   bigb <- 1-phi*beta*prev ;
+   bigc <- 1-phi*beta*prev*theta ;
+   nu <- (dataset[,"nrai"]-dataset[,"nrai.prot"])
+   np <- (dataset[,"nrai.prot"])
+   ninf <- bigb^nu * bigc^np * exp(-tau*beta*prev*(nu+theta*np))
+   ninfm <- 1-ninf
+   xx<-(dataset[,"hivstatus"])
+   ll<-sum(log(ninf*(1-xx)+ninfm*xx))
+   ll
+ }
Let's go through this line by line.
simlik.1<-function(inpars,dataset,inprev.z,eprev) {
  wave <- as.integer(dataset[1,"studywave"])
  beta <- inpars[wave-2] ;
  theta <- inpars[5] ;

The first line inside the function determines which study wave this data set comes from; beta is then the infectivity parameter for that wave (waves 3 through 6 correspond to inpars[1] through inpars[4]), and theta is the condom protection factor.
We could instead have computed the wave as a default value of an argument:

simlik.1<-function(inpars,dataset,inprev.z,eprev,
                   wave=as.integer(dataset[1,"studywave"])) {
  beta <- inpars[wave-2] ;
  theta <- inpars[5] ;
phi<-pmin(1,dataset[,"frac.yrs.since.test"])
tau<-pmax(1,as.vector(dataset[,"frac.yrs.since.test"]))-1

Here phi is the fraction of a year at risk since the last test, capped at one, and tau is any excess time beyond one year. Note the use of pmin and pmax, the element-wise ("parallel") versions of min and max:
> z <- c(0.2, 0.3, 1.5, 0.6)
> min(1,z)
[1] 0.2
> pmin(1,z)
[1] 0.2 0.3 1.0 0.6
> z <- c(0.2, 0.3, 1.5, 0.6)
> max(1,z)
[1] 1.5
> pmax(1,z)
[1] 1.0 1.0 1.5 1.0
> pmax(1,z)-1
[1] 0.0 0.0 0.5 0.0
The next lines simply compute a vector of the fractions of partners under 30 for each person, imputing missing values from the overall fraction for that wave. Note that eprev must be a vector; the i-th element of the eprev-vector is the young partner fraction for study period (wave) i. Finally, the weighted prevalence of infection for the partners of each person is computed in the last line, and called prev.
sy <- as.vector(dataset[,"nyoung"])
tot <- as.vector(dataset[,"ntot"])
yfrac <- sy/tot
yfrac[is.na(sy) | is.na(tot)] <- eprev[wave]
prev <- inprev.z[1,wave]*yfrac + inprev.z[2,wave]*(1-yfrac)
Finally, we compute the log-likelihood as follows. We compute (as vectors) the escape probabilities for each individual, for protected partnerships (bigc) and unprotected partnerships (bigb). We then get the number of unprotected partnerships (nu) and protected partnerships (np); note that these are all vectors. Finally, we compute (again, as a vector, a value for each subject) the noninfection probability for that wave as ninf, and the infection probability ninfm as one minus that. The vector of values xx is the seroconversion status for each subject. When the subject is infected, the infection probability is computed; when the subject is not infected, the noninfection probability is used. This is calculated in the ll line; the log of each element is calculated, and then all these values summed up. This reflects the assumption that the subjects are independent, so the total likelihood is the product of the likelihood of each individual subject; the log of the product is then the sum of the logs. And this value is what is returned from the function.
bigb <- 1-phi*beta*prev ;
bigc <- 1-phi*beta*prev*theta ;
nu <- (dataset[,"nrai"]-dataset[,"nrai.prot"])
np <- (dataset[,"nrai.prot"])
ninf <- bigb^nu * bigc^np * exp(-tau*beta*prev*(nu+theta*np))
ninfm <- 1-ninf
xx<-(dataset[,"hivstatus"])
ll<-sum(log(ninf*(1-xx)+ninfm*xx))
ll
}
Next, how do we use this function? This computes the log-likelihood for one data set; there are four data sets. We compute the overall log-likelihood in a simple way:
> loglik <- function(inpars,d3,d4,d5,d6,inprev.zz,eprev) {
+   if (any(inpars<0) | any(inpars>1)) {
+     ans <- -1e200
+   } else {
+     ans <- simlik.1(inpars,d3,inprev.zz,eprev) +
+       simlik.1(inpars,d4,inprev.zz,eprev) +
+       simlik.1(inpars,d5,inprev.zz,eprev) +
+       simlik.1(inpars,d6,inprev.zz,eprev)
+   }
+   ans
+ }
Then we can simply perform the optimization using
> answer <- optim(c(0.11,0.11,0.056,0.043,0.04),
+   function(arg) loglik(arg,data3to4final,data4to5final,data5to6final,
+                        data6to7final,inprev.1,eprev),
+   control=list(fnscale=-1))
> answer
$par
[1] 0.11750527 0.12412787 0.05520114 0.04388523 0.05430423

$value
[1] -97.31828

$counts
function gradient
     462       NA

$convergence
[1] 0

$message
NULL
What we next did was to constrain the first two betas to be the same and the last two to be the same, and solve for the resulting three values using the optimizer. Finally, we constrained all the betas to be the same, and found the best single value. The difference between these two fits reflects how important it is to the log-likelihood to be able to choose a different value early and late. If this constraint does not matter much, then there is not much evidence that the infectivity changed. We performed the likelihood ratio chi-square test to test the hypothesis of constant infectivity, given our assumptions.
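For concreteness, here is a rough sketch of how such constrained fits and the likelihood ratio test could be set up with the functions above. The wrapper names (loglik.earlylate, loglik.equal) and the starting values are ours, for illustration only, not taken from the paper:

# early/late model: one beta for the first two waves, one for the last two
loglik.earlylate <- function(pars3, ...)   # pars3 = (beta.early, beta.late, theta)
  loglik(c(pars3[1], pars3[1], pars3[2], pars3[2], pars3[3]), ...)

# fully constrained model: a single beta for all four waves
loglik.equal <- function(pars2, ...)       # pars2 = (beta, theta)
  loglik(rep(pars2, c(4, 1)), ...)

fit3 <- optim(c(0.11, 0.05, 0.04),
              function(arg) loglik.earlylate(arg, data3to4final, data4to5final,
                                             data5to6final, data6to7final,
                                             inprev.1, eprev),
              control=list(fnscale=-1))
fit2 <- optim(c(0.08, 0.04),
              function(arg) loglik.equal(arg, data3to4final, data4to5final,
                                         data5to6final, data6to7final,
                                         inprev.1, eprev),
              control=list(fnscale=-1))

lrt <- 2 * (fit3$value - fit2$value)    # likelihood ratio statistic
pchisq(lrt, df=1, lower.tail=FALSE)     # one constraint: beta.early == beta.late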
One way of viewing these data is to construct a kind of profile likelihood
surface. We first back up and recall that the actual transmission
probability per partnership was the product of the prevalence and
the infectivity. These two are not identifiable, but their product can
be estimated. We can then compute the infectivity if we know the prevalence,
and this is in effect what is happening. Let's aggregate the first two
waves (study visits) and the last two, so that we have really three things
to estimate from the data here: transmission probability per contact for the
first two periods, for the last two, and also the condom protection factor.
Suppose that we just pick the transmission probabilities, and then with those
fixed, find the best value of the condom factor. Once we have the value of
the condom factor that maximizes the likelihood, we compute the value of the
likelihood given the assumed transmission probabilities and the estimated
condom value. Let's then plot this value of the (log) likelihood.
A contour plot looks like this:
What does this mean? For each pair of transmission probabilities, we have the best possible likelihood we can get (the largest value possible, found by picking the condom protection factor to make it as big as possible). There is a peak labeled with a cross, as you can see. That value represents the maximum likelihood estimate of the transmission probabilities, and notice that the value post-HAART is substantially lower than pre-HAART. We believe this difference to be due to drugs, but this cannot be proven based on these data alone (there is no control group and no possible way to assess the exposure of the partners of the men in the study to the drugs.)
Now, suppose that we assume that the prevalence is the same pre- and post-HAART.
Let's now constrain the infectivity
to be the same. So with the prevalence the same and the infectivity the
same, we have to have the transmission probability be the same before and
after HAART. In other words, we have to stay on the 45-degree line on the
graph, shown on the figure in solid black. Given that, where is the
maximum likelihood estimate? It is at the black dot you see on it; if you
stay on that black line, that is the highest elevation you can get to, and it
represents the best single (lumped) estimate of the transmission probability
per partnership. If the height you can get to is not far below the very
best (unconstrained) height (at the cross), then the constraint didn't make
much difference in a way. If that were to be the case, the data would
provide no evidence in favor of a difference. On the other hand, if having
to stay on the solid line dramatically lowered the best (greatest) likelihood
you could achieve, then that is evidence that you shouldn't have to stay
on the solid line: you would reject the hypothesis of constant infectivity
(given constant prevalence).
So which is it? We indicated contours that correspond to hypothesis rejection at different alpha levels (by the likelihood ratio chi-square test), and that is what you see on the graph. The solid dot (the maximum likelihood estimate of the transmission probabilities given the hypothesis of no infectivity difference, and constant prevalence) is between the 0.05 contour and the 0.01 contour, so the p-value is between 0.05 and 0.01. So at a level of 0.05, we reject the hypothesis of constant infectivity.
Now, suppose that the prevalence increased post-HAART, since the death rate
due to AIDS decreased. Now, if we assume that the infectivity is the same
pre- and post-, what happens? We're assuming the prevalence after HAART, p2, is greater than the prevalence p1 pre-HAART. If we pick some infectivity β, we can then compute the transmission probability per partnership before and after HAART as βp1 and βp2, respectively. If we do this for a lot of different values
of the infectivity and plot the answers, we get a straight line, not at 45-degrees, but sloping upward at a steeper angle (since the prevalence has been
assumed to increase but the infectivity is the same, the actual
transmission probability per partnership just has to go up.) Given this
constraint, what is the best value? You find the highest point on the
transect, and see how much lower it is than the overall peak. In our
case, the steeper this slope, the lower it is, indicating that if the
prevalence actually increased, it is less plausible that the infectivity is
the same pre- and post-HAART. In particular, the dotted line A (at a rejection threshold of 0.001) corresponds to a 57.7% increase in prevalence (but that is really too big to believe), and the dotted line B (at a rejection threshold of 0.01) corresponds to a 17.2% increase in prevalence.
This goes the other way too. Suppose instead that the prevalence decreased after HAART (perhaps because people who were negative disproportionately and greatly increased their risk behavior, diluting the positives; there is no evidence for this mechanism, at least in our cohort). Now the solutions are constrained to a line with a shallower slope than 45 degrees. The peak along these transects is higher, indicating that there would be less evidence against the hypothesis of constant infectivity; in other words, we would be partially explaining the findings by assuming a decrease in prevalence. Assuming a 9.3% decline in prevalence takes us to the rejection contour of 0.05; if we believe the prevalence decreased by more than 9.3% after HAART was introduced, we can no longer reject the hypothesis of constant infectivity (line D). Finally, the 0.10 rejection contour is reached by assuming a 20.4% decline in prevalence.
I'll show how these were constructed in R in the future; this was a somewhat numerically intensive calculation. Crude contour plots can be made with the function contour which I invite you to explore.
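In the meantime, here is a rough sketch of how such a profile log-likelihood surface could be tabulated and passed to contour. The grid, the all-ones prevalence matrix (which makes the parameters play the role of transmission probabilities per partnership, beta times p), and the helper names are illustrative assumptions, not the code used for the published figure:

# with the prevalence matrix set to 1, the "infectivity" parameters act as
# the transmission probability per partnership (beta * p)
inprev.flat <- rbind(rep(1,6), rep(1,6))

tp.early <- seq(0.01, 0.10, length=25)   # assumed pre-HAART transmission probabilities
tp.late  <- seq(0.01, 0.10, length=25)   # assumed post-HAART transmission probabilities

profile.ll <- matrix(NA, length(tp.early), length(tp.late))
for (i in seq(along=tp.early)) {
  for (j in seq(along=tp.late)) {
    # maximize over the condom factor theta with the transmission
    # probabilities held fixed
    opt <- optimize(function(theta)
                      loglik(c(tp.early[i], tp.early[i],
                               tp.late[j],  tp.late[j], theta),
                             data3to4final, data4to5final,
                             data5to6final, data6to7final,
                             inprev.flat, eprev),
                    interval=c(0, 1), maximum=TRUE)
    profile.ll[i, j] <- opt$objective
  }
}
contour(tp.early, tp.late, profile.ll,
        xlab="transmission probability, pre-HAART",
        ylab="transmission probability, post-HAART")
abline(0, 1)   # the 45-degree line: equal pre- and post-HAART transmission probability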
Another thing that we did was to compute bootstrap resamples of the data set for each wave, and for each resampled data set, use the same algorithm to find the best estimate. One problem, however, arises: a bootstrap resample might contain no seroconverters at all, and then the best estimate for the infectivity in that wave is zero, right on the boundary. In these cases we had to eliminate that particular beta from the optimization, since we already know what it is.
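A minimal sketch of the resampling step for one wave, assuming the data frames above (the refitting, and the bookkeeping for resamples with no seroconverters, are omitted):

# draw one bootstrap resample of the subjects in a wave: rows resampled
# with replacement, then refit by maximizing the log-likelihood as before
boot.one <- function(dataset) {
  dataset[sample(nrow(dataset), replace=TRUE), ]
}
data3to4.boot <- boot.one(data3to4final)
any(data3to4.boot[,"hivstatus"] == 1)   # does the resample contain any seroconverters?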
Another part of the final analysis was to assume a different per-partnership infectivity for a subset of the population, and to simulate seroconversion data under this assumption. We then fit the simulated data and computed how often we falsely concluded there was an infectivity drop due solely to frailty selection bias (using up the people most likely to be infected, causing the average infectivity observed in the cohort to fall).
In the future we'll add some of this code to this page; more details and complete references can be found in the paper2.
In the next lecture, we'll look at a SARS preparedness analysis programmed by Tomas Aragon.
1Osmond DH, Page K, Wiley J, Garrett K, Sheppard HW, Moss AR, Schrager L, Winkelstein W Jr. HIV infection in homosexual and bisexual men 18 to 29 years of age: the San Francisco Young Men's Health Study. American Journal of Public Health 84:1933-1937, 1994.
2 To appear.
3Grant RM, Wiley JA, Winkelstein W. Infectivity of the human immunodeficiency virus: estimates from a prospective study of homosexual men. Journal of Infectious Diseases 156: 189-193, 1987.