Over the last 15 years, sexually transmitted infection/human immunodeficiency virus (STI/HIV) prevention researchers have increasingly called for implementation of multilevel, multicomponent, structural and social interventions aiming to modify environments that place populations at risk.^{1–10} The enthusiasm for multilevel interventions is based on recognition that economic, political, and social context shapes sexual behavior. However, interventions that aim to change social structures and environments using multiple prevention components are notoriously difficult to design, implement, and evaluate.^{6,11} The gold standard for evaluation of community-focused interventions to change social environments or structures is the randomized community trial. Given the cost of randomizing communities to a multilevel intervention, such efforts should only be undertaken based on findings from rigorously designed observational studies, to ensure that strategies tested in randomized community trials are those most likely to be effective.

While observational research has been considered a less rigorous alternative to the randomized trial, observational designs have many strengths for evaluating prevention programs and examining the magnitude of their effects. Following a single community cohort over the course of a prevention program offers a rigorous and cost-effective means to evaluate multilevel interventions, and, by using cutting-edge statistical tools, it can provide estimates that are as close as possible to trial results without the cost. Randomizing individuals in a community is often difficult: one cannot enter a community and randomize friends and neighbors to participate or not in public forums, community events, or other mobilization efforts. Instead, evaluation in this context means comparing individuals in the community who participate to those who do not (or do so less). The challenge becomes comparability of the participants to the nonparticipants (internal study validity). Fortunately, there are weighting methods that can correct for the biases found in nonrandomized studies and permit valid comparison.

This article outlines the evaluation of a multicomponent STI/HIV prevention project with the application of weighting methods to observational cohort data. The method, inverse probability weighting (IPW), is one estimating tool from a class of techniques called marginal structural models^{12–18} that facilitate causal inference with observational data.^{19,20} IPW was first applied in clinical epidemiology as an analysis tool to address complex confounding structures in longitudinal AIDS treatment trials. The weighting methods have been well described in the AIDS treatment literature.^{13,21–23} Whereas the complex theoretical and statistical underpinnings of these methods may have limited their accessibility initially, IPW is an invaluable tool in dealing with selection bias, loss-to-follow-up, and complex confounding in evaluation of prevention programs. We hope to facilitate wider use of IPW in STI/HIV intervention research by demonstrating, step-by-step, how IPW can be used in prevention studies.

#### MATERIALS AND METHODS

##### Data Source: Encontros

From 2003 to 2005, Encontros, or “coming together,” aimed to decrease incident STI and encourage adoption of consistent condom use among female, male, and transvestite sex workers in Corumbá, Brazil, by using joint clinical and social intervention strategies. The project was designed to engage the sex workers on an individual level through participation in STI/HIV counseling and testing, on an interpersonal level through peer-education, and on a community level through outreach and social activities. Community-based activities were designed to extend and strengthen collegial relationships through providing sex workers opportunities to engage in dialogue around sex work, discrimination, human rights, and prevention.^{24}

Participants included 420 men, women, and transvestites, 18 years of age or more, self-identifying as sex workers, who spoke Portuguese or Spanish and did not plan on leaving the study area permanently in the month following recruitment. Participation included an enrollment visit and 4 scheduled follow-up visits at 3, 6, 9, and 12 months following enrollment. Each visit included administration of a structured, interviewer-administered questionnaire, counseling for STI/HIV, collection of biologic samples for STI testing, treatment for STI if indicated, and testing for HIV if requested. Intervention activities were ongoing during rolling enrollment, follow-up visits, and through the final month of participation. Chlamydia and gonorrhea were diagnosed using COBAS AMPLICOR CT/NG PCR (Roche Molecular Diagnostics, Pleasanton, CA).

All enrolled participants were encouraged to participate in project sponsored events; as a result there is no “control” population. Instead we compare participants who were more and less exposed to the intervention, with the hypothesis that sex workers who actively participated in the project would present with fewer incident cases of chlamydia and gonorrhea. Project participation varied at each time point and included contact with peer educators and counselors and participation in community cultural or social events, workshops, educational activities, and organizations/associations. Participation was summarized into a dichotomous indicator of low or high participation at each visit. Low exposure describes those who attended scheduled appointments but who participated very little in project activities. The high exposure group attended scheduled appointments and had more contact with educators or counselors and participated in project events, workshops, or organizations.

##### Biases and Confounding

Participants in social or community interventions such as Encontros often cannot be randomized. Instead, they choose how much to participate (or self-select) based on factors often associated with access to services, behaviors, and infection. For example, people who chose to participate more in Encontros activities tended to be older and already participating in community groups. Comparison of people participating more and less is likely to be confounded. Similarly, participants who remain in a prevention program or a research cohort are different from those who do not. Over half of the 420 participants in the Encontros cohort were lost-to-follow-up before completing a year of follow-up. Those who remained in the study were more likely to be married, were less mobile, and reported more STI symptoms. This loss-to-follow-up can be thought of as a missing data problem (absence of data on those who left) or as a selection bias problem (people who stay are different from those who do not). Estimates of effect based only on people who remain in the cohort may not reflect the true program effect on all participants.

Furthermore, in longitudinal studies one may also find time dependent confounding, whereby past outcomes, covariates and exposure to the intervention may impact subsequent exposure and subsequent infections.^{13,14,17} Attempting to eliminate the effects of time dependent confounding by treating prior participation as a confounder in a regression model can result in bias by controlling away the effect of interest. However, failure to account for prior participation can also bias the effect estimate. In this study, we explored confounding within intervals of data as well as time dependent confounding. The weighting approach presented herein can mitigate self-selection, loss-to-follow-up and confounding, including time dependent confounding, when relevant.

##### Weighting to Balance Populations

Our aim is to assess the effect of the intervention on incident STI among sex workers. We estimate the intervention effect by comparing the odds of presenting with an incident STI if all sex workers in our study population were high participators, as compared to the odds of an STI if everyone were low participators at each time point. This can be described as the marginal probability of the outcome under the different levels of participation (often generically called “treatments” in the causal inference literature). We assume that the characteristics that predict participation in the intervention are the same over all time points, or intervals, so we simply pool data over time, analyzing all of the time points together in one model.

Hypothetically speaking, to do away with the biases in this longitudinal data set, one would need to randomize people to low or high participation levels at every visit to ensure that past experiences and covariate distribution had no bearing on the exposure/outcome relationship at each interval. We simulate this repeat randomization using weights. Inverse probability weighting for exposure (or in this case for participation) assigns a weight for each subject equivalent to the inverse probability of being in his/her participation group, based on values of covariates, past outcomes, and exposures at each interval. This treatment probability is estimated using regression techniques. The exposure weights are then applied to the observed population, creating a new pseudo-population in which the joint distribution of confounders is balanced between the 2 exposure groups, essentially mimicking a randomization procedure at each time point.^{17}
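
The weighting step can be illustrated with a minimal sketch. It assumes a single binary confounder (`older`) and estimates the probability of high participation from stratum frequencies, which stands in for the regression techniques described above. The data, variable names, and function names are hypothetical and are not taken from the Encontros analysis.

```python
from collections import defaultdict

# Toy person-interval records: (older, high_participation). Hypothetical data
# in which older participants are more likely to be high participators.
records = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 1)] * 2 + [(0, 0)] * 6


def treatment_weights(rows):
    """Return one inverse probability of treatment weight per row."""
    n = defaultdict(int)       # rows per confounder stratum
    n_high = defaultdict(int)  # high participators per stratum
    for older, high in rows:
        n[older] += 1
        n_high[older] += high
    weights = []
    for older, high in rows:
        p_high = n_high[older] / n[older]            # estimated P(high | older)
        p_observed = p_high if high else 1 - p_high  # P(observed level | older)
        weights.append(1 / p_observed)
    return weights


def weighted_share_older(rows, weights, group):
    """Weighted proportion of older participants within one exposure group."""
    num = sum(w * older for (older, high), w in zip(rows, weights) if high == group)
    den = sum(w for (older, high), w in zip(rows, weights) if high == group)
    return num / den


w = treatment_weights(records)
share_high = weighted_share_older(records, w, 1)
share_low = weighted_share_older(records, w, 0)
```

In the unweighted toy data, 75% of high participators and 25% of low participators are older. After weighting, both `share_high` and `share_low` equal 0.5, so the confounder is balanced across exposure groups in the pseudo-population.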

This weighting technique can also be used to deal with the issue of loss-to-follow-up or censoring. We assign each participant a weight equivalent to the inverse probability of remaining in the study at each interval, based on values of observed covariates and past outcomes and exposures. Probabilities are estimated using basic regression techniques. The censoring weights are applied to the observed population, creating a new pseudo-population in which censored subjects are “replaced” by up-weighting uncensored subjects with the same values of past exposures and covariates. Censoring weights and treatment weights are combined (multiplied) into final weights to allow for simultaneous adjustment.
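
As a rough sketch of how the two weights combine, the following assumes the probability of remaining in the study has already been estimated for a covariate stratum. The function names and numbers are hypothetical.

```python
def censoring_weight(p_remaining):
    """Inverse probability of remaining in the study for one interval."""
    if not 0 < p_remaining <= 1:
        raise ValueError("probability of remaining must be in (0, 1]")
    return 1 / p_remaining


def final_weight(treatment_weight, p_remaining):
    """Multiply the treatment weight by the censoring weight."""
    return treatment_weight * censoring_weight(p_remaining)


# Hypothetical stratum: 10 people enrolled, 5 remain, so P(remaining) = 0.5.
# Each remaining person stands in for 2, restoring the original stratum size.
p_remaining = 5 / 10
pseudo_population = 5 * censoring_weight(p_remaining)

# Hypothetical participant in that stratum with a treatment weight of 4.
combined = final_weight(4.0, p_remaining)
```

Here the five uncensored participants are up-weighted to represent all ten people in their stratum, and the combined weight for the example participant is the product of the two component weights.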

##### Weighting Procedures

##### Data Management.

Cohort participants come for repeated visits; the data need to be in long or stacked format (Table 1) and outcomes must be linked with time points to create treatment weights and run analyses. In the Encontros study, not all subjects adhered to their scheduled visit dates, resulting in nonuniformity in data collection. Because visits were to take place every 3 months, we divided data into 3-month intervals, centered on the scheduled time points of the 3-, 6-, 9-, and 12-month visits, such that each participant's first visit was equivalent to time “0” and subsequent visits between 45 days and 135 days later (3-month interval of 90 days centered on the next scheduled visit) were included in interval 1, and so on. We include 5 intervals of data in this analysis.
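
The interval rule can be sketched as a small function. Two details are assumptions, because the text does not state them: a visit falling exactly on a boundary (for example, day 135) is assigned to the later interval, and the highest interval is left as a parameter (`max_interval`). The function name is hypothetical.

```python
def assign_interval(days_since_enrollment, max_interval=4, width=90):
    """Map days since the first visit to a 3-month interval index.

    Returns 0 for the enrollment visit, k for a visit inside the window
    centered on k * width days, and None if the visit falls in no window.
    """
    if days_since_enrollment == 0:
        return 0
    half = width // 2
    for k in range(1, max_interval + 1):
        center = k * width
        # Assumed convention: lower bound inclusive, upper bound exclusive.
        if center - half <= days_since_enrollment < center + half:
            return k
    return None


example_intervals = [assign_interval(d) for d in (0, 45, 100, 135, 360)]
```

A visit 100 days after enrollment is assigned to interval 1, and a visit 20 days after enrollment falls in no window and returns `None`.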

Table 1 displays a spreadsheet view of how the data should be set up for a weighted longitudinal analysis. A row has been created for each participant for each interval. Every row indicates whether the participant was present (used to create censoring weights) and, if so, his/her participation level and outcome. Note that there are no exposure data for the first (enrollment) visit; data for analyzing the exposure/outcome relationship begin in interval 1.
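
A minimal sketch of building such a stacked data set follows. The column names (`present`, `high_participation`, `sti`) and the input structure are hypothetical and are not the actual layout of Table 1.

```python
def to_long_format(visits, participant_ids, n_intervals=5):
    """Create one row per participant per interval.

    visits maps (participant_id, interval) to (participation, outcome);
    participation is None at enrollment, where no exposure data exist.
    Intervals with no recorded visit are kept, marked present = 0.
    """
    rows = []
    for pid in participant_ids:
        for interval in range(n_intervals):
            seen = (pid, interval) in visits
            participation, outcome = visits.get((pid, interval), (None, None))
            rows.append({
                "id": pid,
                "interval": interval,
                "present": int(seen),  # used for the censoring model
                "high_participation": participation,
                "sti": outcome,
            })
    return rows


# Hypothetical input: participant 2 is lost after the enrollment visit.
visits = {(1, 0): (None, 0), (1, 1): (1, 0), (1, 2): (0, 1), (2, 0): (None, 1)}
long_rows = to_long_format(visits, [1, 2], n_intervals=3)
```

Keeping the rows for missed intervals is what allows the probability of remaining in the study to be modeled from the same file.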

##### Variable and Model Selection.

Once the data have been arranged, the next step is to model the association between confounders and the exposure (i.e., intervention participation). When there is loss-to-follow-up or censoring, the association between confounders and presence at each visit must also be modeled. These models are the first step in calculating the weights and are generated using so-called black box techniques designed to predict intervention exposure well without concern about interpretability, while trading off variance and bias in the model predictions. Just as in a typical regression analysis, variables selected to control for confounding in these treatment models should be potential confounders related to the outcome. Including variables that are strongly related to treatment but not the outcome can severely hurt estimation. The resulting model has 2 elements: (1) the set of variables chosen, and (2) the functional form of these variables (e.g., main effects, polynomial terms, multiplicative interactions). These models can be chosen by automated algorithms or by expert knowledge, although expert knowledge can rarely identify the proper functional form of the variables.

Most investigators confront missing data. We chose to carry forward values for those covariates deemed not to change drastically in the period of 6 months. For example, we considered that self-efficacy for condom use would not change at every visit and therefore need not be measured at every visit. To include the self-efficacy variable in the treatment and censoring models, values were carried forward from the last visit. Other methods of data imputation are available.^{25} Variables listed in Table 2 were considered for inclusion in the weighting models.
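
Carrying values forward can be sketched as below for one participant's covariate across ordered intervals. The optional `max_gap` argument, which limits how many consecutive intervals a value may be carried, is an assumption added to reflect the 6-month horizon mentioned above; it is not a documented feature of the original analysis.

```python
def carry_forward(values, max_gap=None):
    """Fill None entries with the most recent observed value.

    values is ordered by interval. Leading None entries stay None. If
    max_gap is given, a value is carried for at most that many intervals.
    """
    filled = []
    last, age = None, 0
    for v in values:
        if v is not None:
            last, age = v, 0
            filled.append(v)
        else:
            age += 1
            usable = last is not None and (max_gap is None or age <= max_gap)
            filled.append(last if usable else None)
    return filled


# Hypothetical self-efficacy scores measured at enrollment and the final visit.
scores = [3, None, None, None, 4]
filled_all = carry_forward(scores)
filled_6_months = carry_forward(scores, max_gap=2)  # 2 intervals of 3 months
```

With `max_gap=2`, the third consecutive missing value is left missing rather than filled from a measurement more than 6 months old.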

Once a group of variables has been identified for weighting models, various approaches can be taken to model specification and selection.^{26,27} For these data, we constructed 4 sets of weights for treatment and censoring using different criteria for each (Table 3). First, we chose covariates based on expert knowledge of the subject matter. The second approach was based on an automated algorithm in the program R^{28} called the deletion substitution algorithm (DSA). DSA is a data-adaptive estimation procedure based on cross-validation, which chooses among models trying to minimize the mean-squared error of prediction.^{29,30} The third approach utilized another automated algorithm in R called polyclass, which builds a hierarchical set of models with main effects, linear spline terms and interaction terms and uses cross-validation to determine the optimal model size.^{31} We also constructed weights using a stepwise selection process, including all first order variables (main effects) and fractional polynomial terms (for continuous variables) with a *P*-value <0.1.

There is no single best way to perform model selection. Some model selection routines require extensive programming and may only be accessible with substantial help. We recommend trying a variety of approaches and a variety of models; using different permutations of an expert knowledge model or stepwise approaches is a reasonable alternative to automated selection techniques. If results differ substantially by model selection procedure, further work is required to determine how and why the weights differ. Finally, one can derive inference using robust standard errors or techniques such as bootstrapping.^{32}
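
One way bootstrapping could be set up with repeated measures is to resample participants rather than individual visits, so that each person's rows stay together. The sketch below is illustrative only: the function name, the percentile interval, and the toy estimator are assumptions, and in practice the `estimator` would ideally re-estimate the weights within each resample.

```python
import random


def cluster_bootstrap(data_by_id, estimator, n_boot=200, seed=0):
    """Percentile interval from resampling participants with replacement.

    data_by_id maps a participant id to that participant's list of rows.
    estimator takes a list of rows and returns a number.
    """
    rng = random.Random(seed)
    ids = list(data_by_id)
    estimates = []
    for _ in range(n_boot):
        sample_rows = []
        for pid in rng.choices(ids, k=len(ids)):
            sample_rows.extend(data_by_id[pid])  # keep a person's visits together
        estimates.append(estimator(sample_rows))
    estimates.sort()
    lo = estimates[(25 * n_boot) // 1000]        # approx. 2.5th percentile
    hi = estimates[(975 * n_boot) // 1000 - 1]   # approx. 97.5th percentile
    return lo, hi


def mean_outcome(rows):
    """Toy estimator: proportion of rows with a positive outcome."""
    return sum(rows) / len(rows)


# Hypothetical outcomes (1 = infection) for 4 participants with repeated visits.
toy = {1: [0, 0, 1], 2: [0, 0], 3: [1, 0, 0, 0], 4: [0]}
interval = cluster_bootstrap(toy, mean_outcome)
```

Fixing `seed` makes the resampling reproducible, and a larger `n_boot` gives more stable percentile limits.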

Weights should be assessed for their range: giving anyone a weight of 100 in a study with only 100 observations would mean that one person is contributing a large portion of the information towards the estimate of the parameter of interest. Extreme weights can be truncated; one can also truncate or transform variables that might be contributing to extreme or unstable weights, or one can restrict analyses to a subset of observations (see Experimental Treatment Assignment [ETA] assumption discussed below). One can also use so-called stabilized weights.^{17} Following sensitivity analyses for different truncation points, we chose to truncate our weights at a value of 20 to avoid overrepresenting any single observation or permitting extreme outliers to have very large weights.
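
The effect of truncation can be sketched with the 100-observation example above. The helper names are hypothetical. The stabilized weight shown uses the common form in which the numerator is the marginal probability of the observed exposure level; whether this matches the exact specification in the cited reference is an assumption.

```python
def truncate(weights, cap=20.0):
    """Cap each weight at `cap`."""
    return [min(w, cap) for w in weights]


def largest_share(weights):
    """Fraction of the total weight carried by the single largest weight."""
    return max(weights) / sum(weights)


def stabilized_weight(p_marginal, p_conditional):
    """Marginal over conditional probability of the observed exposure level."""
    return p_marginal / p_conditional


# Hypothetical study: 99 observations with weight 1 and one with weight 100.
weights = [1.0] * 99 + [100.0]
share_before = largest_share(weights)           # 100 / 199, about half
share_after = largest_share(truncate(weights))  # 20 / 119, about one sixth
```

Before truncation, a single observation carries roughly half of the total weight; after truncation at 20 it carries about 17%.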

##### Assumptions

The models we use require meeting a set of assumptions about the sampled data with respect to the larger population. In the case of IPW, the requisite assumptions have been laid out in detail.^{17,26} First, one assumes there is no unmeasured confounding. There is no way to empirically test for no unmeasured confounding; collection of data on a complete set of covariates should be incorporated in the design phase. For the Encontros study, we collected data on a broad and inclusive set of measurable covariates.

Second, time-ordering is necessary for causal inference. Because there is no way to know exactly when someone acquires an infection and because we cannot collect samples every day, we make the assumption in this analysis that variables reflecting activities, behaviors, and the social environment over the past 3 months predate infection at the end of the period (samples were collected at the end of each interval). Additionally, we have only included participants who were treated for previous infections, such that they are at risk in every interval included in the data.

Third, Experimental Treatment Assignment (ETA) or positivity assumes that groups defined by all possible combinations of covariates must have the potential to be in any (either) of the treatment groups. If there are covariate groups that will only be observed in one treatment state, then we cannot estimate the effect of the exposure within that group. There is also the practical ETA violation, which occurs when there are too few subjects who have one of the treatments in a covariate group to estimate the treatment effect in that group. Fundamentally ETA is not a problem of the IPW estimator, but a problem pertaining to covariate groups for which there is no experimentation and is, therefore, of issue across estimation approaches. However, most typical regression analyses will not provide obvious alerts of this problem (it is invisible to these techniques), whereas the IPW estimator can “blow-up” as weights get very large. This is an advantage of the IPW estimator, though often disappointing to the researcher because it raises a red flag about a lack of information in the data to answer the question of interest. In our data, using 3 of 4 modeling methods almost all participants had >0.05 chance of being in the high participation group. One can use rules of thumb to decide the importance of the bias (such as weights >20), though there are also established computer intensive methods to examine bias due to a potential ETA violation.^{33}
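
A simple screen for practical ETA violations can be sketched as follows. A probability of 0.05 for the observed exposure level corresponds to a weight of 1/0.05 = 20, which links the two rules of thumb mentioned above. Checking both tails of the estimated probability of high participation is an assumption of this sketch, and the function names are hypothetical.

```python
def flag_positivity(p_high, lower=0.05):
    """Indices where either exposure level has estimated probability < lower."""
    return [i for i, p in enumerate(p_high) if p < lower or p > 1 - lower]


def implied_weight(p_observed):
    """Unstabilized weight implied by the probability of the observed level."""
    return 1 / p_observed


# Hypothetical estimated probabilities of high participation.
p_high = [0.50, 0.03, 0.97, 0.20]
flagged = flag_positivity(p_high)
```

Observations at indices 1 and 2 are flagged: the first is very unlikely to be a high participator and the second is very unlikely to be a low participator, so either could receive a weight above 20.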

#### RESULTS

Overall, 420 cohort members contributed to 1346 data collection visits, of which 1306 occurred within the specified time frame for this analysis (up to 16 months following recruitment). Valid results for chlamydia and gonorrhea testing were available for 1262 visits, of which 848 occurred during the intervention (the other 414 were results from the enrollment visit). Eight observations were excluded because infections were not treated at least 30 days before the subsequent visit (the participant was not at risk of infection). As such, there are 840 observations from 333 cohort members contributing to this analysis.

Modeling procedures, resulting weights, and estimates of effect are presented in Table 3. Because there are repeated measures, the final analysis was performed using generalized estimating equations to account for the nonindependence of repeated measures on individuals.^{34} The crude estimate of effect was 0.51 (0.26–0.99): the odds that high participators would test positive for chlamydia or gonorrhea are half the odds of low participators. Estimates of the odds ratio generated using all of the modeling approaches are similar, between 0.43 and 0.53, and on the whole statistically significant. Our results suggest a protective effect of participating in the Encontros intervention, reducing the odds of incident infection by roughly half. Detailed results will be presented in a separate paper.
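
For a single binary exposure, the weighted point estimate can be sketched as the odds ratio from a weighted 2 × 2 table. This matches the point estimate from a weighted logistic model containing only the exposure under an independence working correlation, which is an assumption here because the working correlation used in the analysis is not stated. The sketch does not produce robust standard errors, and the data are hypothetical.

```python
def weighted_odds_ratio(rows):
    """Odds ratio from a weighted 2 x 2 table.

    rows is an iterable of (high_participation, sti, weight). Every cell of
    the table must have positive total weight, or the division fails.
    """
    cell = {(a, y): 0.0 for a in (0, 1) for y in (0, 1)}
    for a, y, w in rows:
        cell[(a, y)] += w
    odds_high = cell[(1, 1)] / cell[(1, 0)]
    odds_low = cell[(0, 1)] / cell[(0, 0)]
    return odds_high / odds_low


# Hypothetical observations, all with weight 1 (equivalent to a crude estimate).
toy_rows = (
    [(1, 1, 1.0)] * 1 + [(1, 0, 1.0)] * 9 +
    [(0, 1, 1.0)] * 2 + [(0, 0, 1.0)] * 8
)
crude_or = weighted_odds_ratio(toy_rows)  # (1/9) / (2/8)
```

Replacing the unit weights with the final weights for each observation would give the corresponding weighted estimate.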

The weights generated using expert knowledge and the polyclass algorithm (Table 3: models 1 and 3) had a more restricted range. The DSA and stepwise approaches (models 2 and 4) yielded a broader range of weights, with some people having a very low probability of treatment. A smaller range of weights can imply fewer violations of ETA, but may imply more bias in the estimates of the treatment/censoring model. There is no theoretically best way to choose a modeling approach. Again, we recommend using multiple approaches.

We also ran a more traditional analysis using generalized estimating equations to account for nonindependence and controlling for confounding with pooled multivariable logistic regression. Controlling for covariables utilized in the “expert knowledge” model that remained significant at *P* < 0.1 in multivariate analyses (including age at enrollment, self-efficacy, sex work outside of study city in the past 5 years, individual income, and number of unprotected sex acts in previous week), the estimate of effect was OR: 0.68 (95% CI, 0.34–1.3). This estimate is not directly comparable to the estimates in Table 3, which are marginal estimates of intervention effect over the distribution of covariates in the study population. Instead, this should be interpreted as the conditional effect of the intervention for a small subgroup of the study population: the youngest members of the cohort, with mean levels of self-efficacy, who did not work outside of the study area in the past 5 years, with the lowest tertile of income, and with no unprotected sex acts in the previous week.

#### DISCUSSION

The hegemony of the randomized trial design has dampened enthusiasm for alternative designs that are sometimes more appropriate and definitely less expensive for evaluating STI/HIV prevention programs. As the statistical methods described in this manuscript become increasingly accessible, observational designs should experience a revived interest for evaluating STI/HIV prevention, especially as renewed calls for an evidence base in multifaceted social and structural prevention initiatives gain momentum.^{11,35,36} The methods presented herein to reweight the study population provide a means to use longitudinal, observational study designs for causal estimates of programmatic effects, correcting for the biases common in observational studies. Weighting has not been used widely in prevention research to date; we believe that evaluations of behavioral, social, and multilevel interventions can benefit from use of these methods.

In the Encontros study, despite the likelihood that self-selection and loss-to-follow-up would skew results, there was less bias than expected. There were some differences between participants who remained in the intervention and those who were lost and between those who participated and those who did not, but overall those discrepancies did not substantially alter the estimates of intervention effect. Because we corrected the potential biases through application of weights and used various model selection approaches, we can infer that our findings are not likely due to uncontrolled confounding and that the intervention reduced incident infections among actively participating sex workers. Even so, the possibility of residual confounding, particularly from unmeasured factors, cannot be dismissed.

By using IPW we generated marginal estimates of effect: the average intervention effect across the entire study population, as if all participants were in the high exposure group compared to the counterfactual of having all participants in the low exposure group. (Note that unlike in linear or log linear models, even in the absence of interaction, the conditional OR does not represent the marginal effect due to the non-collapsibility of the odds ratio.^{37}) This marginal estimate is more useful in terms of planning for a public health impact than conditional estimates, or those generated controlling for covariates. We ran a conditional multivariate analysis for purposes of comparison and found a very different odds ratio (OR: 0.68). It is not surprising that this fundamentally different effect estimate, which describes the intervention effect in only a subset of the study population, was not very close to the marginal estimates in Table 3. Conditional estimates are often reported without acknowledgement that they are estimates of association or effect in a small subset of the study population. Had we run only a traditional (nonweighted) analysis, we might have interpreted our findings to indicate that the intervention was not successful. Care needs to be taken in reporting conditional estimates and interpreting them as global effect estimates. We also note that this traditional approach relies on the dubious assumption that an a priori specified model is correct. Methods using machine learning to describe the treatment mechanism, such as the DSA described herein, make no such (extremely important) assumptions.
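
Non-collapsibility of the odds ratio can be shown with a small numeric sketch. The risks below are hypothetical and chosen so that the conditional odds ratio is identical in two equal-sized strata with no confounding (exposure is independent of stratum), yet the marginal odds ratio differs.

```python
from fractions import Fraction


def odds(p):
    """Convert a probability to odds."""
    return p / (1 - p)


# Risk of the outcome by stratum and exposure level; hypothetical numbers.
risk = {
    "stratum_1": {"exposed": Fraction(1, 2), "unexposed": Fraction(1, 5)},
    "stratum_2": {"exposed": Fraction(8, 9), "unexposed": Fraction(2, 3)},
}

# Conditional (stratum-specific) odds ratios.
conditional_or = {
    s: odds(r["exposed"]) / odds(r["unexposed"]) for s, r in risk.items()
}

# Marginal risks: average over two equal-sized strata.
marginal_exposed = sum(r["exposed"] for r in risk.values()) / len(risk)
marginal_unexposed = sum(r["unexposed"] for r in risk.values()) / len(risk)
marginal_or = odds(marginal_exposed) / odds(marginal_unexposed)
```

Both stratum-specific odds ratios equal 4, while the marginal odds ratio is 425/143, or about 2.97. The conditional and marginal odds ratios therefore answer different questions even when there is no confounding or interaction.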

Additional estimation approaches are available to balance observational data, including the G-computation algorithm, which can be used for continuous exposures and is based on imputing the outcomes for individuals keeping all covariates fixed at their original values but modifying the treatment variable of interest.^{38,39} Another iteration of the weighting approach is the use of propensity scores. With propensity scores the probability of treatment is modeled for each participant, much the same as it is with IPW; however, the resulting score for treatment probability is used to stratify populations or match participants to allow for estimation of association or effect within comparable strata or matched pairs.^{40,41} These estimation approaches can also be applied in other observational studies and in trial designs. For example, IPW can rebalance study populations at different time points in serial cross-sectional study designs in order to estimate intervention effects while adjusting for secular changes in population. In trials, estimation approaches such as IPW and G-computation can be applied to account for post-randomization differences in distribution of important covariates (i.e., empirical confounding).
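
The idea behind G-computation can be sketched for the simplest case: a single time point, a binary exposure, and one binary confounder. Observed cell risks stand in for an outcome regression, which is a simplifying assumption. The function name and data are hypothetical, and every confounder-by-exposure cell must contain at least one observation.

```python
from collections import defaultdict


def g_computation(rows):
    """Marginal risk under exposure = 1 and exposure = 0.

    rows is a list of (confounder, exposure, outcome). Each person's outcome
    is predicted with the exposure set to a fixed level while the confounder
    is held at its observed value; predictions are then averaged.
    """
    n = defaultdict(int)
    events = defaultdict(int)
    for c, a, y in rows:
        n[(c, a)] += 1
        events[(c, a)] += y
    risk = {}
    for a_set in (0, 1):
        predicted = [events[(c, a_set)] / n[(c, a_set)] for c, _, _ in rows]
        risk[a_set] = sum(predicted) / len(rows)
    return risk[1], risk[0]


# Hypothetical data: (confounder, exposure, outcome).
toy = (
    [(1, 1, 1)] * 3 + [(1, 1, 0)] * 3 + [(1, 0, 1)] * 2 +
    [(0, 1, 0)] * 2 + [(0, 0, 1)] * 3 + [(0, 0, 0)] * 3
)
risk_exposed, risk_unexposed = g_computation(toy)
```

In these toy data the crude risks are 0.375 and 0.625, whereas the standardized risks are 0.25 and 0.75, because the confounder is unevenly distributed between exposure groups.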

Causal inference methods, including marginal structural models, have rarely been presented in forums specific to STI prevention. IPW is one approach that can help improve the quality of and maximize the use of data generated in observational studies and program evaluations of STI/HIV prevention. Given limited resources, rigorous evaluations of observational community-based interventions using these methods may provide more information regarding successful intervention approaches than funding a limited number of RCTs. Application of IPW is not beyond prevention researchers; however, users should be attentive about meeting the assumptions necessary to apply the methods correctly.