The consequences of misclassified binary outcome or exposure variables when estimating a crude odds ratio (OR) are well understood.^{1–5} Existing literature also covers the use of validation data to estimate crude ORs while adjusting for misclassification in case-control and cross-sectional studies,^{6–11} considering the relative merits of external versus internal validation study designs.^{1,11,12} In regression applications, many researchers advocate the use of validation data to adjust for measurement error in continuous predictors.^{13–17}

Regarding outcome misclassification for discrete responses, Magder and Hughes^{18} outline the problem under logistic regression and advocate maximum likelihood using an expectation-maximization algorithm.^{19} Their work primarily addresses the case of known misclassification probabilities (ie, sensitivities and specificities) characterizing the observed outcome variable. While continuing to focus on the known sensitivities/specificities case, Neuhaus^{20} provides further insight into asymptotic bias and efficiency in the broader realm of the generalized linear model, as well as a more efficient computational maximum likelihood approach. Recent articles in the epidemiologic literature demonstrate Monte Carlo-based techniques that similarly facilitate sensitivity analyses with misclassified binary variables.^{21,22}

Other related research includes extensions to settings with count or discrete survival outcomes.^{23–25} To incorporate validation data, some authors gravitate toward Bayesian approaches using prior assumptions about misclassification probabilities.^{26–28} From the parametric frequentist perspective, Carroll et al^{11} provide general expressions for likelihood functions that accommodate internal validation data. Alternative developments include robust modeling of sensitivity and specificity via kernel smoothers,^{29} with comparisons of that approach versus parametrically modeling their dependence upon covariates.^{30}

Our aim is to provide guidance for epidemiologists seeking accessible and efficient methods for obtaining validation data-based estimates of logistic regression parameters when the outcome is misclassified. We keep to a likelihood-based approach, as it avoids explicit specification of prior distributions and is readily facilitated for binary outcomes. In the general case, we model the dependence of sensitivity and specificity upon covariates via a second logistic regression model, promoting a flexible and intuitively appealing analytic approach.

The methodology that we illustrate is a direct expansion of the known misclassification rate setting considered by Magder and Hughes^{18} and Neuhaus,^{20} a covariate-adjusted extension of well-discussed methods for estimating crude ORs,^{6–9} and ultimately an application of the general main/validation study maximum likelihood approach outlined in Carroll et al.^{11} However, there have been few if any real-world applications of the latter approach making use of internal validation data, and such application presents computational challenges to the practicing epidemiologist. Therefore, our goal is to bring this approach for addressing outcome misclassification in regression closer to the forefront of epidemiologic research. We pursue this aim by highlighting an instructive example involving misclassified outcome status in the HIV Epidemiology Research Study, by transparent exposition of appropriate likelihood functions, and by providing appendices with straightforward computer code that connects directly with that exposition.

## METHODS

Assume we wish to fit the following logistic regression model to cross-sectional data (we discuss implications of case-control sampling later):

logit(τ_{**x**}) = β_{0} + β_{1}X_{1} + β_{2}X_{2} + … + β_{P}X_{P}, where τ_{**x**} = Pr(Y = 1 ∥ X_{1}, …, X_{P}). (1)

We use the symbol τ for easy reference to Eq. (1) in Appendix 1. Instead of the true (0, 1) response Y, suppose the primary (main) study relies upon an error-prone (0, 1) alternative Y*. It is known that misclassification in Y* potentially invalidates estimates of (β_{0}, …, β_{p}) based on the “naive” model that replaces Y by Y* in Eq. (1). The magnitudes and directions of biases in the naive estimates depend upon the diagnostic properties of Y* as a substitute for Y.^{18}
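
To illustrate this bias numerically, the following Python sketch (hypothetical parameter values, unrelated to the study data) simulates a true outcome Y from a logistic model, degrades it into Y* with sensitivity 0.6 and specificity 0.95, and compares naive and true-outcome slope estimates:

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20000
x = rng.normal(size=n)
tau = 1 / (1 + np.exp(-(-0.5 + 1.0 * x)))        # Pr(Y = 1 | x), as in Eq. (1)
y = rng.binomial(1, tau)
se, sp = 0.6, 0.95                               # assumed diagnostic properties of Y*
ystar = np.where(y == 1, rng.binomial(1, se, n), rng.binomial(1, 1 - sp, n))

def fit_logistic(resp, x, iters=25):
    """Newton-Raphson fit of a one-predictor logistic regression."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.zeros(2)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ b))
        W = mu * (1 - mu)
        b += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (resp - mu))
    return b

b_true = fit_logistic(y, x)       # slope near the generating value 1.0
b_naive = fit_logistic(ystar, x)  # slope attenuated toward zero
print(b_true, b_naive)
```

The attenuation in the naive slope mirrors the bias patterns described by Magder and Hughes.^{18}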

In the nondifferential case,^{1} the critical diagnostic properties boil down to 2 parameters, sensitivity (SE) and specificity (SP):

SE = Pr(Y* = 1 ∥ Y = 1) and SP = Pr(Y* = 0 ∥ Y = 0). (2)

If misclassification is differential, however, then sensitivity and specificity can vary according to subject-specific variables, making effects of misclassification less predictable.^{1} Thus, we define

SE_{**x**} = Pr(Y* = 1 ∥ Y = 1, **X** = **x**) and SP_{**x**} = Pr(Y* = 0 ∥ Y = 0, **X** = **x**), (3)

where the vector **X** is usually some subset of (X_{1}, X_{2}, …, X_{p}).

### Sensitivity Analyses

Suppose first that no validation data are available so that one has only main study data consisting of (y_{i}*, x_{i1}, …, x_{iP}) on the ith experimental unit (i = 1, …, n_{m}). In this case, each independent record contributes the following likelihood term:

Pr(Y* = y_{i}* ∥ **x**_{i}) = Σ_{y=0}^{1} Pr(Y* = y_{i}* ∥ Y = y, **x**_{i}) Pr(Y = y ∥ **x**_{i}). (4)

The first term after the summation in Eq. (4) is determined by SE_{x} and SP_{x}, while the second follows directly from Eq. (1). The overall likelihood is proportional to the product, i.e.,

*L*_{m} = Π_{i=1}^{n_{m}} Σ_{y=0}^{1} Pr(Y* = y_{i}* ∥ Y = y, **x**_{i}) Pr(Y = y ∥ **x**_{i}). (5)

While it may technically be possible to estimate (β_{1}, …, β_{p}) based only on main study data without supplying values of misclassification probabilities,^{11,18} these parameters will be weakly identifiable at best.^{11} Neuhaus^{20} notes that estimability of misclassification rates is compromised under misspecification of the primary model (Eq. (1) here), further emphasizing the limited utility of a main study-only analysis. Thus, use of Eq. (5) is effectively limited to sensitivity analysis, wherein one supplies assumed values of SE_{x} and SP_{x}.

Both the EM approach of Magder and Hughes^{18} and the alternative maximum likelihood conceptualization of Neuhaus^{20} serve to maximize Eq. (5) after prespecifying sensitivity and specificity values. For an implementation under nondifferential misclassification that adapts readily to the differential case, Appendix 1 provides ready-to-use computer code utilizing the capacity for user-specified log-likelihood functions in the SAS NLMIXED procedure.^{31} To specify the likelihood to the level of detail required for programming, note that Eq. (5) may be written as follows under nondifferentiality:

*L*_{m} ∝ Π_{i=1}^{n_{m}} P_{i}^{y_{i}*}(1 − P_{i})^{1−y_{i}*}, (6)

where

P_{i} = Pr(Y* = 1 ∥ **x**_{i}) = SE × τ_{i} + (1 − SP)(1 − τ_{i}),

with

τ_{i} = Pr(Y = 1 ∥ **x**_{i}) = exp(β_{0} + β_{1}x_{i1} + … + β_{P}x_{iP})/{1 + exp(β_{0} + β_{1}x_{i1} + … + β_{P}x_{iP})}

via Eq. (1). A special case of general likelihood expressions provided in Carroll et al,^{11} this structure is directly reflected in the first sample program in Appendix 1.
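
As a concrete parallel to that program, the following Python/scipy sketch (our illustration with simulated data, not the Appendix 1 SAS code) maximizes the main study likelihood with SE and SP prespecified, as in a sensitivity analysis:

```python
import numpy as np
from scipy.optimize import minimize

def negloglik_main(beta, ystar, X, se, sp):
    """-log L_m with fixed SE/SP: Pr(Y* = 1 | x) = SE*tau + (1 - SP)*(1 - tau)."""
    tau = 1 / (1 + np.exp(-(X @ beta)))   # Pr(Y = 1 | x) per Eq. (1)
    p = se * tau + (1 - sp) * (1 - tau)
    return -np.sum(ystar * np.log(p) + (1 - ystar) * np.log(1 - p))

# Simulated stand-in for main study data (true beta = (-0.5, 1.0)).
rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
tau = 1 / (1 + np.exp(-(X @ np.array([-0.5, 1.0]))))
y = rng.binomial(1, tau)
ystar = np.where(y == 1, rng.binomial(1, 0.8, n), rng.binomial(1, 0.1, n))

# Supply the assumed (here, correct) SE = 0.8 and SP = 0.9 and maximize.
fit = minimize(negloglik_main, x0=np.zeros(2),
               args=(ystar, X, 0.8, 0.9), method="BFGS")
print(fit.x)  # approximately recovers (-0.5, 1.0)
```

In practice one would repeat the fit over a grid of plausible (SE, SP) pairs to map out the sensitivity analysis.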

### Main Study + External Validation Data: Nondifferential Misclassification

Because sensitivity analysis is seldom a fully satisfying solution, we emphasize using validation data to estimate (β_{1}, …, β_{p}) in Eq. (1) without prespecifying sensitivity and specificity values. When the validation sample is external^{1} (ie, separate from the main study), we confine attention to the nondifferential case because external studies seldom measure the same covariates as the main study. External validation is also limited by a need to assume “transportability,” ie, that sensitivity and specificity parameters targeted in the validation sample are identical to those operating in the main study.^{11,12} In the remainder of the paper, we use the shorthand “main/external” and “main/internal” to refer to settings in which main study data are combined with external or internal validation data, respectively.

Given that our primary focus is upon the analysis of main/internal study data as required in the motivating example, we relegate details of the main/external case to Appendix 2. The structure of the resulting main/external likelihood is reflected in the second SAS NLMIXED program found in Appendix 1.

### Main Study + Internal Validation Data: Differential Misclassification

Our main interest lies in the case in which an internal validation sample (of size n_{v}) is randomly selected from the overall study sample. Again, main study experimental units contribute records of the form (y_{i}*, x_{i1}, …, x_{iP}). In contrast, resources are expended toward those selected for validation to augment their records with the true outcome status (y_{i}). Benefits of this supplemental data collection effort include removal of concern about transportability and flexibility to allow general patterns of differential misclassification.

As a first example, consider the case of 2 covariates, one continuous (X_{1}) and one binary (X_{2}), where sensitivity and specificity depend on X_{2}. That is, define SE_{t} = Pr(Y* = 1 ∥ Y = 1, X_{2} = t) and SP_{t} = Pr(Y* = 0 ∥ Y = 0, X_{2} = t) (t = 0, 1). Main study contributions remain of the form in (4), yielding the following main study likelihood:

*L*_{m} = Π_{i=1}^{n_{m}} Π_{t=0}^{1} {[SE_{t}τ_{i} + (1 − SP_{t})(1 − τ_{i})]^{y_{i}*} [(1 − SE_{t})τ_{i} + SP_{t}(1 − τ_{i})]^{1−y_{i}*}}^{I(x_{i2} = t)}, (7)

where I(.) is a binary (0, 1) indicator for whether the condition in parentheses is true. In contrast, internal validation data records contribute terms of the form

Π_{t=0}^{1} {[SE_{t}^{y_{i}*}(1 − SE_{t})^{1−y_{i}*}]^{y_{i}} [(1 − SP_{t})^{y_{i}*}SP_{t}^{1−y_{i}*}]^{1−y_{i}}}^{I(x_{i2} = t)} × τ_{i}^{y_{i}}(1 − τ_{i})^{1−y_{i}},

yielding an internal validation subsample likelihood as follows:

*L*_{v} = Π_{i=1}^{n_{v}} Π_{t=0}^{1} {[SE_{t}^{y_{i}*}(1 − SE_{t})^{1−y_{i}*}]^{y_{i}} [(1 − SP_{t})^{y_{i}*}SP_{t}^{1−y_{i}*}]^{1−y_{i}}}^{I(x_{i2} = t)} × τ_{i}^{y_{i}}(1 − τ_{i})^{1−y_{i}}. (8)

Again, the full likelihood is proportional to *L* = *L*_{m} × *L*_{v}.

For the general case in which model (1) includes arbitrary predictors (X_{1}, …, X_{P}), we assume sensitivity and specificity depend on (X_{2}*, …, X_{K}*), which may denote a subset of (X_{1}, …, X_{P}) and/or include other variables or interaction terms. We favor a second logistic model to define associations between these predictors and sensitivity/specificity:

logit{Pr(Y* = 1 ∥ Y, X_{2}*, …, X_{K}*)} = θ_{0} + θ_{1}Y + θ_{2}X_{2}* + … + θ_{K}X_{K}*. (9)

Assuming an adequate internal validation sample, Eq. (9) allows us to flexibly account for differential misclassification. It does so in a potentially robust manner when (X_{2}*, …, X_{K}*) consists of categorical variables. For subject i contributing predictor values **x**_{i}, Eq. (9) implies that

SE_{**x**_{i}} = Pr(Y* = 1 ∥ Y = 1, **x**_{i}) = exp(λ_{i} + θ_{1})/{1 + exp(λ_{i} + θ_{1})} (10)

and

1 − SP_{**x**_{i}} = Pr(Y* = 1 ∥ Y = 0, **x**_{i}) = exp(λ_{i})/{1 + exp(λ_{i})},

where

λ_{i} = θ_{0} + θ_{2}x_{i2}* + … + θ_{K}x_{iK}*.

Maximum likelihood estimates (MLEs) for the differential sensitivity and specificity parameters follow from the MLE of θ = (θ_{0}, θ_{1}, …, θ_{K}).

The full likelihood for this general case is proportional to *L* = *L*_{m} × *L*_{v}, where

*L*_{m} = Π_{i=1}^{n_{m}} [SE_{**x**_{i}}τ_{i} + (1 − SP_{**x**_{i}})(1 − τ_{i})]^{y_{i}*} [(1 − SE_{**x**_{i}})τ_{i} + SP_{**x**_{i}}(1 − τ_{i})]^{1−y_{i}*} (11)

(identical to Eq. (6) except for covariate effects on sensitivity and specificity), and

*L*_{v} = Π_{i=1}^{n_{v}} [SE_{**x**_{i}}^{y_{i}*}(1 − SE_{**x**_{i}})^{1−y_{i}*}]^{y_{i}} [(1 − SP_{**x**_{i}})^{y_{i}*}SP_{**x**_{i}}^{1−y_{i}*}]^{1−y_{i}} τ_{i}^{y_{i}}(1 − τ_{i})^{1−y_{i}}. (12)

This likelihood structure is reflected in the third SAS NLMIXED program in Appendix 1. The likelihood itself is equivalent to a general expression found in the paper by Carroll et al.^{11} We present it more explicitly here to enhance its clarity and connection with the provided program.
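
The same structure translates readily into other optimizers. As a language-neutral illustration, the following Python sketch (all values hypothetical; not the Appendix 1 program) codes the main and validation likelihood contributions, with misclassification governed by a model of the form of Eq. (9), and jointly estimates β and θ on simulated main/internal data:

```python
import numpy as np
from scipy.optimize import minimize

def expit(a):
    return 1 / (1 + np.exp(-a))

def joint_negloglik(params, Xm, Zm, ystar_m, Xv, Zv, yv, ystar_v):
    """-log of L = L_m x L_v, where Eq. (9)-style misclassification is
    logit Pr(Y* = 1 | Y, z) = theta0 + theta1*Y + z @ theta[2:]."""
    k = Xm.shape[1]
    beta, theta = params[:k], params[k:]
    def pr_star(z, y):                       # Pr(Y* = 1 | Y = y, z)
        return expit(theta[0] + theta[1] * y + z @ theta[2:])
    clip = lambda p: np.clip(p, 1e-12, 1 - 1e-12)
    # Main study: true Y unobserved, so marginalize over it.
    tau_m = expit(Xm @ beta)
    p1 = clip(pr_star(Zm, 1) * tau_m + pr_star(Zm, 0) * (1 - tau_m))
    ll = np.sum(ystar_m * np.log(p1) + (1 - ystar_m) * np.log(1 - p1))
    # Validation subsample: both Y and Y* observed.
    tau_v = clip(expit(Xv @ beta))
    q = clip(pr_star(Zv, yv))
    ll += np.sum(ystar_v * np.log(q) + (1 - ystar_v) * np.log(1 - q)
                 + yv * np.log(tau_v) + (1 - yv) * np.log(1 - tau_v))
    return -ll

# Simulated example: binary X2 drives differential misclassification.
rng = np.random.default_rng(4)
n, n_v = 4000, 1000
x1, x2 = rng.normal(size=n), rng.binomial(1, 0.5, n)
X = np.column_stack([np.ones(n), x1, x2])
Z = x2[:, None]                              # covariate in the Eq. (9) model
beta_true, theta_true = np.array([-0.5, 1.0, 0.5]), np.array([-2.0, 4.0, 1.0])
y = rng.binomial(1, expit(X @ beta_true))
ystar = rng.binomial(1, expit(theta_true[0] + theta_true[1] * y
                              + theta_true[2] * x2))
v = rng.choice(n, size=n_v, replace=False)   # internal validation subset
m = np.setdiff1d(np.arange(n), v)
fit = minimize(joint_negloglik, x0=np.zeros(6),
               args=(X[m], Z[m], ystar[m], X[v], Z[v], y[v], ystar[v]),
               method="BFGS")
beta_hat = fit.x[:3]
print(beta_hat)  # close to the generating values (-0.5, 1.0, 0.5)
```

The clipping of probabilities away from 0 and 1 is a numerical safeguard during optimization; the MLE itself lies in the interior.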

As with any parametric model, Eq. (9) makes likelihood ratio tests available based on Eqs. (11) and (12) to aid in model selection and to assess whether predictors are associated with sensitivity and specificity. In particular, it permits testing the hypothesis of completely nondifferential misclassification, ie, H_{0}: θ_{2} = θ_{3} = … = θ_{K} = 0.
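
Given the maximized log-likelihoods of the two nested fits, the test is routine to compute. In this sketch the log-likelihoods are illustrative stand-ins (chosen to echo the χ^{2} = 20.1 that arises in the example), not output from real data:

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods: the full model lets theta2..theta4
# vary freely; the reduced (nondifferential) model fixes them at 0.
ll_full, ll_reduced = -1030.20, -1040.25
df = 3                                   # number of constrained theta's
lrt = 2 * (ll_full - ll_reduced)         # likelihood ratio statistic
pval = chi2.sf(lrt, df)
print(round(lrt, 2), pval < 0.001)       # 20.1 True
```

A significant statistic argues for retaining covariate effects on sensitivity and specificity in the joint likelihood.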

### Comments Regarding Case-control Data

Prior treatments of outcome misclassification^{18} offered limited or no applicability under outcome-dependent sampling, despite well-known classic results^{32} establishing the utility of logistic regression for retrospective studies. It is thus of interest to explore whether and to what extent the recommended maximum likelihood approach accommodates the case-control design. By “case-control” here, we mean that sampling is based on the error-prone response (Y*), with a higher sampling probability applied to “cases” (those with Y* = 1) than to “controls” (those with Y* = 0). We find that, with certain caveats, the internal validation study-based analysis proposed here can be used without modification despite such “case” oversampling.

Specifically, the method described in the previous subsection yields valid estimates of (β_{1}, …, β_{p}) under model (1) when sampling favors those with Y* = 1, assuming nondifferential misclassification of case/control status. As with the classic case,^{32} the intercept loses its original interpretation. For similar reasons, the likelihood-based estimates of sensitivity and specificity will no longer reflect the true diagnostic properties of Y*. Rather, these tend to be inflated and deflated, respectively, in concert with the oversampling of cases according to Y*. In fact, the fallibility of the internal validation-based sensitivity/specificity estimators due to “case” oversampling is key to the validity of the (β_{1}, …, β_{p}) estimates, as these estimators reflect the “operating” sensitivity and specificity of Y* under the sampling strategy employed. In contrast, direct analysis based on external validation data (or even employing correct assumed values of sensitivity and specificity) misconstrues the “operating” sensitivity and specificity, generally yielding inconsistent estimates of (β_{1}, …, β_{p}). This may explain why methods^{18,20–22} that are not based on internal validation data encounter problems for case-control studies.
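
A small simulation (hypothetical numbers, independent of the study data) makes the “operating” sensitivity/specificity point concrete: with nondifferential SE = SP = 0.8 and retention of all records with Y* = 1 but only a 5% sample of those with Y* = 0, the retained data exhibit sensitivity near 0.99 and specificity near 0.17:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000
y = rng.binomial(1, 0.3, n)                        # true outcome status
ystar = np.where(y == 1, rng.binomial(1, 0.8, n),  # SE = 0.8
                 rng.binomial(1, 0.2, n))          # 1 - SP = 0.2
keep = (ystar == 1) | (rng.random(n) < 0.05)       # oversample "cases" on Y*
ys, yt = ystar[keep], y[keep]
op_se = ys[yt == 1].mean()                         # operating sensitivity
op_sp = 1 - ys[yt == 0].mean()                     # operating specificity
print(round(op_se, 2), round(op_sp, 2))
```

These “operating” values, not the population values of 0.8, are what an internal validation subsample estimates, and that is precisely why the (β_{1}, …, β_{p}) estimates remain valid.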

The validity of the main/internal validation study-based maximum likelihood approach for such case-control sampling with nondifferential outcome misclassification recalls theoretical results in the statistical literature,^{33} and can be demonstrated by noting that terms involving the selection probabilities applied to those with Y* = 1 and Y* = 0 factor out of the likelihood. In contrast, no such clean factorization occurs under differential misclassification. Nevertheless, if differential outcome misclassification is appropriately modeled via Eqs. (11) and (12), empirical evidence via simulation under large samples suggests that the MLEs for some elements of (β_{1}, …, β_{p}) in model (1) may remain valid under “case” oversampling. Specifically, our experimentation suggests that β coefficients in model (1) remain reliably estimable if they correspond to predictor variables that are not needed in the second regression model (9) that defines sensitivity and specificity. A simulation study illustrating these points follows after the example section.

## EXAMPLE

Our example concerns data on bacterial vaginosis status for women in the HIV Epidemiology Research Study. A total of 1310 (871 HIV-infected and 439 at-risk uninfected) women were enrolled into this prospective study across 4 US cities from 1993 to 1995.^{34} Researchers diagnosed bacterial vaginosis semi-annually by 2 different techniques, referred to as the “CLIN” (clinically-based) and “LAB” (laboratory-based) methods. A CLIN diagnosis required the presence of 3 or more specific clinical conditions based on a modification of Amsel et al criteria,^{35} while LAB diagnoses were made via a sophisticated Gram-staining technique.^{36} Prior references^{37,38} provide details on these methods in the study. As in Gallo et al,^{38} we treat the more costly LAB method as a gold standard assessment, while the CLIN approach represents an accessible error-prone substitute. These authors found evidence of low sensitivity for the CLIN method, and suggested that its accuracy may suffer due to wide heterogeneity in bacterial vaginosis cases or due to the need for technicians to be trained to properly apply the subjective criteria of Amsel et al.^{38}

A unique feature of this example is that both LAB and CLIN diagnoses were made regularly. Thus, in addition to fitting a “naive” main study-only version of model (1) with CLIN status (Y*) substituted for LAB (Y), we were able to fit Eq. (1) to data using the assumed gold standard (Y) on all subjects. While the illustration of validation data-based adjusted analyses then requires ignoring LAB data on a random subset, an advantage is that we have an “ideal” complete-data model for comparison.

We use data from the fourth semi-annual study visit on 982 black, white, and Hispanic women who were 25 years or older at enrollment. Available variables potentially associated with bacterial vaginosis status include age, race, HIV status (0 if negative, 1 if positive), and HIV risk group (0 if via sexual contact; 1 if intravenous drug use). Study site and CD4 counts among HIV positives showed little association with bacterial vaginosis status in this sample.

Median age at enrollment was 37 years. Other potential bacterial vaginosis risk factors are distributed as follows: race/ethnicity (60% black, 24% white, 16% Hispanic); HIV status (69% positive, 31% negative); and HIV risk group (47% sexual, 53% intravenous drug use). Among women with data on bacterial vaginosis, 41% were positive via the LAB method versus 25% based on CLIN. Unadjusted estimates were 0.53 (sensitivity) and 0.94 (specificity), suggesting that CLIN yields a low risk of false positives but high risk of false negatives.

For an “ideal” comparative analysis, we first fit Eq. (1) to all women, with the gold standard diagnosis (LAB; 1 vs. 0) as the outcome. Preliminary analyses revealed similar bacterial vaginosis prevalence among white and Hispanic women, so we created a binary variable (0 if nonblack, 1 if black). Initially dichotomizing age at the median, we assessed second- and higher-order interactions among age, race, HIV status, and risk group. A likelihood ratio test supported elimination of all 11 interaction terms.

A total of 924 women, with complete data on both bacterial vaginosis assessments and all risk factors, contributed to the fitted models summarized in Table 1. The upper half of the table summarizes the fit of the resulting version of model (1) for LAB status, in which we treat age (in years) continuously:

logit{Pr(Y = 1 ∥ X_{1}, …, X_{4})} = β_{0} + β_{1}X_{1} + β_{2}X_{2} + β_{3}X_{3} + β_{4}X_{4}, (13)

where X_{1}–X_{4} denote race (black vs. nonblack), HIV risk group, HIV status, and age, respectively.

We then fit the same model upon substituting the error-prone CLIN diagnosis as the outcome (lower half of Table 1). The 2 analyses differ markedly in terms of magnitude of the estimated OR for HIV risk group (1.50 for LAB, 2.68 for CLIN), and directionality of the estimated OR for HIV status (1.19 for LAB, 0.71 for CLIN).

To illustrate misclassification adjustment, we selected a random internal validation subset of size n_{v} = 300 women. Predictor selection via model (9) fit to these 300 women revealed no independent association between race and CLIN status. Pairwise and higher-order interactions among LAB status, risk group, HIV status, and age (dichotomized for purposes of estimating sensitivity and specificity) were nonsignificant as a group. The version of Eq. (9) used in the main/internal validation study likelihood is

logit{Pr(Y* = 1 ∥ Y, **x**)} = θ_{0} + θ_{1}Y + θ_{2}(risk group) + θ_{3}(HIV status) + θ_{4}AGEGTMED, (14)

where AGEGTMED indicates whether a subject's age at enrollment exceeded the median.

The upper half of Table 2 summarizes a complete analysis of the data via the joint likelihood in Eqs. (11) and (12). For comparison, the lower half of Table 2 gives corresponding results assuming nondifferential misclassification (restricting θ_{2} = θ_{3} = θ_{4} = 0 in Eq. (14)). The likelihood ratio test comparing the joint models with and without the nondifferentiality assumption was highly significant (χ^{2} = 20.1, *P* < 0.001), strongly confirming a need to account for dependence of the sensitivity and specificity of the CLIN diagnosis upon subject-specific covariates. Note that the analysis in the upper half of Table 2 yields the same interpretations as the “ideal” analysis (upper half, Table 1), in terms of directionalities and magnitudes of the estimated ORs. In contrast, results in the lower half of Table 2 are similar to those of the “naive” analysis (lower half, Table 1), showing an elevated estimate for risk group and negative directionality for HIV status. This highlights the value of internal validation data for modeling sensitivity and specificity.

Table 3 provides the MLE of (θ_{0}, θ_{1}, θ_{2}, θ_{3}, θ_{4}) in Eq. (14) based on the joint likelihood in Eqs. (11) and (12). Note that all 3 predictors (risk group, HIV status, and age) are independently associated with sensitivity and specificity. Table 3 also provides the corresponding MLEs of (SE, SP) via Eqs. (9) and (10), with multivariate delta method-based standard errors (details available from the authors). Holding other variables constant, sensitivity tends to be higher (and specificity lower) for those who are in the intravenous drug use risk group, younger, or HIV-negative. The variation in these estimates gives further credence to the differential nature of outcome misclassification in this real-data example.

## SIMULATION STUDIES

### Simulation I: Mimicking Real-data Example

Our primary simulation experiment evaluates the general main/internal validation study analysis outlined in Eqs. (9)–(12) under conditions mimicking the example. Four predictors (X_{1}–X_{4}) were randomly generated with distributions like those observed at study Visit 4 for “Race” (black vs. nonblack), “Risk Group,” “HIV status,” and “age,” respectively. True outcomes (Y) were simulated according to Eq. (13), with β coefficients equal to the estimates reported in the top portion of Table 2. Error-prone outcomes (Y*) were generated via Eq. (14), with θ's equal to the estimates at the top of Table 3. For 1000 such datasets, we conducted the naive analysis in addition to 2 main/internal validation analyses based on Eqs. (11) and (12). The first of these assumed the appropriate differential misclassification model, and the second incorrectly assumed nondifferentiality.

Table 4 summarizes the results. The naive analysis produces highly biased estimates, with means comparable to the estimates from the example with CLIN as the outcome (Table 1, bottom). Main/internal validation study-based analysis assuming the correct differential misclassification model produces reliable estimates of all 4 β coefficients, and excellent confidence interval (CI) coverage. In contrast, the main/internal analysis based on erroneously assuming nondifferentiality produces average parameter estimates remarkably similar to the estimates reported in the lower half of Table 2. These are invalid except for the estimate of β_{1}, corresponding to the predictor (X_{1}) that was unassociated with sensitivity and specificity in model (14).

### Simulation II: Misclassification of Case-control Status

Table 5 summarizes simulations assessing the internal validation study-based methods under “case-control” sampling as previously described. The version of model (1) for generating data was as follows:

logit{Pr(Y = 1 ∥ X_{1}, X_{2})} = β_{0} + β_{1}X_{1} + β_{2}X_{2},

where X_{1} is standard normally distributed and X_{2} is a Bernoulli(0.5) binary predictor. The true regression coefficients were (β_{0}, β_{1}, β_{2}) = (−0.4, 2.0, 0.5). For both scenarios in Table 5, approximately 5200 observations were first generated via the above model in a cross-sectional manner. Error-prone (Y*) values were then generated, potentially allowing sensitivity and specificity to vary with X_{2} [ie, assuming SE_{t} = Pr(Y* = 1 ∥ Y = 1, X_{2} = t) and SP_{t} = Pr(Y* = 0 ∥ Y = 0, X_{2} = t) (t = 0, 1)]. To mimic case-control sampling, we used 100% of data records with Y* = 1 in each case but retained only a 5% random sample of those with Y* = 0. Under these conditions, each simulated “case-control” sample contained approximately 1500 observations, of which 500 were randomly selected into an internal validation sample. The main/internal validation study likelihood used to analyze each data set is specified in Eqs. (7) and (8).

The top half of Table 5 summarizes results for a nondifferential case, in which SE_{1} = SP_{1} = SE_{0} = SP_{0} = 0.8. As noted above under “Comments Regarding Case-Control Data,” maximum likelihood estimates of the SE and SP parameters differed from the true value of 0.8 on average, reflecting the “operating” sensitivity and specificity under “case” oversampling. However, estimates of β_{1} and β_{2} are quite reliable, with means near the true values of 2 and 0.5 and near-nominal CI coverage. The bottom half of Table 5 summarizes a differential case, where SE_{1} = 0.8, SP_{1} = 0.7, SE_{0} = 0.6, and SP_{0} = 0.9. Note that the coefficient (β_{1}), corresponding to the predictor (X_{1}) that was not associated with sensitivity and specificity, remains validly estimated. As also mentioned earlier, however, validity for estimating β_{2} is lost owing to X_{2}'s direct association with sensitivity and specificity. In both cases shown in Table 5, naive analysis based on Y* for case-control status yielded severe bias.

## DISCUSSION

We have considered the problem of outcome misclassification in logistic regression, with emphasis on clearly specifying likelihood functions corresponding to main/external and main/internal validation study designs. This emphasis distinguishes our work from related prior references in the epidemiologic literature,^{18,21,22} which do not pursue the incorporation of validation data. Although validation data-based maximum likelihood methods are outlined in the comprehensive text of Carroll et al,^{11} the treatment there is purposefully general and is presented without a real-data example or facilitating computations. With the practicing epidemiologist in mind, we have sought to motivate such methods for handling outcome misclassification with a real-world study, and to make them fully accessible via user-friendly programs that directly reflect the likelihood specifications and use common software for optimization.^{31}

Our treatment includes detailed evaluation of maximum likelihood methodology via simulation studies. These simulations, along with the HIV study example, clearly demonstrate the importance of internal validation subsampling when misclassification is differential. The results in Tables 2 and 4 illustrate that outcome misclassification adjustment via an erroneous assumption of nondifferentiality may offer only marginal improvement over “naive” analysis based on the error-prone outcome (Y*).

We have demonstrated how the methods considered here can be used in the case-control setting, for which little discussion about outcome misclassification in logistic regression appears in the literature and for which prior proposals^{18} were not applicable. Assuming appropriate model specifications, we find that the maximum likelihood approach for the main/internal validation design illustrated here remains directly applicable in case-control studies with random “case” (ie, based on Y*) oversampling under nondifferential outcome misclassification. While further investigation of the impact of outcome-dependent sampling is warranted when misclassification is differential with respect to covariates, empirical studies suggest that the maximum likelihood approach maintains validity for estimating primary regression parameters associated with predictor variables that are not associated with sensitivity and specificity values.

Future work could involve extensions of past research on cost-efficiency^{39} to the logistic regression setting considered here, because the ultimate appeal of main/internal validation study designs is their potential for conserving resources. Somewhat along these lines, we experimented with further simulations under the same conditions as were assumed in producing Table 4, but varying the size of the internal validation subsample. We found that decreasing the validation sampling fraction to select as few as 5% of the 1000 subjects very seldom produced numerical problems with the maximum likelihood routine, despite expected increases in variability of the adjusted log odds ratio estimates. From a practical standpoint, the simulation program used to produce the results in Table 4 is a sharable resource that could aid an investigator in determining the validation fraction indicated for a particular study, and provide insight into the cost-efficiency of a main/internal validation design.

There may also be interest in regression-based methods to adjust for outcome misclassification in situations where no gold standard exists, but one has access to replicates of an error-prone outcome measure or to a diagnostic measure viewable as an “alloyed” gold standard.^{40,41} Additionally, there would be value in making methods^{29} that nonparametrically estimate the distribution of Y* ∥ (Y, X) more readily accessible in practice. Nevertheless, the logistic regression approach advocated in Eqs. (9) and (10) facilitates likelihood ratio testing and is potentially robust when all predictors in Eq. (9) are categorical, given the freedom to saturate that model.

## ACKNOWLEDGMENTS

The following is a list of the HIV Epidemiology Research Study Group: Robert S. Klein, Ellie Schoenbaum, Julia Arnsten, Robert D. Burk, Chee Jen Chang, Penelope Demas, and Andrea Howard, from Montefiore Medical Center and the Albert Einstein College of Medicine; Paula Schuman, and Jack Sobel, from the Wayne State University School of Medicine; Anne Rompalo, David Vlahov, and David Celentano, from the Johns Hopkins University School of Medicine; Charles Carpenter, and Kenneth Mayer, from the Brown University School of Medicine; Ann Duerr, Lytt I. Gardner, Charles M. Heilig, Scott Holmberg, Denise Jamieson, Jan Moore, Ruby Phelps, Dawn Smith, and Dora Warren, from the Centers for Disease Control and Prevention; and Katherine Davenny, from the National Institute of Drug Abuse.

## REFERENCES

1.Thomas D, Stram D, Dwyer J. Exposure measurement error: influence on exposure-disease relationships and methods of correction.

*Ann Rev Public Health.* 1993;14:69–93.

2.Bross IDJ. Misclassification in 2 × 2 tables.

*Biometrics.* 1954;10:478–486.

3.Barron BA. The effects of misclassification on the estimation of relative risk.

*Biometrics.* 1977;33:414–418.

4.Kleinbaum D, Kupper L, Morgenstern H.

*Epidemiologic Research: Principles and Quantitative Methods.* Belmont, CA: Lifetime Learning; 1982.

5.Greenland S, Kleinbaum DG. Correcting for misclassification in two-way tables and matched-pair studies.

*Int J Epidemiol.* 1983;12:93–97.

6.Greenland S. Variance estimation for epidemiologic effect estimates under misclassification.

*Stat Med.* 1988;7:745–757.

7.Marshall RJ. Validation study methods for estimating proportions and odds ratios with misclassified data.

*J Clin Epidemiol.* 1990;43:941–947.

8.Morrissey MJ, Spiegelman D. Matrix methods for estimating odds ratios with misclassified exposure data: extensions and comparisons.

*Biometrics.* 1999;55:338–344.

9.Lyles RH. A note on estimating crude odds ratios in case-control studies with differentially misclassified exposure.

*Biometrics.* 2002;58:1034–1037.

10.Greenland S. Maximum-likelihood and closed-form estimators of epidemiologic measures under misclassification.

*J Stat Plan Inf.* 2008;138:528–538.

11.Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM.

*Measurement Error in Nonlinear Models, Second Edition.* London: Chapman and Hall; 2006.

12.Lyles RH, Zhang F, Drews-Botsch C. Combining internal and external validation data to correct for exposure misclassification: a case study.

*Epidemiology.* 2007;18:321–328.

13.Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for measurement error: the case of multiple covariates measured with error.

*Am J Epidemiol.* 1990;132:734–745.

14.Rosner B, Spiegelman D, Willett WC. Correction of logistic regression relative risk estimates and confidence intervals for random within-subject measurement error.

*Am J Epidemiol.* 1992;136:1400–1413.

15.Spiegelman D, Casella M. Fully parametric and semi-parametric regression models for common events with covariate measurement error in main study/validation study designs.

*Biometrics.* 1997;53:395–400.

16.Spiegelman D, Carroll RJ, Kipnis V. Efficient regression calibration for logistic regression in main study/internal validation study designs with an imperfect reference instrument.

*Stat Med.* 2001;20:139–160.

17.Thurston SW, Williams PL, Hauser R, Hu H, Hernandez-Avila M, Spiegelman D. A comparison of regression calibration approaches for designs with internal validation data.

*J Stat Plan Inf.* 2003;131:175–190.

18.Magder LS, Hughes JP. Logistic regression when the outcome is measured with uncertainty.

*Am J Epidemiol.* 1997;146:195–203.

19.Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm.

*J Roy Stat Soc B.* 1977;39:1–38.

20.Neuhaus JM. Bias and efficiency loss due to misclassified responses in logistic regression.

*Biometrika.* 1999;86:843–855.

21.Lash TL, Fink AK. Semi-automated sensitivity analysis to assess systematic errors in observational data.

*Epidemiology.* 2003;14:451–458.

22.Fox MP, Lash TL, Greenland S. A method to automate probabilistic sensitivity analyses of misclassified binary variables.

*Int J Epidemiol.* 2005;34:1370–1376.

23.Stamey JD, Young DA, Seaman JW. A Bayesian approach to adjust for diagnostic misclassification between two mortality causes in Poisson regression.

*Stat Med.* 2008;27:2440–2452.

24.Meier AS, Richardson BA, Hughes JP. Discrete proportional hazards models for mismeasured outcomes.

*Biometrics.* 2003;59:947–954.

25.Meier AS. Incorporating validation subsets into discrete proportional hazards models for mismeasured outcomes.

*Stat Med.* 2008;27:5456–5470.

26.Paulino CD, Soares P, Neuhaus J. Binomial regression with misclassification.

*Biometrics.* 2003;59:670–675.

27.McInturff P, Johnson WO, Cowling D, Gardner IA. Modelling risk when binary outcomes are subject to error.

*Stat Med.* 2004;23:1095–1109.

28.Gerlach R, Stamey J. Bayesian model selection for logistic regression with misclassified outcomes.

*Stat Modelling.* 2007;7:255–273.

29.Pepe MS. Inference using surrogate outcome data and a validation sample.

*Biometrika.* 1992;79:355–365.

30.Cheng KF, Hsueh HM. Correcting bias due to misclassification in the estimation of logistic regression models.

*Stat Prob Letters.* 1999;44:229–240.

31.SAS Institute, Inc.

*SAS/STAT 9.1 User's Guide.* Cary, NC: SAS Institute, Inc; 2004.

32.Prentice RL, Pyke R. Logistic disease incidence models and case-control studies.

*Biometrika.* 1979;66:403–411.

33.Carroll RJ, Wang S, Wang CY. Prospective analysis of logistic case-control studies.

*J Am Stat Assoc.* 1995;90:157–169.

34.Smith DK, Warren DL, Vlahov D, et al. Design and baseline participant characteristics of the Human Immunodeficiency Virus Epidemiology Research (HER) Study: A prospective cohort study of human immunodeficiency virus infection in US women.

*Am J Epidemiol.* 1997;146:459–469.

35.Amsel R, Totten PA, Spiegel CA, Chen KC, Eschenbach D, Holmes KK. Nonspecific vaginitis: diagnostic criteria and microbial and epidemiologic associations.

*Am J Med.* 1983;74:14–22.

36.Nugent RP, Krohn MA, Hillier SL. Reliability of diagnosing bacterial vaginosis is improved by a standardized method of gram stain interpretation.

*J Clin Microbiol.* 1991;29:297–301.

37.Jamieson DJ, Duerr A, Klein RS, et al. Longitudinal analysis of bacterial vaginosis: findings from the HIV epidemiology research study.

*Obstet Gynecol.* 2001;98:656–663.

38.Gallo MF, Jamieson DJ, Cu-Uvin S, Rompalo A, Klein RS, Sobel JD. Accuracy of clinical diagnosis of bacterial vaginosis by human immunodeficiency virus infection status.

*Sex Transm Dis.* 2010 Oct 29 [Epub ahead of print].

39.Spiegelman D, Gray R. Cost-efficient study designs for binary response data with Gaussian covariate measurement error.

*Biometrics.* 1991;47:851–869.

40.Wacholder S, Armstrong B, Hartge P. Validation studies using an alloyed gold standard.

*Am J Epidemiol.* 1993;137:1251–1258.

41.Brenner H. Correcting for exposure misclassification using an alloyed gold standard.

*Epidemiology.* 1996;7:406–410.

## APPENDIX 1: SAS PROC NLMIXED Code for Misclassification-adjusted Analyses

### Known Misclassification Rates (for Sensitivity Analysis)

The following SAS code implements ML analysis for model (1), assuming nondifferential outcome misclassification with known (SE, SP) and a single continuous covariate X. Observed data records consist only of (y*, x) pairs; these variables are ‘ystar’ and ‘x’ in the input data set. The likelihood to be maximized is given in Eq. (6).

```sas
PROC NLMIXED cov;
  parms b0 = 0 b1 = 0;   *starting values;
  SE = .8; SP = .8;      *assumed values;
  tau = b0 + b1*x;
  py1gx = exp(tau)/(1 + exp(tau));
  py0gx = 1 - py1gx;
  term1mn = ((1 - SP)*py0gx + SE*py1gx)**ystar;
  term2mn = (SP*py0gx + (1 - SE)*py1gx)**(1 - ystar);
  like = term1mn*term2mn;
  loglik = log(like);
  model ystar ~ general(loglik);
run;
```

For sensitivity analysis, one varies the (SE, SP) inputs to assess the impact upon the MLE of the parameter(s) of interest (β_{1} here), and upon standard errors.
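As an informal check (not part of the article), the same known-(SE, SP) likelihood can be sketched in Python with `scipy`. The simulated data, true parameter values, and the `negloglik` function below are illustrative assumptions that mirror the NLMIXED program's variable names.

```python
# Sketch: ML for logistic regression with a misclassified outcome,
# known sensitivity/specificity. Data are simulated for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(1)
n, b0_true, b1_true, SE, SP = 2000, -1.0, 0.7, 0.8, 0.8
x = rng.normal(size=n)
y = rng.binomial(1, expit(b0_true + b1_true * x))      # true (unobserved) outcome
ystar = np.where(y == 1, rng.binomial(1, SE, n),       # observed, error-prone outcome
                 rng.binomial(1, 1 - SP, n))

def negloglik(beta):
    py1gx = expit(beta[0] + beta[1] * x)               # P(Y=1 | x)
    p1 = (1 - SP) * (1 - py1gx) + SE * py1gx           # P(Y*=1 | x), as in Eq. (6)
    return -np.sum(ystar * np.log(p1) + (1 - ystar) * np.log(1 - p1))

fit = minimize(negloglik, x0=[0.0, 0.0], method="BFGS")
b0_hat, b1_hat = fit.x                                 # misclassification-adjusted MLEs
```

Re-running this with different assumed (SE, SP) values reproduces the sensitivity analysis described above.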

### Main/External Validation Study

The following code implements ML analysis for a main/external validation scenario, assuming nondifferential outcome misclassification and a single covariate (X). The input data set consists of n_{m} main study and n_{v} external validation study observations stacked together, and contains the variables ‘ystar’ and ‘x’ (corresponding to Y* and X in the main study) and ‘ystarv’ and ‘yv’ (corresponding to Y* and Y in the validation study). The variable ‘extval’ is a (0,1) indicator for whether an observation comes from the validation sample. The likelihood maximized is *L* = *L*_{m} × *L*_{v}, with *L*_{m} given in Eq. (6) and *L*_{v} given in Appendix 2.

```sas
PROC NLMIXED cov;
  parms b0 = 0 b1 = 0 SE = .8 SP = .8 py = .5;   *starting values;
  tau = b0 + b1*x;
  py1gx = exp(tau)/(1 + exp(tau));
  py0gx = 1 - py1gx;
  term1mn = ((1 - SP)*py0gx + SE*py1gx)**ystar;
  term2mn = (SP*py0gx + (1 - SE)*py1gx)**(1 - ystar);
  term1v = (SE*py)**(ystarv*yv);
  term2v = ((1 - SP)*(1 - py))**(ystarv*(1 - yv));
  term3v = ((1 - SE)*py)**((1 - ystarv)*yv);
  term4v = (SP*(1 - py))**((1 - ystarv)*(1 - yv));
  like = ((term1mn*term2mn)**(1 - extval))*
         ((term1v*term2v*term3v*term4v)**extval);
  loglik = log(like);
  model ystar ~ general(loglik);
run;
```
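A parallel Python sketch (not from the article) of the joint likelihood *L* = *L*_{m} × *L*_{v} under simulated data follows. Here SE, SP, and p_{y} are estimated on the logit scale to keep them in (0,1), a reparameterization the SAS code does not use; all data-generating values are illustrative assumptions.

```python
# Sketch: joint ML for main study + external validation data,
# nondifferential misclassification. Data are simulated for illustration.
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(3)
# main study: (ystar, x) only
nm, b0_t, b1_t, SE_t, SP_t = 2000, -1.0, 0.8, 0.85, 0.9
x = rng.normal(size=nm)
y_m = rng.binomial(1, expit(b0_t + b1_t * x))
ystar = np.where(y_m == 1, rng.binomial(1, SE_t, nm), rng.binomial(1, 1 - SP_t, nm))
# external validation: (ystarv, yv) pairs only, no covariates
nv, py_t = 500, 0.4
yv = rng.binomial(1, py_t, nv)
ystarv = np.where(yv == 1, rng.binomial(1, SE_t, nv), rng.binomial(1, 1 - SP_t, nv))

def negloglik(theta):
    b0, b1 = theta[:2]
    SE, SP, py = expit(theta[2]), expit(theta[3]), expit(theta[4])
    p1 = expit(b0 + b1 * x)                          # P(Y=1 | x)
    pstar = (1 - SP) * (1 - p1) + SE * p1            # main rows: P(Y*=1 | x)
    ll_m = np.sum(ystar * np.log(pstar) + (1 - ystar) * np.log(1 - pstar))
    pygy = np.where(yv == 1, SE, 1 - SP)             # validation rows: P(Y*=1 | Y)
    ll_v = np.sum(ystarv * np.log(pygy) + (1 - ystarv) * np.log(1 - pygy)
                  + yv * np.log(py) + (1 - yv) * np.log(1 - py))
    return -(ll_m + ll_v)

fit = minimize(negloglik, x0=np.zeros(5) + 0.5, method="BFGS")
```

The validation pairs identify (SE, SP, p_{y}), which in turn de-attenuate the main-study regression coefficients.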

### Main/Internal Validation Study

The following code implements ML analysis for a main/internal validation scenario, assuming differential outcome misclassification. This code was used for the HERS data analysis in Tables 2 and 3, and for the simulation study in Table 4. The input data set consists of n_{m} main study and n_{v} internal validation study observations, and contains the variables ‘ystar’ and ‘x1’–‘x4’ (corresponding to Y* and X_{1}–X_{4}), as well as ‘yv’ and ‘x4gtmed’ (corresponding to Y and an indicator for whether X_{4} [age in the example] exceeds its median). The variable ‘intval’ is a (0,1) indicator for whether an observation comes from the validation sample. The likelihood maximized is *L* = *L*_{m} × *L*_{v}, with *L*_{m} and *L*_{v} given in Eqs. (11) and (12). Also see Eqs. (9) and (10) for the definitions of ‘eta1’ and ‘eta0’.

```sas
PROC NLMIXED cov;
  parms b0 = 1 b1 = 1 b2 = .5 b3 = .25 b4 = 0 thet0 = -2.5
        thet1 = 2 thet2 = 1 thet3 = -1 thet4 = -.2;   *starting values;
  tau = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4;
  py1gx = exp(tau)/(1 + exp(tau));
  py0gx = 1 - py1gx;
  eta0 = thet0 + thet2*x2 + thet3*x3 + thet4*x4gtmed;
  eta1 = thet0 + thet1 + thet2*x2 + thet3*x3 + thet4*x4gtmed;
  SEx = exp(eta1)/(1 + exp(eta1));
  SPx = 1 - exp(eta0)/(1 + exp(eta0));
  term1mn = ((1 - SPx)*py0gx + SEx*py1gx)**ystar;
  term2mn = (SPx*py0gx + (1 - SEx)*py1gx)**(1 - ystar);
  term1v = (SEx*py1gx)**(ystar*yv);
  term2v = ((1 - SPx)*py0gx)**(ystar*(1 - yv));
  term3v = ((1 - SEx)*py1gx)**((1 - ystar)*yv);
  term4v = (SPx*py0gx)**((1 - ystar)*(1 - yv));
  like = ((term1mn*term2mn)**(1 - intval))*
         ((term1v*term2v*term3v*term4v)**intval);
  loglik = log(like);
  model ystar ~ general(loglik);
run;
```
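For intuition about how internal validation rows identify the misclassification parameters, the following Python sketch (not from the article) fits a simplified version of the main/internal likelihood, with constant rather than covariate-dependent SE and SP; the simulated data and all names are illustrative assumptions.

```python
# Sketch: main/internal validation ML with unknown, constant SE and SP.
# Validation rows contribute the joint P(Y*, Y | x); main rows the marginal P(Y* | x).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(7)
n, b0_t, b1_t, SE_t, SP_t = 3000, -0.5, 1.0, 0.85, 0.9
x = rng.normal(size=n)
y = rng.binomial(1, expit(b0_t + b1_t * x))            # gold-standard outcome
ystar = np.where(y == 1, rng.binomial(1, SE_t, n), rng.binomial(1, 1 - SP_t, n))
intval = rng.random(n) < 0.3                           # 30% internal validation subsample

def negloglik(theta):
    b0, b1 = theta[:2]
    SE, SP = expit(theta[2]), expit(theta[3])          # logit scale keeps them in (0,1)
    p1 = expit(b0 + b1 * x)                            # P(Y=1 | x)
    pstar = (1 - SP) * (1 - p1) + SE * p1              # main rows: P(Y*=1 | x)
    ll_main = ystar * np.log(pstar) + (1 - ystar) * np.log(1 - pstar)
    pygy = np.where(y == 1, SE, 1 - SP)                # P(Y*=1 | Y); y observed only
    ll_val = (ystar * np.log(pygy) + (1 - ystar) * np.log(1 - pygy)   # on validation rows
              + y * np.log(p1) + (1 - y) * np.log(1 - p1))
    return -np.sum(np.where(intval, ll_val, ll_main))

fit = minimize(negloglik, x0=[0, 0, 1.5, 1.5], method="BFGS")
b0_hat, b1_hat = fit.x[:2]
```

Allowing SE and SP to depend on covariates, as in the SAS program above, simply replaces the two scalar misclassification parameters with logistic models in x.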

## APPENDIX 2: Details for ML Analysis of Main Study + External Validation Data

Assume there are n_{m} experimental units in the main study, each providing a record of the form (y_{i}*, x_{i1}, …, x_{iP}). Each such unit makes the likelihood contribution in Eq. (4), slightly altered to reflect the assumption of nondifferential misclassification:

*L*_{mi} = [(1 − SP)Pr(Y_{i} = 0 | x_{i}) + SE Pr(Y_{i} = 1 | x_{i})]^{y_{i}*} [SP Pr(Y_{i} = 0 | x_{i}) + (1 − SE)Pr(Y_{i} = 1 | x_{i})]^{1 − y_{i}*}

Assume the external validation study provides a 2 × 2 table containing n_{v} pairs (y_{j}*, y_{j}), j = 1, …, n_{v}. Each pair makes the following contribution:

*L*_{vj} = [SE^{y_{j}*}(1 − SE)^{1 − y_{j}*}]^{y_{j}} [(1 − SP)^{y_{j}*}SP^{1 − y_{j}*}]^{1 − y_{j}} × p_{y}^{y_{j}}(1 − p_{y})^{1 − y_{j}}

The first term in the preceding expression is dictated by the values of sensitivity and specificity, while the second represents a nuisance parameter (p_{y}), reflecting prevalence of the true outcome (Y) under the validation study sampling conditions. The full likelihood is proportional to *L* = *L*_{m} × *L*_{v}, with

*L*_{m} = ∏_{i = 1}^{n_{m}} *L*_{mi}

as in Eq. (6) and

*L*_{v} = ∏_{j = 1}^{n_{v}} *L*_{vj}.

See the second subsection in Appendix 1 for SAS NLMIXED code designed to maximize the likelihood for the main/external case.