Diagnostic tests are used to determine the presence of a particular disease in patients exhibiting symptoms or risk factors for that disease. They differ from screening tests, which are performed before diagnosis in patients who are asymptomatic. Most often, a diagnostic test is compared with a "gold standard," an existing diagnostic tool considered comparable with a "true" measure of the physiologic, biochemical, or pathological state the test is evaluating. Examples of a gold standard include positive histology for cancer on biopsy or an autopsy confirming the presence of a disease. A newly developed diagnostic test must also be validated to be useful, and it is important for neuro-ophthalmologists to be familiar with the process of building a model to validate a new diagnostic test. Sensitivity, specificity, and their relatives (Youden's index, positive predictive value, and negative predictive value) describe the validity of a diagnostic test. Receiver operating characteristic (ROC) curve analysis can be used to determine how the performance of a diagnostic model changes as the discrimination threshold is varied. Logistic regression can be applied to develop diagnostic models that relate predictors to the disease. This article is presented as a companion to Selby et al, "Temporal Artery Biopsy in the Workup of Giant Cell Arteritis: Diagnostic Considerations in a Veterans Administration Cohort" (1), in this issue of JNO and presents an overview of methods used in developing and assessing diagnostic models.
SENSITIVITY, SPECIFICITY, AND RELATIVES
A diagnostic test result is commonly represented by a dichotomous variable (positive or negative) or by a continuous variable dichotomized at a cut-point threshold, and the validity of a diagnostic test can be measured with sensitivity, specificity, and their relatives. A 2 by 2 contingency table can be plotted to show the diagnostic test results in comparison with the gold standard results (Table 1). True positive (TP) and true negative (TN) results occur when the diagnostic test accurately predicts results aligning with the gold standard (cells "a" and "d" in the table, respectively). False positives (FP) and false negatives (FN) occur when the diagnostic test results do not correspond to the gold standard results (cells "b" and "c" in the table, respectively). Note that "a + c" are the "true" positives from the gold standard, "b + d" are the "true" negatives from the gold standard, "a + b" are the positives from the diagnostic test, and "c + d" are the negatives from the diagnostic test. Selby et al used a gold standard of temporal artery biopsy (TAB) to detect giant cell arteritis (GCA). Because TAB is a nontreatment procedure that is time-consuming and expensive, the goal of Selby et al's analysis was to find better measures with diagnostic utility. They assessed the sensitivity and specificity of various clinical criteria in a population of US military veterans.
TABLE 1. Two by two contingency table of gold standard and diagnostic test results

                          Gold standard/reference "true" value
Diagnostic test result    Positive    Negative    Total
Positive                  a (TP)      b (FP)      a + b
Negative                  c (FN)      d (TN)      c + d
Total                     a + c       b + d       a + b + c + d

FN, false negative; FP, false positive; TN, true negative; TP, true positive.
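As a brief illustration, the tallying of cells a through d from paired diagnostic test and gold standard results can be sketched in Python (the function name and data below are hypothetical, not taken from Selby et al):

```python
# Tally a 2x2 contingency table from paired (test, gold standard) results,
# where 1 = positive and 0 = negative.
def contingency_table(test_results, gold_results):
    a = b = c = d = 0
    for t, g in zip(test_results, gold_results):
        if t == 1 and g == 1:
            a += 1  # true positive
        elif t == 1 and g == 0:
            b += 1  # false positive
        elif t == 0 and g == 1:
            c += 1  # false negative
        else:
            d += 1  # true negative
    return a, b, c, d

# Six hypothetical paired results.
test = [1, 1, 0, 0, 1, 0]
gold = [1, 0, 0, 1, 1, 0]
a, b, c, d = contingency_table(test, gold)  # (2, 1, 1, 2)
```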
Sensitivity (Se) is a measure of the true positive results detected by the diagnostic test being modeled in comparison with the gold standard or reference "true" value. It is the proportion of true positive results detected by the diagnostic test divided by the actual number of "true" positives determined by the gold standard, or, put another way, the probability of having a positive test if the disease is present. The formula is

Se = TP/(TP + FN) = a/(a + c)
Specificity (Sp) provides information about the ability of a diagnostic test to detect true negatives. In diagnostic models, this is sometimes considered more important than sensitivity because of the potential impact of a false positive result on a patient (2). Specificity is calculated as the proportion of true negative results detected by the diagnostic test divided by the number of "true" negatives from the gold standard, and estimates the probability of having a negative test if the disease is not present. The formula is

Sp = TN/(TN + FP) = d/(b + d)
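These two definitions translate directly into code. A minimal Python sketch, using hypothetical cell counts rather than data from Selby et al:

```python
# Sensitivity and specificity from 2x2 cell counts
# (a = TP, b = FP, c = FN, d = TN).
def sensitivity(a, c):
    return a / (a + c)  # TP / (TP + FN)

def specificity(b, d):
    return d / (b + d)  # TN / (TN + FP)

# Hypothetical counts: 32 TP, 148 FP, 8 FN, 104 TN.
se = sensitivity(32, 8)     # 0.80
sp = specificity(148, 104)  # about 0.413
```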
Selby et al calculated the sensitivity and specificity for 7 clinical criteria including new-onset headache, jaw claudication, vision changes, scalp tenderness, polymyalgia rheumatica, elevated erythrocyte sedimentation rate (ESR), and elevated C-reactive protein (CRP) as presented in Table 2 of their study. The first 5 potential diagnostic factors are dichotomous variables (either yes or no for their presence) whereas the last 2 (ESR and CRP) are continuous variables for which the authors chose a cut-point of positivity to dichotomize the variable to be included in the diagnostic model. The age-adjusted upper limit of normal (ULN) for the ESR was calculated using ESR = age/2 for men and ESR = (age + 10)/2 for women, and the ULN value of 11 mg/L for CRP from their local Portland VA laboratory was used to define elevated CRP.
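The dichotomization rules just described can be sketched as simple predicate functions. This is an illustrative Python sketch only; the sex coding and function names are assumptions, and the formulas and the 11 mg/L CRP limit are taken from the text above:

```python
# Age-adjusted upper limit of normal (ULN) for ESR:
# ESR ULN = age/2 for men, (age + 10)/2 for women (mm/h).
def esr_elevated(esr, age, sex):
    uln = age / 2 if sex == "M" else (age + 10) / 2
    return esr > uln

# CRP above the laboratory ULN of 11 mg/L counts as elevated.
def crp_elevated(crp, uln=11.0):
    return crp > uln

# A 70-year-old man with ESR 40 mm/h exceeds his ULN of 35 mm/h.
flag = esr_elevated(40, 70, "M")  # True
```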
TABLE 2. Sensitivity and specificity example using hypothetical data

                Positive TAB    Negative TAB    Total
Elevated ESR    32 (TP)         148 (FP)        180
Normal ESR      8 (FN)          104 (TN)        112
Total           40              252             292

ESR, erythrocyte sedimentation rate; TAB, temporal artery biopsy.
Calculation of sensitivity and specificity can best be illustrated using an example of a classic 2 by 2 contingency table. The actual numbers used in the calculations were not presented in Selby et al's study; however, a hypothetical model can be created (Table 2). For the clinical criterion of elevated ESR, 292 patients were included in the study, and a sensitivity of 80.0% and specificity of 41.1% were calculated. Considering that there were 40 cases with positive TAB and 252 with negative TAB, sensitivity = 32/40 = 80.0% and specificity = 104/252 = 41.3%. The high sensitivity indicates good ability of ESR to detect true positive cases of GCA, whereas the poor specificity suggests poor ability to distinguish true negatives from false positives. This correlates with the authors' suggestion that elevated ESR can be found in many other diseases and, although common in GCA, is not specific to it. Elevated CRP also had good sensitivity but poor specificity. Conversely, jaw claudication, scalp tenderness, and polymyalgia rheumatica had high specificity and low sensitivity. These criteria predict true negatives fairly accurately but miss many true positive cases: if patients do not have GCA, they probably will not have jaw claudication, but the absence of jaw claudication does not rule out GCA.
Sensitivity and specificity are inversely related; as the cut-point is varied, an increase in sensitivity comes at the cost of a decrease in specificity and vice versa. Had different cut-points for elevated ESR or CRP been chosen, the values for sensitivity and specificity would trade off against each other. Selby et al calculated a sensitivity of 50.00% and specificity of 88.36% for their overall model and mentioned in their analysis that they could adjust the cut-point for positivity on the diagnostic test to improve sensitivity, but this would decrease the specificity. Note that sensitivity is related only to the "true" positive participants "a + c," and specificity only to the "true" negative participants "b + d," and sometimes we need to consider both aspects simultaneously.
Youden's index J is defined as the difference between the true positive rate (sensitivity) and the false positive rate (1 − specificity), ranging from 0 through 1, with 0 indicating that the diagnostic test has the same positive rate for participants with and without the disease, and 1 indicating that there are no false positives or false negatives. The formula is

J = sensitivity − (1 − specificity) = sensitivity + specificity − 1
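In code, Youden's index is a one-line combination of sensitivity and specificity. A Python sketch, with hypothetical values echoing the elevated-ESR example:

```python
# Youden's index: J = sensitivity + specificity - 1 = TPF - FPF.
def youden_j(se, sp):
    return se + sp - 1

# Hypothetical values (Se = 0.80, Sp = 0.413) give J of roughly 0.213.
j = youden_j(0.80, 0.413)
```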
Positive Predictive Value and Negative Predictive Value
Positive predictive value (PPV) is defined as the probability that a participant indeed has the disease given a positive diagnostic test result, and negative predictive value (NPV) is the probability of no disease given a negative test result. Predictive values may provide results that apply directly to decisions about test usefulness; however, they depend on the disease prevalence "P." When the prevalence among the diagnostic test participants equals the disease prevalence in the population (i.e., P = (a + c)/(a + b + c + d)), the PPV is a/(a + b) and the NPV is d/(c + d). Otherwise, a higher disease prevalence leads to an elevated PPV and a decreased NPV (3,4). PPV and NPV can be estimated using Bayes' theorem from the sensitivity and specificity if the prevalence "P" is known (3), and the formulas are

PPV = (Se × P)/[Se × P + (1 − Sp) × (1 − P)]
NPV = [Sp × (1 − P)]/[Sp × (1 − P) + (1 − Se) × P]
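The Bayes' theorem estimates of PPV and NPV can be sketched directly in Python; the numbers in the comment are illustrative, not study data:

```python
# PPV and NPV from sensitivity (se), specificity (sp), and prevalence (p)
# via Bayes' theorem. All inputs are proportions in (0, 1).
def ppv(se, sp, p):
    return (se * p) / (se * p + (1 - sp) * (1 - p))

def npv(se, sp, p):
    return (sp * (1 - p)) / (sp * (1 - p) + (1 - se) * p)

# With se = sp = 0.9 and a prevalence of 50%, PPV = NPV = 0.9;
# a rarer disease lowers PPV and raises NPV.
```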
Besides the principal statistics above and their relatives, the validity of a diagnostic test can be affected by disease severity and by the frequency of the condition (4). In addition, potential biases such as detection bias should be considered when designing a study to evaluate diagnostic criteria. It may be prudent to perform multiple studies incorporating a broad spectrum of cases and controls to assess the validity of a diagnostic measure.
RECEIVER OPERATING CHARACTERISTIC CURVE ANALYSIS
When the binary diagnostic test result is determined by dichotomizing a continuous test score (or a combination of values from several discrete variables) at a cutoff point, sensitivity and specificity can be calculated across a spectrum of threshold values for results reported on a continuous scale, or at each level of an ordinal scale (e.g., normal, borderline, and abnormal). In this case, ROC curve analysis can be used to determine the accuracy of a diagnostic model and to determine optimal cut-points to use as thresholds for a diagnostic test. The true positive fraction (TPF = sensitivity) is plotted against the false positive fraction (FPF = 1 − specificity) to make an ROC curve. Usually, the straight diagonal line connecting (0, 0) and (1, 1) is added as a reference line representing random chance.
Area Under the Curve of Receiver Operating Characteristic
The area under the curve (AUC) is a measure of the ability of the diagnostic test to accurately predict true positives and true negatives (3,5). An AUC of 1 indicates a perfect diagnostic test that can exactly differentiate between diseased and nondiseased patients, whereas an AUC of 0 would indicate no ability to correctly classify patients as diseased vs nondiseased. An AUC of 0.5 corresponds to the diagonal reference line, indicating no ability greater than chance to correctly identify disease. Generally, an AUC of 0.7 or better is considered good, with an AUC between 0.8 and 0.9 considered excellent and greater than 0.9 outstanding (5). The confidence interval (CI) for the AUC should be calculated to provide statistical inference for the estimated AUC, and AUCs of different diagnostic models can be compared using DeLong's nonparametric approach, which analyzes the difference between curves while accounting for the correlated nature of the data with paired comparisons (6). Selby et al reported an AUC of 0.7440 showing the accuracy of multivariable logistic regression in predicting a positive temporal artery biopsy. This value indicates good performance of the model in distinguishing between those with and without GCA based on the clinical criteria included in the model.
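The AUC can equivalently be computed as the probability that a randomly chosen diseased case receives a higher test score than a randomly chosen nondiseased case (ties counting one half). A small Python sketch of this pairwise interpretation, using hypothetical scores:

```python
# AUC as the probability that a random positive case outscores a random
# negative case (the Mann-Whitney interpretation of the ROC area).
def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Perfectly separated hypothetical scores give AUC = 1.0.
a = auc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])  # 1.0
```

This brute-force version is O(n²); production implementations use a rank-based computation, but the result is the same.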
Cut-Points of Receiver Operating Characteristic
As mentioned before, points on the ROC curve correspond to the various cut-points used to determine test positivity, and the distribution of positive and negative diagnostic test results will vary if the criterion or cut-point for positivity changes. Because of the inverse relationship between sensitivity and specificity, it is important to find a balance and determine the best threshold for positivity on test results. Several methods have been proposed for determining threshold values that optimize sensitivity and specificity from the ROC curve. One method is to use the cut-point corresponding to the intersection of the ROC curve with its 45° tangent line, which is equivalent to the point where sensitivity and specificity are closest together (7). Youden's index is another method and maximizes the difference between TPF (sensitivity) and FPF (1 − specificity) (3). A third method is to use the cut-point corresponding to the smallest sum of squares of 1 − sensitivity and 1 − specificity (8).
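As a sketch of the second method, the observed scores can be swept as candidate thresholds and the one maximizing Youden's index retained. This Python illustration uses hypothetical scores and a "score ≥ threshold is positive" convention, which is an assumption:

```python
# Sweep candidate cut-points and keep the one maximizing Youden's J.
def best_cutpoint(scores, labels):
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        j = tp / (tp + fn) + tn / (tn + fp) - 1  # sensitivity + specificity - 1
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# A hypothetical perfectly separable case: threshold 0.6 achieves J = 1.0.
t, j = best_cutpoint([0.2, 0.4, 0.6, 0.9], [0, 0, 1, 1])
```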
LOGISTIC REGRESSION
When multiple continuous or discrete independent variables are simultaneously considered in a diagnostic test, logistic regression is widely applied to build the diagnostic model. Logistic regression can be used in cohort and case–control, matched and unmatched, retrospective and prospective study designs, and is relatively easy to interpret. Potential confounders can also be accounted for in logistic regression by adding them as covariates to the model (9).
Odds and Odds Ratio
To understand the results from logistic regression, it is helpful to first know the concepts of odds and odds ratio. Odds and odds ratios can be calculated from a 2 by 2 contingency table (Table 3), where a dichotomous exposure (such as vision changes in Selby et al's study) and a binary outcome (a disease such as GCA in Selby et al's study) are plotted together. Suppose "p" represents the probability of disease and "q = 1 − p" the probability of no disease. The odds of disease are defined as the ratio of the probability of disease to the probability of no disease, p/q, which always lies in the range (0, ∞). Considering the probability of disease under dichotomous exposure, let p1 be the probability for the exposed and p0 the probability for the unexposed. The odds for the exposed are p1/(1 − p1), and the odds for the unexposed are p0/(1 − p0). The odds ratio (OR) for the dichotomous exposure is then defined as the odds for the exposed over the odds for the unexposed: OR = [p1/(1 − p1)]/[p0/(1 − p0)]. In the case–control study design, OR is the ratio of the odds for cases and controls and can be calculated either as the cross-product ratio ad/bc or as (a/c)/(b/d) in the 2 by 2 table (2). When the disease outcome is rare, the odds ratio can be used as an approximation of the relative risk for the disease based on exposure. Note that logistic regression is a special case of a generalized linear model (GLM) for a binary outcome and assumes a linear relationship between the independent variables and the logit of the outcome, where logit is defined as the log base e (ln) of the odds. In some cases, the explanatory variables or outcome need to be transformed so that the linearity assumption can be met.
TABLE 3. Two by two contingency table of exposure and disease to calculate odds ratio

              Disease    No disease    Total
Exposed       a          b             a + b
Unexposed     c          d             c + d
Total         a + c      b + d         a + b + c + d
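Under this notation, the odds ratio reduces to the cross-product ad/bc. A one-line Python check with hypothetical counts:

```python
# Odds ratio from a 2x2 exposure/disease table via the cross-product ratio:
# OR = (a/b) / (c/d) = (a*d) / (b*c).
def odds_ratio(a, b, c, d):
    return (a * d) / (b * c)

# Hypothetical counts: 20 exposed cases, 10 exposed controls,
# 5 unexposed cases, 15 unexposed controls.
orr = odds_ratio(20, 10, 5, 15)  # (20*15)/(10*5) = 6.0
```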
A simple logistic regression model has the form

ln[p(X)/(1 − p(X))] = β0 + β1X1 + β2X2 + … + βkXk

where Y is the binary outcome (such as positive TAB for GCA in Selby et al's study), p(X) is the probability of the outcome given the covariates X (such as the probability of a participant having a positive TAB given their covariates), β0 is the intercept, X1, …, Xk are categorical or continuous independent variables, and β1, …, βk are the corresponding regression coefficients. The coefficient βi for a binary exposure variable Xi is the natural logarithm of the OR, with exp(βi) being the OR. Selby et al calculated OR = 3.09 for scalp tenderness with 95% CI = 1.24–7.72, indicating that patients who reported scalp tenderness had 3.09 times higher odds of positive TAB than those who did not report this symptom. For continuous variables, a one-unit increase in Xi multiplies the odds by exp(βi), or equivalently increases the log odds of outcome Y by βi. A positive βi is associated with increasing log odds of Y as Xi increases, and a negative βi indicates decreasing log odds of Y as Xi increases (10). Selby et al calculated OR = 1.50 for log (CRP) with 95% CI = 1.06–2.13, indicating that a 1-unit increase in log (CRP) resulted in 1.50 times higher odds of positive TAB.
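The exponentiation step from coefficient to odds ratio, with a 95% Wald confidence interval, can be sketched in Python. The beta and standard error below are hypothetical values chosen for illustration, not fitted estimates from Selby et al:

```python
import math

# Convert a logistic regression coefficient (beta) and its standard error
# into an odds ratio with a 95% Wald confidence interval.
def or_with_ci(beta, se, z=1.96):
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# Hypothetical coefficient: beta = 1.128, SE = 0.467 gives an OR of about 3.09.
or_, lo, hi = or_with_ci(1.128, 0.467)
```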
In Selby et al's study, 10 independent variables were assessed for inclusion in the logistic regression model. Normality was assessed for the continuous variables, and 2 variables (CRP and platelets) were log transformed because their distributions were right skewed. The final logistic regression model included 5 variables (scalp tenderness, log [CRP], log [platelets], age, and vision changes), the first 3 of which were retained because they were statistically significant predictors of a positive TAB. Age was not a statistically significant predictor of positive TAB but was retained in the model because of its clinical significance in diagnosing GCA. Similarly, vision changes were not statistically significant but were kept in the model because of information regarding missing values for this variable in the data set.
Assessment of Model Fit
The predictive performance of a logistic regression model can be assessed using the Hosmer–Lemeshow test, pseudo R2, and misclassification rate. The rationale behind the Hosmer–Lemeshow test is to evaluate how the observed event rates match the expected event rates in the subgroups of the model population. The Hosmer–Lemeshow test statistic asymptotically follows a chi-square distribution with (g − 2) degrees of freedom, where g is the number of subgroups. A P-value below 0.05 on the Hosmer–Lemeshow test indicates a poor fit for the model. Selby et al used the Hosmer–Lemeshow test to show the goodness of fit of the logistic regression model (11) and reported a P-value of 0.9220 indicating that there was no evidence of a poor fit for the model.
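The grouping logic behind the Hosmer–Lemeshow statistic can be sketched as follows. This is a simplified Python illustration: real implementations typically use g = 10 groups and refer the statistic to a chi-square distribution with g − 2 degrees of freedom to obtain the P value, which is omitted here:

```python
# Hosmer-Lemeshow statistic: sort by predicted probability, split into g
# equal-sized groups, and compare observed vs expected counts of events
# and non-events in each group.
def hosmer_lemeshow_stat(probs, outcomes, g=2):
    pairs = sorted(zip(probs, outcomes))
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]
        obs = sum(y for _, y in group)   # observed events
        exp = sum(p for p, _ in group)   # expected events
        n_g = len(group)
        stat += (obs - exp) ** 2 / exp
        stat += ((n_g - obs) - (n_g - exp)) ** 2 / (n_g - exp)
    return stat

# Perfectly calibrated hypothetical predictions give a statistic of 0.
p = [0.25] * 4 + [0.75] * 4
y = [0, 0, 0, 1, 1, 1, 1, 0]
s = hosmer_lemeshow_stat(p, y, g=2)  # 0.0
```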
Although the concept is similar to the R2 in linear regression, pseudo R2 measures in logistic regression have a different interpretation and are not simply the proportion of variance explained by the model. Several pseudo R2 have been developed to evaluate the goodness-of-fit of logistic models (12). Selby et al reported a relatively low pseudo R2 value of 0.1319 for their final logistic regression model including 5 diagnostic predictors in positive TAB, but did not report what specific type of pseudo R2 they used.
The misclassification rate is a measure of how well the model predicts the "true" outcome and is calculated as the percentage of patients who are incorrectly classified by the diagnostic test compared with the gold standard. Usually a cut-point of 0.5 is used to dichotomize the predicted probability, and a lower misclassification rate indicates better predictive performance. From Table 1, it can be calculated as the sum of false positives (FPs) and false negatives (FNs) divided by the total number of tests performed, that is, (FP + FN)/(TP + FP + FN + TN) = (b + c)/(a + b + c + d). Selby et al reported a misclassification rate of 16.28% for the logistic model, which indicates good predictive performance of the final logistic model.
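As a final sketch, the misclassification rate with a 0.5 cut-point on predicted probabilities can be written in a few lines of Python (hypothetical predictions and outcomes):

```python
# Misclassification rate: share of cases whose dichotomized prediction
# (probability >= cut counts as positive) disagrees with the gold standard.
def misclassification_rate(probs, truths, cut=0.5):
    wrong = sum(1 for p, y in zip(probs, truths) if (p >= cut) != (y == 1))
    return wrong / len(truths)

# One of four hypothetical predictions is wrong: rate = 0.25.
rate = misclassification_rate([0.9, 0.6, 0.4, 0.2], [1, 0, 0, 0])
```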
In summary, diagnostic models are important in any disease model and have widespread applications in the diagnosis of neuro-ophthalmic diseases. It is important to consider the reference value or gold standard to which a new diagnostic test is being compared. The principal concepts of sensitivity, specificity, and their relatives can be calculated and understood by developing a 2 by 2 contingency table, with or without the disease prevalence. ROC curve analysis can be used to assess the ability of a model to correctly identify true positives and true negatives and to determine optimal thresholds of positivity for a diagnostic test. Logistic regression can be applied to determine the best diagnostic model and to estimate odds ratios of the disease based on the independent variables. It is important for readers to understand how diagnostic models are built, how to interpret their results, and how to assess the validity, accuracy, and fit of a diagnostic model.
1. Selby LD, Park-Egan BAM, Winges KM. Temporal artery biopsy in the workup of giant cell arteritis: diagnostic considerations in a veterans administration cohort. J Neuroophthalmol. 2020;40:450–456.
2. Gordis L. Epidemiology. 5th edition. Philadelphia, PA: Elsevier/Saunders, 2014.
3. Hajian-Tilaki K. Receiver operating characteristic (ROC) curve analysis for medical diagnostic test evaluation. Caspian J Intern Med. 2013;4:627–635.
4. Rothman KJ, Greenland S, Lash TL. Modern Epidemiology. 3rd edition. Philadelphia, PA: Wolters Kluwer Health/Lippincott Williams & Wilkins, 2008.
5. Mandrekar JN. Receiver operating characteristic curve in diagnostic test assessment. J Thorac Oncol. 2010;5:1315–1316.
6. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837–845.
7. Farrar JT, Young JP Jr, LaMoreaux L, Werth JL, Poole RM. Clinical importance of changes in chronic pain intensity measured on an 11-point numerical pain rating scale. Pain. 2001;94:149–158.
8. Froud R, Abel G. Using ROC curves to choose minimally important change thresholds when sensitivity and specificity are valued equally: the forgotten lesson of Pythagoras. Theoretical considerations and an example application of change in health status. PLoS One. 2014;9:e114468.
9. Peng CY, Lee KL, Ingersoll GM. An introduction to logistic regression analysis and reporting. J Educ Res. 2010;96:3–14.
10. James G, Witten D, Hastie T, Tibshirani R. An Introduction to Statistical Learning: With Applications in R. New York, NY: Springer, 2013.
11. Lemeshow S, Hosmer DW Jr. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92–106.
12. Smith T, McKenna C. A comparison of logistic regression pseudo R2 indices. Mult Linear Regression Viewpoints. 2013;39:17–26.