# On the Equivalence of Classic ROC Analysis and the Loss-function Model to Set Cut Points in Sequential Testing

##### Abstract

In an effort to reduce the cost of administration for objective structured clinical examinations (OSCEs), several authors have promoted the use of sequential testing in which all candidates take a short screening test and candidates who pass the screen are exempted from taking the full test. Traditionally, the determination of the optimally efficient cut point (passing score) for the screen has used ROC analysis to minimize false-positive and false-negative errors. Recently, Muijtjens et al. have questioned the appropriateness of the ROC method for these purposes and have promoted an alternative method that uses a “loss” formula. However, given certain theoretically derived conditions, it can be shown that the use of the loss formula is functionally identical to using ROC analysis, and the authors suggest that continued use of the ROC method is appropriate.

##### Author Information

**Dr. Regehr** is associate professor, Departments of Surgery and Psychiatry, University of Toronto Faculty of Medicine, Center for Research in Education, Toronto, Ontario, Canada. **Dr. Colliver** is professor and director, Division of Statistics and Research Consulting, Southern Illinois University School of Medicine, Springfield.

Correspondence and requests for reprints should be addressed to Dr. Regehr, University of Toronto Faculty of Medicine, Center for Research in Education, University Health Network, Eaton South 1-564, 200 Elizabeth Street, Toronto, ON M5G 2C4, Canada; e-mail: 〈g.regehr@utoronto.ca〉.

The objective structured clinical examination (OSCE), while highly successful as a tool for evaluating clinical skills, tends to be expensive and time-consuming.^{1} In an effort to reduce the cost and time required to generate reliable and valid scores on the OSCE, several authors have recommended the use of sequential testing.^{2,3,4,5,6,7,8} Sequential testing involves all candidates' taking a short screening test consisting of a subset of stations from the total examination. Based on scores from this screening test, predictions are made about each student's likely performance on the full test. Candidates who are identified as highly likely to pass the full test (“clear passes”) are passed on the basis of the screen alone and are exempted from the remainder of the examination. Candidates who are not identified as “clear passes” take the full examination and pass/fail decisions for these candidates are based on this more stable measure of the candidates' abilities.

Clearly, the utility of the screen depends on its ability to accurately identify a large proportion of the candidates who would pass the full test. However, when the correlation between the screen score and the total test score is less than perfect, there is a tradeoff between the accuracy of this prediction and the proportion of candidates who are exempted from the total examination. When the correlation between the screen and the full test is moderate, the test administrator must either set a high standard on the screen, forcing a large number of individuals who will pass to nonetheless take the full test, or accept that a substantial proportion of the candidates who would fail the full test will be inappropriately passed on the basis of the screen.

Traditionally, the receiver operating characteristic (ROC) curve has been used as the method for determining the optimal cut point on the screen.^{2,3,4,5,6,7} The ROC method is applied to a previous data set in which all candidates have taken both the screen and the full test. Using this type of data set, a series of cut points for the screen is selected that range from very strict (almost no candidates are passed on the basis of the screen alone) to very lenient (almost all candidates are passed on the basis of the screen alone). For each cut point, two types of errors are calculated: *the false-positive rate* (the proportion of candidates failing the full test who would have been passed by the screen) and *the false-negative rate* (the proportion of candidates passing the full test who would nonetheless have had to take the full examination). When the cut point selected for the screen is very strict, the false-positive rate is quite low but the false-negative rate rises sharply. When the cut point selected is very lenient, the false-negative rate is low but the false-positive rate rises sharply. A midrange cut point balances the two forms of error. Authors have suggested a variety of methods for balancing these two forms of error. Yet, until fairly recently, nobody questioned the underlying premise of the ROC method that these are the two forms of error that ought to be balanced.

Thus, the work of Muijtjens and his colleagues^{8} in which they introduced a new method for evaluating the optimum cut score for the screening test is interesting. In developing their rationale for introducing this new method, Muijtjens et al. argue that the appropriateness of classic ROC analysis is questionable. Specifically, they suggest that the traditional use of the false-negative rate as a measure of errors is inappropriate in the context of sequential testing. Their claim is based on the fact that all who fail the screen take the full test, so every candidate who should pass the test does. That is, nobody is inappropriately failed based only on his or her performance on the screen, and therefore the false-negative rate is functionally zero for the sequential testing procedure. Instead, they suggest, because all candidates who fail the screen must complete the entire test, it would be more appropriate to treat all individuals who fail the screen as an equal source of inefficiency in the screen regardless of whether they pass or fail the full test. The greater the number of individuals who must take the entire test, the less efficient the screen. This assertion, of course, is balanced against their recognition that the more efficient the test (the smaller the number of candidates who take the entire test), the higher the likelihood of false positives (people who are passed on the basis of the screen but who would have failed had they taken the entire test). Thus, whereas the classic ROC analysis attempts to minimize the overall error rate (false-negative rate + false-positive rate), Muijtjens et al.'s model attempts to minimize the combination of inefficiency and false positives (what they call the loss function).

We were intrigued by the theoretical argument Muijtjens et al. have made, and we wanted to examine the impact of their formulation on decisions that might be made relative to the traditional ROC model. When we compared their formulation based on the loss function with the ROC formulation, we were surprised to find that, when certain theoretically derived conditions are met, the loss function introduced by Muijtjens et al. leads to results that are functionally identical to those obtained by the classic ROC analysis. This commentary is an effort to share our findings and describe our understanding of the reasons for this equivalence. To do so, we use a nomenclature based on Table 1, in which a and b are the numbers of candidates who pass the screen and who would, respectively, pass and fail the total test, and c and d are the numbers of candidates who fail the screen and who would, respectively, pass and fail the total test.

Using this nomenclature, classic ROC analysis would describe the false-positive rate as the proportion of candidates who would fail the total test but who nonetheless pass on the basis of the screen:

$$\text{false-positive rate} = \frac{b}{b + d}$$

and the false-negative rate as the proportion of candidates who would pass the total test but who nonetheless fail the screen and therefore must take the full test:

$$\text{false-negative rate} = \frac{c}{a + c}$$

Thus, the calculation for the total error rate is:

$$\text{total error rate} = \frac{b}{b + d} + \frac{c}{a + c}$$

and the best cut point is the score that minimizes the total error rate.
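As a minimal sketch of this procedure, the following Python code tabulates the four cells for each candidate cut point and picks the cut point with the smallest unweighted total error rate. The function names, the toy scores, and the convention that a screen score at or above the cut counts as a pass are assumptions made for illustration; the cell labels follow the usage in the text (b and d are full-test failures, c and d are screen failures).

```python
import numpy as np

def confusion_counts(screen, passed_full, cut):
    """Return (a, b, c, d) for one screen cut point.

    a: pass screen, pass full test    b: pass screen, fail full test
    c: fail screen, pass full test    d: fail screen, fail full test
    """
    screen = np.asarray(screen, dtype=float)
    passed_full = np.asarray(passed_full, dtype=bool)
    pass_screen = screen >= cut  # assumption: scores at or above the cut pass
    a = int(np.sum(pass_screen & passed_full))
    b = int(np.sum(pass_screen & ~passed_full))
    c = int(np.sum(~pass_screen & passed_full))
    d = int(np.sum(~pass_screen & ~passed_full))
    return a, b, c, d

def total_error_rate(a, b, c, d):
    """Unweighted sum of the false-positive and false-negative rates.

    Assumes at least one full-test failure and one full-test pass.
    """
    return b / (b + d) + c / (a + c)

def best_roc_cut(screen, passed_full, cuts):
    """Return the cut point minimizing the total error rate, plus all errors."""
    errors = [total_error_rate(*confusion_counts(screen, passed_full, cut))
              for cut in cuts]
    return cuts[int(np.argmin(errors))], errors

# Hypothetical data: ten candidates' screen scores and full-test outcomes.
screen = [3, 4, 5, 5, 6, 6, 7, 8, 8, 9]
passed_full = [False, False, False, True, False, True, True, True, True, True]
cuts = [4, 5, 6, 7, 8]

best_cut, errors = best_roc_cut(screen, passed_full, cuts)
print(best_cut, [round(e, 3) for e in errors])
```

In this toy data set the total error rate falls as the cut point is raised from 4 to 7 and rises again at 8, so the midrange cut point of 7 is selected.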

It is worth noting that, consistent with the traditional use of ROC analysis, Colliver et al.^{3} and Cass et al.^{6} use the minimum of the sum of the *squared* error rates to determine the best cut point, but for the purposes of demonstration here, we describe the total error rate as the sum of the simple error rates.

It is also worth noting that, in the formulation described above, the proportional “cost” of a false-negative result is equal to the proportional “cost” of a false-positive result. This need not be the case. It might be argued that, in fact, it is more “costly” to inappropriately pass a candidate on the basis of the screen alone than to unnecessarily ask an individual to sit the full examination. To accommodate the relative importance of the two types of errors, one could weight them differently (by applying a multiplier to one value or the other). The value of the weight is the relative cost of the error (a weighting of two indicates that one type of error is twice as costly as the other). When one error type is weighted more heavily, changes in that error rate will have more impact on the total error rate than changes in the other, and the cut point that minimizes the total error will shift accordingly to favor reducing the more heavily weighted error. For the current demonstration, however, we continue to weight the two types of errors equally at 1.
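The effect of such a weighting can be sketched in a few lines of Python. The cell counts below are hypothetical, chosen only so that six candidates pass and four fail the full test at every cut point, and the parameter names `w_fp` and `w_fn` are illustrative rather than taken from any cited method.

```python
def weighted_error_rate(a, b, c, d, w_fp=1.0, w_fn=1.0):
    """Weighted sum of the false-positive and false-negative rates."""
    return w_fp * b / (b + d) + w_fn * c / (a + c)

# Hypothetical (a, b, c, d) counts at three screen cut points,
# ordered from strict (7) to lenient (5).
counts_by_cut = {
    7: (3, 0, 3, 4),
    6: (5, 1, 1, 3),
    5: (6, 2, 0, 2),
}

def best_cut(w_fp, w_fn):
    """Cut point with the smallest weighted total error rate."""
    return min(
        counts_by_cut,
        key=lambda cut: weighted_error_rate(*counts_by_cut[cut], w_fp, w_fn),
    )

equal_weights_cut = best_cut(1.0, 1.0)   # errors weighted equally
fp_doubled_cut = best_cut(2.0, 1.0)      # false positives twice as costly
print(equal_weights_cut, fp_doubled_cut)
```

With equal weights the midrange cut point has the smallest total error. Doubling the weight on false positives moves the minimum to the strictest cut point, where no failing candidate is passed by the screen.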

For the loss function Muijtjens and his colleagues introduced, inefficiency (*N*) is calculated as the raw number of candidates who failed the screen and had to take the entire test (c + d), and the false positive (*FP*) is defined as the raw number of candidates who would have failed the total test but nonetheless passed the screen (b). They add these two components, but multiply the false positives by a weighting factor (*W*). Thus, *loss* is calculated as:

$$\mathit{loss} = \frac{N + W \cdot FP}{1 + W}$$

or, using our nomenclature:

$$\mathit{loss} = \frac{(c + d) + W \cdot b}{1 + W}$$

The weighting factor used here serves the same purpose as the weighting factor described for classic ROC analysis above, providing differential importance to the two forms of error. It should also be noted that the value (1 + *W*) is used merely as a scalar to ensure that the weights associated with *N* and *FP* sum to 1 for all values of *W* (although this scaling factor has no implication for selecting the best cut point for a given weight). As with the ROC analysis, the best cut point is the one that minimizes the loss function for a given value of *W*.
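A short Python sketch may make the role of *W* concrete. It assumes the loss takes the form (N + W·FP)/(1 + W), which is how the text describes it, and it uses hypothetical cell counts at three cut points; the function and variable names are illustrative.

```python
def loss(a, b, c, d, w):
    """Loss as described in the text: (N + W * FP) / (1 + W).

    N  = c + d  (candidates who fail the screen and take the full test)
    FP = b      (candidates passed by the screen who would fail the full test)
    """
    n = c + d
    fp = b
    return (n + w * fp) / (1 + w)

# Hypothetical (a, b, c, d) counts at three screen cut points,
# ordered from strict (7) to lenient (5).
counts_by_cut = {
    7: (3, 0, 3, 4),
    6: (5, 1, 1, 3),
    5: (6, 2, 0, 2),
}

def best_cut_by_loss(w):
    """Cut point with the smallest loss for a given weight W."""
    return min(counts_by_cut, key=lambda cut: loss(*counts_by_cut[cut], w))

def best_cut_unscaled(w):
    """Same minimization without the 1 / (1 + W) scaling factor."""
    return min(
        counts_by_cut,
        key=lambda cut: (1 + w) * loss(*counts_by_cut[cut], w),
    )

print(best_cut_by_loss(1), best_cut_by_loss(4))
```

In this example a weight of 1 selects the most lenient cut point and a weight of 4 selects the strictest, which illustrates how strongly the chosen cut point depends on *W*. Dropping the 1/(1 + *W*) factor leaves the selected cut point unchanged for a given weight.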

The core of our argument for the equivalence of the two models (the ROC model and the model using the loss function) lies in the fact that, in their article, Muijtjens et al. provide no justification for selecting a value for *W*. In fact, they state explicitly that “it is difficult to decide what the value of *W* should be, because it involves an arbitrary judgment,”^{8, p.371} and they select a variety of arbitrary weights, from 1 to 128, in their paper. The selection of a weight appears particularly arbitrary in the loss function because *N* and *FP* are calculated in raw numbers and, therefore, do not have any numeric equivalence in the equation.

Thus, we suggest a sensible, theoretically grounded method for selecting an initial value of *W*. Because *N* and *FP* are calculated as raw scores, one good reason for having a weighting system is to place the values of *N* and *FP* on the same scale, which, in essence, gives the values of *N* and *FP* equal weight. The most practical method for placing both *N* and *FP* on the same scale is to proportionalize each value, so that each of the measures ranges from 0 to 1. To proportionalize their *FP* in this way, we would divide their *FP* by the maximum *FP* possible. Theoretically, if all who fail the overall test were, nonetheless, passed by the screen, their *FP* would be equal to the total number of people who fail the overall test. Thus, the maximum *FP* value is the total number of test failures (b + d), and the proportionalized *FP* would be calculated as

$$\text{proportionalized } FP = \frac{FP}{b + d} = \frac{b}{b + d}$$

Likewise, to rescale *N* to a scale of 0 to 1, we would divide *N* by the maximum possible *N*. Theoretically the screen could fail all candidates, so the maximum *N* is the total number of test takers (a + b + c + d). Therefore the proportionalized *N* would be calculated as

$$\text{proportionalized } N = \frac{N}{a + b + c + d} = \frac{c + d}{a + b + c + d}$$

Of course, in the original formulation of *loss*, *N* is unweighted. To maintain the formula's original structure, we multiply the proportionalized values of both *N* and *FP* by the total number of test takers. Thus, to place *N* and *FP* on the same scale (a scale ranging from zero to the total number of test takers) we would calculate *W* to be

$$W = \frac{a + b + c + d}{b + d}$$
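The arithmetic can be checked with a small worked example in Python. The cell counts are hypothetical and the function name is illustrative; the point is only that multiplying the proportionalized values by the total number of test takers returns *N* unchanged and returns *FP* multiplied by the ratio of test takers to full-test failures.

```python
def equalizing_weight(a, b, c, d):
    """Weight that places N and FP on the same scale: total / failures.

    b + d is the number of full-test failures, which does not depend on
    the screen cut point, so this weight is the same for every cut point.
    """
    return (a + b + c + d) / (b + d)

# Hypothetical counts at one cut point: 10 candidates, 4 full-test failures.
a, b, c, d = 5, 1, 1, 3
total = a + b + c + d

w = equalizing_weight(a, b, c, d)

prop_n = (c + d) / total   # proportionalized N, between 0 and 1
prop_fp = b / (b + d)      # proportionalized FP, between 0 and 1

rescaled_n = prop_n * total    # back on the raw scale: equals N = c + d
rescaled_fp = prop_fp * total  # equals W * FP = W * b

print(w, rescaled_n, rescaled_fp)
```

Here the weight works out to 10 / 4 = 2.5, so each false positive counts as much as 2.5 candidates sent on to the full test.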

When we used this value of *W*, thereby equating the scales for *N* and *FP*, an interesting phenomenon arose. We created a series of simulated (Monte Carlo) data sets that mimicked the important properties of a typical sequential testing data set. These simulated data sets were generated with a wide range of failure rates for the total test and with a wide range of correlations between the screen scores and the total test scores. For each of these data sets, we calculated the best cut point as determined by Muijtjens et al.'s model using the loss function with our theoretically derived value of *W*, and the best cut point as determined by the ROC analysis minimizing the unweighted sum of the simple error rates. In these Monte Carlo simulations, regardless of the failure rate or the correlation of the screen with the full test, we found that the best cut point generated by minimizing loss was always identical to the best cut point generated by minimizing the unweighted sum of the simple error rates in the ROC analysis. In fact, for any given simulation, the correlation between the ROC error rate and the loss function across the range of cut scores was always 1.0.
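A simulation of this kind can be sketched as follows. This is not the original simulation code: the bivariate normal score model, the sample size, the grid of cut points, the particular failure rates and correlations, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def simulate(n, failure_rate, rho, rng):
    """Draw correlated screen and full-test scores; flag full-test passes."""
    cov = [[1.0, rho], [rho, 1.0]]
    screen, full = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    pass_mark = np.quantile(full, failure_rate)  # fixes the failure rate
    return screen, full >= pass_mark

def curves(screen, passed_full, cuts):
    """ROC total error and loss (with W = total / failures) at each cut."""
    total = len(screen)
    fails = int(np.sum(~passed_full))
    w = total / fails  # scale-equating weight
    roc_error, loss = [], []
    for cut in cuts:
        pass_screen = screen >= cut
        a = np.sum(pass_screen & passed_full)
        b = np.sum(pass_screen & ~passed_full)
        c = np.sum(~pass_screen & passed_full)
        d = np.sum(~pass_screen & ~passed_full)
        roc_error.append(b / (b + d) + c / (a + c))
        loss.append((c + d + w * b) / (1 + w))
    return np.array(roc_error), np.array(loss)

rng = np.random.default_rng(0)
cuts = np.linspace(-2.0, 2.0, 41)
results = []
for failure_rate in (0.1, 0.3, 0.5):
    for rho in (0.3, 0.6, 0.9):
        screen, passed_full = simulate(500, failure_rate, rho, rng)
        roc_error, loss = curves(screen, passed_full, cuts)
        # The cut point that minimizes ROC error should also minimize loss.
        same_minimum = bool(np.isclose(loss[np.argmin(roc_error)], loss.min()))
        r = float(np.corrcoef(roc_error, loss)[0, 1])
        results.append((failure_rate, rho, same_minimum, r))

for row in results:
    print(row)
```

The check compares loss values with a tolerance rather than comparing the positions of the two minima directly, so that exact ties between cut points cannot produce a spurious mismatch through floating-point rounding.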

Thus, although the logic Muijtjens et al. use is interesting, their model involves the use of a weighting value that can vary widely and includes no theoretically sound reason for selecting one weighting value over another. When a theoretically derived method for selecting a value is provided (i.e., equating the scaling for their two forms of “error”: *N* and *FP*), their model, in fact, reduces almost exactly to a simplified version of the ROC model from a mathematical standpoint and generates the identical solution from a practical standpoint.
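One way to see why the two criteria agree is to work through the algebra. The derivation below is a sketch that assumes the loss has the form (N + W·FP)/(1 + W) with N = c + d and FP = b. The symbols T, F, P, and E are introduced here only for convenience: T = a + b + c + d is the number of test takers, F = b + d the number who fail the full test, P = a + c the number who pass it, and E the unweighted total error rate.

```latex
\begin{align*}
W &= \frac{T}{F}, \qquad d = F - b, \qquad T - F = P \\
(1 + W)\,\mathit{loss}
  &= c + d + \frac{T}{F}\,b \\
  &= c + (F - b) + \frac{T}{F}\,b \\
  &= c + F + \frac{T - F}{F}\,b \\
  &= c + F + \frac{P}{F}\,b \\
  &= P\left(\frac{b}{F} + \frac{c}{P}\right) + F \\
  &= P \cdot E + F
\end{align*}
```

Because T, F, P, and W are fixed by the data set and do not change with the screen cut point, the loss under this weight is an increasing linear function of E. The two measures are therefore minimized at the same cut point and correlate perfectly across cut points.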

Further, there are several advantages to the ROC model. First, it begins with the situation where the proportional “cost” of a false negative is equal to that of a false positive. Thus, weighting one value or the other to increase its relative cost is transparent. The value of the weight is the relative cost of the error (a weighting of 2 indicates that one type of error is twice as costly as the other). Second, extensive literature already exists for the ROC function that addresses issues such as the use of the sum of the simple error rates versus the sum of the squared error rates to determine the best minimum. We suggest, therefore, that despite Muijtjens et al.'s interesting arguments for the development of the loss function, the ROC model continues to be a useful and effective method for selecting the cut point in a sequential testing context.