The objective structured clinical examination (OSCE), while highly successful as a tool for evaluating clinical skills, tends to be expensive and time-consuming.1 In an effort to reduce the cost and time required to generate reliable and valid scores on the OSCE, several authors have recommended the use of sequential testing.2,3,4,5,6,7,8 Sequential testing involves all candidates' taking a short screening test consisting of a subset of stations from the total examination. Based on scores from this screening test, predictions are made about each student's likely performance on the full test. Candidates who are identified as highly likely to pass the full test (“clear passes”) are passed on the basis of the screen alone and are exempted from the remainder of the examination. Candidates who are not identified as “clear passes” take the full examination and pass/fail decisions for these candidates are based on this more stable measure of the candidates' abilities.
Clearly, the utility of the screen depends on its ability to accurately identify a large proportion of the candidates who would pass the full test. However, when the correlation between the screen score and the total test score is less than perfect, there is a tradeoff between the accuracy of this prediction and the proportion of candidates who are exempted from the total examination. When the correlation between the screen and the full test is moderate, the test administrator must either set a high standard on the screen, forcing a large number of individuals who will pass to nonetheless take the full test, or accept that a substantial proportion of the candidates who would fail the full test will be inappropriately passed on the basis of the screen.
Traditionally, the receiver operating characteristic (ROC) curve has been used as the method for determining the optimal cut point on the screen.2,3,4,5,6,7 The ROC method is applied to a previous data set in which all candidates have taken both the screen and the full test. Using this type of data set, a series of cut points for the screen is selected that range from very strict (almost no candidates are passed on the basis of the screen alone) to very lenient (almost all candidates are passed on the basis of the screen alone). For each cut point, two types of errors are calculated: the false-positive rate (the proportion of candidates who would have been passed by the screen, but who would fail the full test) and the false-negative rate (the proportion of candidates who would have had to take the full examination and would pass the full examination). When the cut point selected for the screen is very strict, the false-positive rate is quite low but the false-negative rate rises sharply. When the cut point selected is very lenient, the false-negative rate is low but the false-positive rate rises sharply. A midrange cut point balances the two forms of error. Authors have suggested a variety of methods for balancing these two forms of error. Yet, until fairly recently, nobody questioned the underlying premise of the ROC method that these are the two forms of error that ought to be balanced.
Thus, the work of Muijtjens and his colleagues8 in which they introduced a new method for evaluating the optimum cut score for the screening test is interesting. In developing their rationale for introducing this new method, Muijtjens et al. argue that the appropriateness of classic ROC analysis is questionable. Specifically, they suggest that the traditional use of the false-negative rate as a measure of errors is inappropriate in the context of sequential testing. Their claim is based on the fact that all who fail the screen take the full test, so every candidate who should pass the test does. That is, nobody is inappropriately failed based only on his or her performance on the screen, and therefore the false-negative rate is functionally zero for the sequential testing procedure. Instead, they suggest, because all candidates who fail the screen must complete the entire test, it would be more appropriate to treat all individuals who fail the screen as an equal source of inefficiency in the screen regardless of whether they pass or fail the full test. The greater the number of individuals who must take the entire test, the less efficient the screen. This assertion, of course, is balanced against their recognition that the more efficient the test (the smaller the number of candidates who take the entire test), the higher the likelihood of false positives (people who are passed on the basis of the screen but who would have failed had they taken the entire test). Thus, whereas the classic ROC analysis attempts to minimize the overall error rate (false-negative rate + false-positive rate), Muijtjens et al.'s model attempts to minimize the combination of inefficiency and false positives (what they call the loss function).
We were intrigued by the theoretical argument Muijtjens et al. have made and we wanted to examine the impact of their formulation on decisions that might be made relative to the traditional ROC model. Interestingly, when we tried comparing their formulation based on the loss function and the ROC formulation, we were surprised to find that when certain theoretically derived conditions are met, the loss function introduced by Muijtjens et al. leads to results that are functionally identical to those obtained by the classic ROC analysis. This commentary is an effort to share our findings and describe our understanding of the reasons for this equivalence. To do so, we use a nomenclature based on Table 1.
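Table 1 is not reproduced in this text. Judging from how the cells are used later (b is a candidate who passes the screen but would fail the full test, c + d is everyone who fails the screen, and b + d is everyone who fails the full test), it appears to be the usual two-by-two cross-classification of screen result against full-test result. The sketch below tallies those four cells under that reading. The function name `tally_cells`, the convention that a score at or above the cut is a pass, and the example scores are assumptions made for illustration only.

```python
from typing import NamedTuple, Sequence


class Cells(NamedTuple):
    a: int  # pass screen, pass full test
    b: int  # pass screen, fail full test (false positive)
    c: int  # fail screen, pass full test
    d: int  # fail screen, fail full test


def tally_cells(screen: Sequence[float], full: Sequence[float],
                screen_cut: float, full_cut: float) -> Cells:
    """Cross-classify candidates by screen result and full-test result."""
    a = b = c = d = 0
    for s, f in zip(screen, full):
        passes_screen = s >= screen_cut
        passes_full = f >= full_cut
        if passes_screen and passes_full:
            a += 1
        elif passes_screen:
            b += 1
        elif passes_full:
            c += 1
        else:
            d += 1
    return Cells(a, b, c, d)


# Hypothetical scores for six candidates (not data from the article).
screen = [72, 65, 58, 80, 49, 61]
full = [70, 55, 62, 78, 45, 66]
cells = tally_cells(screen, full, screen_cut=60, full_cut=60)
print(cells)
```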
Using this nomenclature, classic ROC analysis would describe the false-positive rate as the proportion of candidates who would fail the total test but who nonetheless pass on the basis of the screen:
$$\text{false-positive rate} = \frac{b}{b + d}$$
and the false-negative rate as the proportion of candidates who would pass the total test but who nonetheless fail the screen and therefore must take the full test:
$$\text{false-negative rate} = \frac{c}{a + c}$$
Thus, the calculation for the total error rate is:
$$\text{total error rate} = \frac{b}{b + d} + \frac{c}{a + c}$$
and the best cut point is the score that minimizes the total error rate.
It is worth noting that, consistent with the traditional use of ROC analysis, Colliver et al.3 and Cass et al.6 use the minimum of the sum of the squared error rates to determine the best cut point, but for the purposes of demonstration here, we describe the total error rate as the sum of the simple error rates.
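The selection rule just described can be sketched in a few lines. This is an illustrative sketch rather than the procedure used in any of the cited studies: the function names, the candidate cut points, and the scores are hypothetical, and a score at or above a cut is assumed to count as a pass. The `squared` flag switches between the sum of the simple error rates and the sum of the squared error rates.

```python
def error_rates(screen, full, screen_cut, full_cut):
    """Return (false-positive rate, false-negative rate) for one screen cut."""
    # b: passed by the screen but would fail the full test
    b = sum(s >= screen_cut and f < full_cut for s, f in zip(screen, full))
    # c: failed by the screen but would pass the full test
    c = sum(s < screen_cut and f >= full_cut for s, f in zip(screen, full))
    fails = sum(f < full_cut for f in full)   # b + d
    passes = len(full) - fails                # a + c
    return b / fails, c / passes


def best_cut(screen, full, cuts, full_cut, squared=False):
    """Pick the screen cut that minimizes the (simple or squared) total error."""
    def total_error(cut):
        fp, fn = error_rates(screen, full, cut, full_cut)
        return fp ** 2 + fn ** 2 if squared else fp + fn
    return min(cuts, key=total_error)


# Hypothetical scores for eight candidates (not data from the article).
screen = [72, 65, 58, 80, 49, 61, 55, 68]
full = [70, 55, 62, 78, 45, 66, 50, 64]
cuts = [50, 55, 60, 65, 70]

simple_choice = best_cut(screen, full, cuts, full_cut=60)
squared_choice = best_cut(screen, full, cuts, full_cut=60, squared=True)
print(simple_choice, squared_choice)
```

In this small example both criteria select the same cut point, although in general the simple and squared sums need not agree.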
It is also worth noting that, in the formulation described above, the proportional “cost” of a false-negative result is equal to the proportional “cost” of a false-positive result. This need not be the case. It might be argued that, in fact, it is more “costly” to inappropriately pass a candidate on the basis of the screen alone than to unnecessarily ask an individual to sit the full examination. To accommodate the relative importance of the two types of errors, one could weight them differently (by applying a multiplier to one value or the other). The value of the weight is the relative cost of the error (a weighting of two indicates that one type of error is treated as twice as costly as the other). When one error type is weighted more heavily, changes in that error rate will have more impact on the total error rate than changes in the other, and the cut point that minimizes the total error will shift accordingly, favoring reduction of the more heavily weighted error. For the current demonstration, however, we continue to weight the two types of errors equally at 1.
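Written out, a weighted version of the total error rate might take the following form. The multiplier symbols $w_{FP}$ and $w_{FN}$ are introduced here for illustration and are not the authors' notation; the cells a, b, c, and d are read as in the two-by-two cross-classification of screen result against full-test result.
$$\text{weighted total error rate} = w_{FP}\,\frac{b}{b + d} + w_{FN}\,\frac{c}{a + c}$$
Setting $w_{FP} = w_{FN} = 1$ gives the unweighted case used in this demonstration, while $w_{FP} = 2$ and $w_{FN} = 1$ would treat an inappropriate pass as twice as costly as an unnecessary full examination.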
For the loss function Muijtjens and his colleagues introduced, inefficiency (N) is calculated as the raw number of candidates who failed the screen and had to take the entire test (c + d), and the false positive (FP) is defined as the raw number of candidates who would have failed the total test but nonetheless passed the screen (b). They add these two components, but multiply the false positives by a weighting factor (W). Thus, loss is calculated as:
$$\text{Loss} = \frac{N + W \cdot FP}{1 + W}$$
or, using our nomenclature:
$$\text{Loss} = \frac{(c + d) + W \cdot b}{1 + W}$$
The weighting factor used here serves the same purpose as the weighting factor described for classic ROC analysis above, providing differential importance to the two forms of error. It should also be noted that the value (1 + W) is used merely as a scalar to ensure that the weights associated with N and FP sum to 1 for all values of W (although this scaling factor has no implication for selecting the best cut point for a given weight). As with the ROC analysis, the best cut point is the one that minimizes the loss function for a given value of W.
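The effect of W on the chosen cut point can be illustrated with a small sketch. The loss is computed as the text describes it, (N + W × FP) / (1 + W); the function names and the three (N, FP) pairs are hypothetical values chosen for illustration, not figures from Muijtjens et al.

```python
def loss(n_full_test: int, false_positives: int, w: float) -> float:
    """Loss as described in the text: (N + W * FP) / (1 + W)."""
    return (n_full_test + w * false_positives) / (1 + w)


# Hypothetical (N, FP) counts at three screen cut points for eight candidates:
# a stricter cut sends more candidates to the full test but passes fewer in error.
counts_by_cut = {50: (1, 2), 60: (3, 1), 70: (6, 0)}


def best_cut(w: float) -> int:
    """Return the cut point with the smallest loss for a given weight W."""
    return min(counts_by_cut, key=lambda cut: loss(*counts_by_cut[cut], w))


lenient_choice = best_cut(1)  # false positives weighted lightly
strict_choice = best_cut(8)   # false positives weighted heavily
print(lenient_choice, strict_choice)
```

With these made-up counts, a small W favors the most lenient cut and a large W favors the strictest one, which is why the choice of W matters so much for the result.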
The core of our argument for the equivalence of the two models (the ROC model and the model using the loss function) lies in the fact that, in their article, Muijtjens et al. provide no justification for selecting a value for W. In fact, they state explicitly that “it is difficult to decide what the value of W should be, because it involves an arbitrary judgment,”8, p.371 and they select a variety of arbitrary weights, from 1 to 128, in their paper. The selection of a weight appears particularly arbitrary in the loss function because N and FP are calculated in raw numbers and, therefore, do not have any numeric equivalence in the equation.
Thus, we suggest a very sensible, theoretical method for selecting an initial value of W. Because N and FP are calculated as raw scores, one good reason for having a weighting system is to place the values of N and FP on the same scale, which, in essence, gives the values of N and FP equal weight. The most practical method for placing both N and FP on the same scale is to proportionalize each value, that is, to have each of the measures range from 0 to 1. To proportionalize their FP in this way, we would divide their FP by the maximum FP possible. Theoretically, if all who fail the overall test were, nonetheless, passed by the screen, their FP would be equal to the total number of people who fail the overall test. Thus, the maximum FP value is the total number of test failures (b + d), and the proportionalized FP would be calculated as
$$\text{proportionalized FP} = \frac{b}{b + d}$$
Likewise, to rescale N to a scale of 0 to 1, we would divide N by the maximum possible N. Theoretically the screen could fail all candidates, so the maximum N is the total number of test takers (a + b + c + d). Therefore the proportionalized N would be calculated as
$$\text{proportionalized N} = \frac{c + d}{a + b + c + d}$$
Of course, in the original formulation of loss, N is unweighted. To maintain the formula's original structure, we multiply the proportionalized values of both N and FP by the total number of test takers. Thus, to place N and FP on the same scale (a scale ranging from zero to the total number of test takers) we would calculate W to be
$$W = \frac{a + b + c + d}{b + d}$$
Functionally, when we used this value of W, thereby equating the scales for N and FP, an interesting phenomenon arose. We created a series of simulated (Monte Carlo) data sets that mimicked the important properties of a typical sequential testing data set. These simulated data sets were generated with a wide range of failure rates for the total test and with a wide range of correlations between the screen scores and the total test scores. For each of these data sets, we calculated the best cut point as determined by Muijtjens et al.'s model using the loss function with our theoretically derived value of W, and the best cut point as determined by the ROC analysis minimizing the unweighted sum of the simple error rates. In these Monte Carlo simulations, regardless of the failure rate or the correlation of the screen with the full test, we found that the best cut point generated by minimizing loss was always identical to the best cut point generated by minimizing the unweighted sum of the simple error rates in the ROC analysis. In fact, for any given simulation, the correlation between the ROC error rate and the loss function across the range of cut scores was always 1.0.
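A comparison of this kind could be set up along the following lines. This is a minimal sketch and not the simulation code used for this commentary: the bivariate-normal score model, the sample size of 500, the grid of cut points, and the particular correlations and failure rates are all assumptions chosen for illustration. Exact fractions are used so that tied cut points are handled identically under both criteria.

```python
import random
from fractions import Fraction


def simulate(n, rho, fail_rate, rng):
    """Draw correlated (screen, full) scores and a full-test pass mark."""
    pairs = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        full = z1
        screen = rho * z1 + (1 - rho ** 2) ** 0.5 * z2
        pairs.append((screen, full))
    # Choose the pass mark so that about fail_rate of candidates fail the
    # full test, keeping at least one failure and at least one pass.
    n_fail = max(1, min(n - 1, round(fail_rate * n)))
    full_cut = sorted(f for _, f in pairs)[n_fail]
    return pairs, full_cut


def compare(pairs, full_cut, cuts):
    """Return (best cut by ROC total error, best cut by loss with W = total/fails)."""
    total = len(pairs)
    fails = sum(f < full_cut for _, f in pairs)   # b + d
    passes = total - fails                        # a + c
    w = Fraction(total, fails)                    # W = (a + b + c + d) / (b + d)
    roc, loss = [], []
    for cut in cuts:
        b = sum(s >= cut and f < full_cut for s, f in pairs)
        c = sum(s < cut and f >= full_cut for s, f in pairs)
        d = fails - b
        roc.append(Fraction(b, fails) + Fraction(c, passes))
        loss.append(((c + d) + w * b) / (1 + w))
    best_roc = min(range(len(cuts)), key=roc.__getitem__)
    best_loss = min(range(len(cuts)), key=loss.__getitem__)
    return cuts[best_roc], cuts[best_loss]


rng = random.Random(0)
cuts = [x / 10 for x in range(-20, 21)]  # screen cut points in z-score units
results = []
for rho in (0.3, 0.6, 0.9):
    for fail_rate in (0.05, 0.2, 0.5):
        pairs, full_cut = simulate(500, rho, fail_rate, rng)
        results.append(compare(pairs, full_cut, cuts))

agree = all(by_roc == by_loss for by_roc, by_loss in results)
print(agree)
```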
Thus, although the logic Muijtjens et al. use is interesting, their model involves the use of a weighting value that can vary widely and includes no theoretically sound reason for selecting one weighting value over another. When a theoretically derived method for selecting a value is provided (i.e., equating the scaling for their two forms of “error”: N and FP), their model, in fact, reduces almost exactly to a simplified version of the ROC model from a mathematical standpoint and generates the identical solution from a practical standpoint.
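One way to see why the two criteria coincide is to substitute the scale-equating value of W into the loss function. The short derivation below is a sketch that assumes the cell definitions implied by the text (a and b pass the screen, c and d fail it, and b and d would fail the full test). The shorthand T, F, P, and E is introduced here for convenience and is not the authors' notation.
$$
\begin{aligned}
T &= a + b + c + d, \qquad F = b + d, \qquad P = a + c, \qquad E = \frac{b}{F} + \frac{c}{P}, \qquad W = \frac{T}{F} \\
(c + d) + W \cdot b &= c + (F - b) + \frac{T}{F}\,b \\
&= c + F + \frac{T - F}{F}\,b \\
&= c + F + \frac{P}{F}\,b \\
&= P\left(\frac{c}{P} + \frac{b}{F}\right) + F \\
&= P \cdot E + F \\
\text{Loss} &= \frac{P \cdot E + F}{1 + W}
\end{aligned}
$$
Because P, F, and W depend only on the full-test results and not on the screen cut point, the loss is an increasing linear function of the total error rate E. Both quantities are therefore minimized at the same cut point, and their correlation across cut points is 1.0 whenever E varies.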
Further, there are several advantages to the ROC model. First, it begins with the situation where the proportional “cost” of a false negative is equal to that of a false positive. Thus, weighting one value or the other value to increase its relative cost is transparent. The value of the weight is the relative cost of the error (a weighting of 2 indicates that one type of error is treated as twice as costly as the other). Second, extensive literature already exists for the ROC function that addresses issues such as the use of the sum of the simple error rates versus the sum of the squared error rates to determine the best cut point. We suggest, therefore, that despite Muijtjens et al.'s interesting arguments for the development of the loss function, the ROC model continues to be a useful and effective method for selecting the cut point in a sequential testing context.
1. Cusimano MD, Cohen R, Tucker W, Murnaghan J, Kodama R, Reznick R. A comparative analysis of the costs of administration of an OSCE. Acad Med. 1994;69:571–6.
2. Colliver JA, Mast TA, Vu NV, Barrows HS. Sequential testing with a performance-based examination using standardized patients. Acad Med. 1991;66(9 suppl):S64–S66.
3. Colliver JA, Vu NV, Barrows HS. Screening test length for sequential testing with a standardized-patient examination: a receiver operating characteristic (ROC) analysis. Acad Med. 1992;67:594–7.
4. Colliver JA, Markwell SJ, Travis TA, Schrage JP, Vu NV. Sequential testing with a standardized-patient examination: an ROC analysis of the effects of case–total correlations and difficulty levels of screening test cases. In: Cohen R, Rothman A (eds). Proceedings of The Sixth Ottawa Conference on Medical Education. Toronto, ON, Canada: University of Toronto Bookstore Custom Publishing, 1994:170–3.
5. Rothman AI, Ross J, Cohen R, Poldre P. Sequential testing in clinical skills assessment. Presented at the Fifth Ottawa International Conference on Assessment of Clinical Performance, Dundee, Scotland, September 1–3, 1992.
6. Cass A, Regehr G, Reznick R, Rothman A, Cohen R. Sequential testing in the objective structured clinical examination: selecting items for the screen. Acad Med. 1997;72(10 suppl 1):S25–S27.
7. Rothman AI, Blackmore DE, Dauphinée WD, Reznick R. Tests of sequential testing in two years' results of Part 2 of the Medical Council of Canada Qualifying Examination. Acad Med. 1997;72(10 suppl 1):S22–S24.
8. Muijtjens AMM, van Vollenhoven FHM, van Luijk SJ, van der Vleuten CPM. Sequential testing in the assessment of clinical skills. Acad Med. 2000;75:369–73.