# Classification Accuracy for Tests That Allow Retakes

Correspondence: Brian Clauser, EdD, National Board of Medical Examiners, 3750 Market Street, Philadelphia, PA 19104-3190.

When tests are used to make classification decisions, estimates of decision consistency and false-positive and false-negative error rates may be more appropriate than reliability as a means of characterizing precision. The decision-theoretic framework that supports this approach is not new,^{1,2} and numerous authors have recommended indices for use in this context.^{3} However, one aspect of this framework that has particular significance for licensure and certification testing and yet has received relatively little attention is the impact of retakes.

When an examinee receives a passing score on a licensure test, a final classification is typically made. This may be a correct classification or it may be incorrect. (In this paper, when a non-proficient examinee is classified as proficient, it will be referred to as a false-positive error.) When an examinee receives a failing score, the examinee will usually be given the opportunity to repeat the test. An ultimate failing classification will occur only when the examinee has given up or exhausted the allowable retake opportunities. Millman^{4} described this effect and noted that this process provides necessary protection for the examinee. The examinee's proficiency is not measured without error. When a proficient examinee fails a licensure examination due to measurement error, it is appropriate that the examinee should have an additional opportunity to demonstrate proficiency. However, providing additional opportunities for testing not only corrects errors that would penalize proficient examinees but also creates errors that favor non-proficient examinees. These errors may put the public at risk by allowing unqualified candidates to become licensed or certified.

This paper presents a theoretical framework describing the factors that influence classification error rates over multiple administrations of a test. It then considers some of the strategies that are available for controlling the false-positive error rate. The main emphasis of the paper is to provide the reader with a realistic sense of the magnitude of the inflation of false-positive errors that may result when retakes are allowed and to provide a framework for considering what, if anything, should be done to control this inflation.

A starting point for this discussion is to ask whether allowing retakes has a significant impact on the classification error rate that occurs under typical practice conditions. Consider the example of a test with a reliability of .92, applied to a normally distributed examinee group where 10% of examinees are non-proficient. If the cut-score is established to fail 10% of examinees (in the population), with a single administration approximately 2% of all examinees might be expected to be misclassified as proficient. (Since 10% of the examinees are non-proficient, this is a false-positive rate of 20%.) After two retakes this rate is doubled. A false-positive rate this high based on a single administration would be expected from a test with a reliability of approximately .69. If the primary purpose of a licensure examination is protection of the public from non-proficient practitioners, it is clear that, in the circumstances described in this example, retakes do significantly impact the classification error rate, and thus the effectiveness of the test.

## A Psychometric Framework

For a given population and definition of proficiency, the cumulative effect of multiple retakes on the classification error rate can be described as a function of (1) the test reliability, (2) placement of the cut-score, and (3) the number of retakes. Decision theory argues that the relative cost of false-positive and false-negative errors must also be considered.

The impact of these influences on the false-positive rate can be expressed in terms of equations, but specific examples give a more intuitive view of these effects. For these examples, it is assumed that the proficiency of interest is normally distributed and that, consistent with the classical test theory model, each test score is influenced by the examinee's proficiency and random error. For this discussion it is assumed that examinee proficiency does not change between test administrations. In high-stakes settings, it is likely that some failing examinees will prepare for retakes and in the process change their proficiencies. Of course, those examinees who raise their proficiency levels sufficiently to become “proficient” will no longer represent false-positive errors when they are subsequently classified as such. It is also possible that some examinees will forget part of what they have learned in training, thereby lowering their proficiency levels. The theoretical framework presented in this paper is intended to inform decisions about practical assessment settings by describing how the effects of measurement error may be compounded across retakes. Although in practice examinees may change their proficiencies over time, to the extent that this occurs it changes the question about classification accuracy that this paper is intended to address.
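The model just described can be sketched numerically. The following is a minimal sketch under stated assumptions, not the computation used to produce the paper's figures: observed scores are standardized to unit variance (so true-score variance equals the reliability), errors are normal and independent across attempts, and every failing examinee retakes until the attempt limit. The function name `error_rates` and its parameters are illustrative.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal for observed scores and scaled errors


def error_rates(reliability, prop_nonproficient, first_time_fail_rate,
                max_attempts, grid=4000):
    """Return (false_positive_rate, false_negative_rate) after max_attempts.

    Assumptions (illustrative): observed score = true score + normal error,
    observed variance 1, reliability < 1, proficiency fixed across attempts,
    and every failing examinee retakes.
    """
    sd_true = math.sqrt(reliability)
    sd_error = math.sqrt(1.0 - reliability)
    true_dist = NormalDist(0.0, sd_true)
    # Examinees below this true score are defined as non-proficient.
    true_cut = true_dist.inv_cdf(prop_nonproficient)
    # Observed cut-score that fails the stated share of first-time takers.
    score_cut = STD_NORMAL.inv_cdf(first_time_fail_rate)

    lo, hi = -8.0 * sd_true, 8.0 * sd_true
    step = (hi - lo) / grid
    nonprof_mass = nonprof_pass = prof_mass = prof_fail = 0.0
    for i in range(grid):
        t = lo + (i + 0.5) * step  # midpoint rule over true scores
        weight = true_dist.pdf(t) * step
        p_fail_once = STD_NORMAL.cdf((score_cut - t) / sd_error)
        p_fail_all = p_fail_once ** max_attempts  # fails every attempt
        if t < true_cut:
            nonprof_mass += weight
            nonprof_pass += weight * (1.0 - p_fail_all)
        else:
            prof_mass += weight
            prof_fail += weight * p_fail_all
    return nonprof_pass / nonprof_mass, prof_fail / prof_mass


# False-positive and false-negative rates after one to four administrations.
for attempts in (1, 2, 3, 4):
    fp, fn = error_rates(0.90, 0.10, 0.10, attempts)
    print(attempts, round(fp, 3), round(fn, 4))
```

Because the probability of failing every attempt is the single-attempt failure probability raised to the number of attempts, the false-positive rate can only rise, and the false-negative rate can only fall, as administrations are added.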

In Figure 1, results are presented for tests with reliabilities of .60, .70, .80, and .90. The values represent the false-positive rates that would be expected after one, two, three, and four administrations of a test to a population of examinees in which 10% actually lacked the required proficiency. The results indicate that more than 20% of the non-proficient examinees would be expected to pass on the first administration even with a reliability of .90. By the fourth administration (third retake) more than half of the non-proficient examinees would have passed.

Consistent with expectations, test reliability does have a substantial impact on classification accuracy. However, in each case, allowing even a single retake increases the false-positive rate more than a reduction of .10 in test reliability does. With the highly respectable reliability of .90, allowing three retakes increases the false-positive rate to a point well beyond what would be expected after a single administration of a test with a reliability of .60. To provide a perspective on what a drop in reliability of this magnitude represents, if a test with reliability .90 had 400 items, a reliability of .60 could be achieved with fewer than 70 items.
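The item-count comparison is consistent with the Spearman-Brown formula, which relates test length to reliability when items are treated as parallel. That the paper's figure was derived this way is an assumption; the helper name `length_factor` is illustrative.

```python
def length_factor(current_reliability, target_reliability):
    """Spearman-Brown: multiplier on test length needed to move from the
    current reliability to the target reliability (parallel items assumed)."""
    return (target_reliability * (1.0 - current_reliability)) / (
        current_reliability * (1.0 - target_reliability))


# A 400-item test with reliability .90, shortened until reliability is .60.
items_needed = 400 * length_factor(0.90, 0.60)
print(round(items_needed, 1))  # 66.7
```

The length factor here is one sixth, so the shorter test would need roughly 67 items.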

In Figure 1, the results are based on a population in which 10% of the examinees are non-proficient. The cut-score is established to fail 10% of the first-time takers. The top panel in Figure 2 presents results for the same population for a test with reliability of .90. The four lines represent the false-positive rates based on four different cut-scores. The cut-scores are established to fail 5%, 10%, 15%, and 20% of first-time takers. As the cut-score is increased, the false-positive error rate drops dramatically, even for multiple retakes. The middle panel in Figure 2 shows that as the false-positive rate drops, the false-negative error rate (i.e., the proportion of truly proficient examinees failing the test) rises.

The top and middle panels of Figure 2 provide a context for introducing the decision-theoretic framework. These figures illustrate that the cost of reducing either type of error will be an increase in the other. Decision theory argues that an optimal decision is one that minimizes the cost of the errors. In the present situation, the placement of the cut-score might be considered optimal when the total cost associated with false-positive and false-negative errors is minimized. The cost associated with false-positive errors will equal the product of the false-positive rate, the proportion of non-proficient examinees in the population, and the relative cost of a false-positive error. The cost associated with false-negative errors is calculated analogously.
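The cost calculation described above can be written compactly. The symbols are introduced here for illustration and do not appear in the original: $\pi$ is the proportion of non-proficient examinees, $\mathrm{FP}$ is the proportion of non-proficient examinees who pass, $\mathrm{FN}$ is the proportion of proficient examinees who fail, and $c_{FP}$ and $c_{FN}$ are the relative costs of the two errors.

```latex
\[
C_{\text{total}} = c_{FP}\,\pi\,\mathrm{FP} \;+\; c_{FN}\,(1-\pi)\,\mathrm{FN}
\]
```

Under this reading, a cut-score is optimal when it minimizes $C_{\text{total}}$; with $c_{FP} = c_{FN}$ the expression reduces to the total error rate.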

The bottom panel in Figure 2 presents total error rates from the conditions represented in the top and middle panels. If the relative costs of errors are equal, the cost is minimized when this rate is minimized. In a system in which all failing examinees retake the examination until they have passed or completed four administrations, the bottom panel of Figure 2 suggests that the costs are minimized by establishing the standard so that 15% of examinees fail on the first administration. In the likely case that the relative costs of false-positive and false-negative errors are different, identifying an optimal strategy will require weighting the error rates by the relative costs.

The third factor affecting the false-positive error rate is the number of retakes. Figures 1 and 2 present results based on a maximum of four administrations for any examinee. It is apparent from the results that the largest increases in false-positive error rates occur with the initial retakes. Under the conditions represented in Figure 1, after four administrations the false-negative rate is effectively 0%; additional administrations cannot further reduce this value. The false-positive rate does, however, continue to rise. Although the rate of increase slows, there is still a non-trivial increase resulting from subsequent administrations, with an increase of more than 2% occurring after the seventh retake.

## Strategies for Controlling False-positive Errors

The results presented so far suggest that there is good reason to consider strategies to control the false-positive error rate that may result from multiple retakes. One important strategy is to limit the number of retakes. Under the conditions represented in Figure 2, with the cut-score labeled 10%, allowing for two retakes will reduce the false-negative error rate to a negligible level. Allowing for additional retakes will substantially increase the false-positive rate without meaningfully changing the false-negative error rate. Policy could permit individual review of exceptional cases, but in general it is difficult to argue in favor of retakes that will primarily act to reduce the validity of the resulting classifications.

Figure 2 suggests another strategy: raising the initial cut-score. Using the more conservative cut-score (labeled 15% in Figure 2) has the effect of substantially increasing the proportion of proficient examinees failing based on the first administration, but after three retakes the false-negative error rate is reduced to 0.7%. With this cut-score, the false-positive rate after three retakes is 26%. Comparing these results with those for the cut-score labeled 10% shows that for a system in which three retakes are allowed, the higher cut-score increases the false-negative error rate by about 0.6 percentage points and reduces the false-positive error rate from 53% to 26%. As the bottom panel in Figure 2 shows, this strategy reduces the overall misclassification rate substantially. Weighted by the 10% and 90% shares of non-proficient and proficient examinees, the change avoids about 2.7 false-positive errors and adds about 0.5 false-negative errors per 100 examinees. If the costs are seen entirely in terms of ultimate misclassification errors (i.e., if the fact that some additional proficient examinees had to test more than once before they were correctly classified is ignored), this strategy will be advantageous under any condition in which the cost of a false-negative error relative to that of a false-positive error is less than about five to one. If protection of the public is considered the highest priority, so that false-positive errors are considered more costly than false-negative errors, this strategy will be even more effective than the rates in the bottom panel of Figure 2 suggest.
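As a rough check on the five-to-one figure, the break-even point can be computed from the rates quoted above. This is a sketch using only the rounded values in the text, so the result is approximate; the variable names are illustrative.

```python
PROP_NONPROFICIENT = 0.10  # share of non-proficient examinees in the population

# Rates after three retakes, as reported in the text.
fp_low_cut, fp_high_cut = 0.53, 0.26  # cut-scores labeled 10% and 15%
fn_increase = 0.006                   # "about 0.6%" rise in the false-negative rate

# Express both changes as proportions of all examinees.
fp_avoided = (fp_low_cut - fp_high_cut) * PROP_NONPROFICIENT
fn_added = fn_increase * (1.0 - PROP_NONPROFICIENT)

# The higher cut-score lowers total cost while
#   cost_fn * fn_added < cost_fp * fp_avoided,
# so the break-even cost ratio is fp_avoided / fn_added.
break_even_ratio = fp_avoided / fn_added
print(round(break_even_ratio, 1))  # 5.0
```

Each additional false-negative error is offset by about five avoided false-positive errors, which is where the five-to-one threshold comes from.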

Millman^{4} suggested two alternative strategies for controlling the false-positive error rate across retakes: (1) the classification for examinees with retakes could be based on an average rather than a single test score or (2) the cut-score for retakes could be raised. For both of these approaches, false-positive rates were determined for the conditions presented in Figure 2, using a 10% cut-score. Three variations on Millman's first strategy were examined: one in which the classification was based on the average of all scores for an examinee, one in which decisions were based on the average of the last two scores, and one in which they were based on an average of the two best available scores. The associated false-positive rates after a maximum of three retakes were 33%, 40%, and 44%, respectively. The corresponding false-negative rates were 0.4%, 0.1%, and 0.1%.
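The averaging rules can be explored with a small Monte Carlo simulation. This is a sketch under assumptions the text does not spell out: the first attempt is judged on its single score, each later attempt is judged on the stated average of the scores available so far, and observed scores are standardized to unit variance. It is not expected to reproduce the reported percentages exactly; the function and rule names are illustrative.

```python
import math
import random
from statistics import NormalDist


def simulate_false_positive(rule, reliability=0.90, prop_nonproficient=0.10,
                            fail_rate=0.10, max_attempts=4,
                            n_examinees=50_000, seed=0):
    """Estimate the false-positive rate under a retake scoring rule.

    `rule` maps the list of scores earned so far to the value that is
    compared with the cut-score.
    """
    rng = random.Random(seed)
    sd_true = math.sqrt(reliability)
    sd_error = math.sqrt(1.0 - reliability)
    true_cut = NormalDist(0.0, sd_true).inv_cdf(prop_nonproficient)
    score_cut = NormalDist().inv_cdf(fail_rate)
    nonproficient = passed = 0
    for _ in range(n_examinees):
        true_score = rng.gauss(0.0, sd_true)
        if true_score >= true_cut:
            continue  # only non-proficient examinees can be false positives
        nonproficient += 1
        scores = []
        for attempt in range(max_attempts):
            scores.append(true_score + rng.gauss(0.0, sd_error))
            if rule(scores) >= score_cut:
                passed += 1
                break  # a pass is final; no further attempts
    return passed / nonproficient


def latest_score(scores):
    return scores[-1]


def mean_of_all(scores):
    return sum(scores) / len(scores)


def mean_of_last_two(scores):
    recent = scores[-2:]
    return sum(recent) / len(recent)


def mean_of_best_two(scores):
    best = sorted(scores)[-2:]
    return sum(best) / len(best)


rules = {
    "latest score": latest_score,
    "mean of all": mean_of_all,
    "mean of last two": mean_of_last_two,
    "mean of best two": mean_of_best_two,
}
results = {name: simulate_false_positive(rule) for name, rule in rules.items()}
for name, rate in results.items():
    print(f"{name}: {rate:.2f}")
```

Averaging lowers the false-positive rate because the mean of several scores carries less measurement error than any single score, so a non-proficient examinee is less likely to clear the cut-score by chance.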

Three variations on Millman's second strategy were examined by raising the cut-score by .25 SDs, for the first retake, for the first two retakes, or for each retake. After three retakes, the associated false-positive rates were 37%, 30%, and 29%, respectively. The corresponding false-negative rates were 0.3%, 0.8%, and 0.9%.
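The effect of attempt-specific cut-scores can be sketched by letting each administration have its own threshold. The sketch assumes that the .25 SD increase is in observed-score standard deviation units and is a fixed (not cumulative) increment, and that observed scores have unit variance; the function name `false_positive_rate` and the policy labels are illustrative.

```python
import math
from statistics import NormalDist


def false_positive_rate(cut_scores, reliability=0.90, prop_nonproficient=0.10,
                        grid=4000):
    """False-positive rate when attempt k is judged against cut_scores[k]."""
    std_normal = NormalDist()
    sd_true = math.sqrt(reliability)
    sd_error = math.sqrt(1.0 - reliability)
    true_dist = NormalDist(0.0, sd_true)
    true_cut = true_dist.inv_cdf(prop_nonproficient)
    lo = -8.0 * sd_true
    step = (true_cut - lo) / grid  # integrate over non-proficient true scores only
    mass = passed = 0.0
    for i in range(grid):
        t = lo + (i + 0.5) * step
        weight = true_dist.pdf(t) * step
        p_fail_all = 1.0
        for cut in cut_scores:
            # Independent errors: multiply the failure probabilities per attempt.
            p_fail_all *= std_normal.cdf((cut - t) / sd_error)
        mass += weight
        passed += weight * (1.0 - p_fail_all)
    return passed / mass


base_cut = NormalDist().inv_cdf(0.10)  # fails 10% of first-time takers
bump = 0.25                            # assumed: observed-score SD units
policies = {
    "no change": [base_cut] * 4,
    "first retake": [base_cut, base_cut + bump, base_cut, base_cut],
    "first two retakes": [base_cut, base_cut + bump, base_cut + bump, base_cut],
    "each retake": [base_cut] + [base_cut + bump] * 3,
}
rates = {name: false_positive_rate(cuts) for name, cuts in policies.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Under these assumptions, each additional raised cut-score lowers the cumulative false-positive rate, with smaller gains from the later retakes because fewer examinees remain to take them.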

The preferred strategy will depend on a number of considerations. Although a reduction in the false-positive rate will generally be accompanied by an increase in the false-negative rate, the total cost of errors will vary with the relative costs of these two types of errors. Additionally, policy considerations will come into play. Even in a circumstance in which one of the strategies proposed by Millman may appear optimal in reducing the cost of classification errors, applying different rules to retakes may be judged unacceptable. Although some strategies to control classification error may meet with political resistance, it should be noted that approaches such as limiting the number of retakes are currently implemented for some testing programs and have apparently survived (or avoided) political and legal challenge. For example, many jurisdictions currently limit the number of retakes permitted for the United States Medical Licensing Examination.^{5}

## Conclusion

The primary purpose of this paper has been to highlight the impact on classification errors of allowing failing examinees to retake licensure and certification examinations. One general conclusion that can be drawn is that to properly evaluate a testing procedure used for classification, the performance of the procedure across retakers must be considered.

Professional test-development organizations routinely invest significant resources in the production of highly reliable examinations. However, the precision that results from these efforts is systematically undermined when non-proficient examinees have multiple opportunities to take the examination. This paper has examined approaches that may help to control the false-positive rate. No single approach can be recommended as optimal for every setting.

Two features of the analysis presented in this paper should be highlighted. Throughout the paper, modeling has been based on classical test theory assumptions applied to a normally distributed examinee population. It is unlikely that any realistic, alternative assumptions would lead to significant differences in the general conclusions. It should not, however, be assumed that these results are directly applicable to conditions in which the assumptions are not viable.

Additionally, throughout the paper the discussion of misclassification errors and associated costs has treated proficiency as though it were categorical. That is, there has been no differentiation between the case in which a nearly proficient examinee is misclassified as proficient and the case where an examinee of much lower proficiency is misclassified as proficient. Including this distinction in the analysis makes the loss function associated with misclassification errors considerably more complex. Both Cronbach and Gleser^{1} and Wald^{2} provide sophisticated discussions of this issue. Again, it is unlikely that in practical settings this factor will substantially alter the interpretation of the general results presented in this paper. In this context, it should also be noted that the compounding effects of measurement errors resulting from retakes may be similarly present when a test is used to produce a score rather than a classification. If examinees are permitted to retake an examination but only their highest scores are recorded, measurement error may result in a systematically biased estimate of proficiency.
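The bias from recording only the highest score can be illustrated with a short simulation. It assumes, for illustration only, that every examinee takes the same number of attempts, that proficiency is fixed, and that observed scores have unit variance so that error variance is one minus the reliability; the function name is hypothetical.

```python
import math
import random


def mean_recorded_error(n_attempts, reliability=0.90, n_examinees=50_000, seed=0):
    """Average (recorded score - true score) when only the best score is kept."""
    rng = random.Random(seed)
    sd_error = math.sqrt(1.0 - reliability)
    total = 0.0
    for _ in range(n_examinees):
        # With fixed proficiency, the best score equals the true score
        # plus the largest of the attempt-level errors.
        total += max(rng.gauss(0.0, sd_error) for _ in range(n_attempts))
    return total / n_examinees


for attempts in (1, 2, 3, 4):
    print(attempts, round(mean_recorded_error(attempts), 3))
```

With a single attempt the average error is close to zero; with more attempts the recorded score overstates proficiency on average, because the maximum of several errors has a positive expected value.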

## References

Research in Medical Education: Proceedings of the Fortieth Annual Conference. November 4–7, 2001.