Examinee Performance on Computer-based Case Simulations as Part of the USMLE Step 3 Examination: Are Examinees Ordering Dangerous Actions? : Academic Medicine



Editor(s): Norcini, John J. PhD

Computer-based case simulations (CCSs) have been used in addition to fixed-format items in USMLE Step 3 since November 1999. The ability to assess physicians' patient-management skills with simulation has added a new dimension to the physician licensure assessment. Each patient management case begins with an opening scenario describing the patient's location and presentation. Using free-text entry, the examinee orders tests, treatments, and consultations and selects physical examinations from a list of options while advancing the case through simulated time. Within the dynamic simulation framework, the patient's condition changes based on both the actions taken by the examinee and the underlying problem. More extensive detail regarding the CCS format1 and scoring2,3 may be found elsewhere.

The CCS free-text entry format and availability of over 2,500 unique clinical orders make it possible to assess the extent to which examinees make dangerous management errors. Although this potential has been touted as a significant advantage of the CCS format,1 no study has examined this behavior in the CCS context. Previous research4,5,6,7 on the selection of dangerous actions has, however, been conducted using written medical certification examinations. A driving force behind these studies was to provide insight into whether examinees' scores should be based, in part, on their propensity to select dangerous answers. The findings did not support such action. One study4 found dangerous actions to be so highly correlated with total test score that accounting for these actions would not provide additional information regarding examinee proficiency. In a second study,6 examinee performance, as measured by an oral examination and report by a clinical competency committee, indicated that selecting a disproportionately high number of dangerous answers on the written examination did not predict dangerous clinical thinking or behavior.

These studies have provided useful information regarding the extent to which different groups of candidates select dangerous actions. However, one limitation of examining this behavior in written examinations is that fixed-format items constrain the list of answer options. In contrast, examinees construct their own responses to CCS cases, which makes the potential to measure dangerous actions essentially unlimited.

The purpose of this research was to focus on dangerous interventions ordered by examinees while managing CCS cases. Specifically, it was of interest (1) to quantify the extent to which licensure candidates managing CCS cases order dangerous nonindicated interventions; (2) to understand how these types of examinee behavior relate to other measures produced in CCS scoring; and (3) to understand the nature of dangerous interventions ordered and their relationships to case content.


To examine the first question, “To what extent are examinees ordering inappropriate/intrusive nonindicated actions?,” examinee responses were selected from 11 CCS cases administered during the first 18 months of implementation of the computerized Step 3 examination. This sample included more than 76,000 examinee-case interactions, or an average of nearly 7,000 examinees per case. The system that supports scoring for CCS recognizes over 12,000 abbreviations, brand names, and other terms representing more than 2,500 unique actions. Within this system is a list of all nonindicated actions an examinee might order. Nonindicated actions are categorized into nine levels, with the first category representing the most benign actions and the ninth representing actions associated with a high degree of intrusiveness and much greater probability of morbidity or mortality. A committee of content experts determines which actions fall into each of the nine categories given the nature of the case. Orders for urine or sputum tests are examples of the inappropriate behaviors represented in category one, whereas orders for mastectomy, cardiothoracic surgery, and intracranial surgery are examples of intrusive behaviors that fall into categories seven, eight, and nine, respectively (in cases where these tests or treatments are not justified by the underlying problem presented in the case). To quantify the extent to which examinees were ordering dangerous actions, counts of the numbers of different actions ordered by examinees in these nine categories were captured for each of the 11 cases.

In addition to quantifying the numbers of potentially dangerous actions ordered, the second research question pertained to the extent to which this aspect of examinee behavior was stable across cases, and the relationship between ordering these and other more appropriate actions. For this purpose, four measures were included that quantified the extent and timeliness with which appropriate actions were ordered. The first represents actions considered essential for adequate care; the second represents actions considered important for optimal care; the third includes actions considered desirable for optimal care; and the fourth reflects the time frame in which the essential actions were completed.

Three measures were used to represent categories of nonindicated actions at various levels of intrusiveness. The first category reflects actions that are inappropriate, but relatively nonintrusive and associated with minimal patient risk; the second reflects actions that are relatively intrusive or carry some risk; the third reflects the extent to which intrusive, relatively risky actions are ordered. The scores for these categories are based on counts of actions falling into the nine levels described previously: category one is the sum of actions ordered in levels one to three; category two is the sum of actions ordered in levels four to six; and category three is the sum of actions ordered in levels seven through nine. The three nonindicated-action categories and four appropriate-action categories considered in this study are consistent with the description of CCS scoring detailed in previous papers.2
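The collapsing of the nine intrusiveness levels into three scoring categories can be illustrated with a minimal sketch. The function name and the sample counts are hypothetical and are not taken from the CCS scoring system; only the level-to-category mapping (levels 1 to 3, 4 to 6, and 7 to 9) comes from the description above.

```python
def categorize(level_counts):
    """Collapse counts of nonindicated actions by level (1-9) into three categories.

    level_counts maps an intrusiveness level to the number of distinct
    nonindicated actions an examinee ordered at that level on one case.
    Category 1 = levels 1-3, category 2 = levels 4-6, category 3 = levels 7-9.
    """
    totals = {1: 0, 2: 0, 3: 0}
    for level, count in level_counts.items():
        if not 1 <= level <= 9:
            raise ValueError(f"level must be between 1 and 9, got {level}")
        category = (level - 1) // 3 + 1  # integer division groups levels in threes
        totals[category] += count
    return totals


# Hypothetical examinee: three level-2 actions, one level-4, one level-7.
example = {2: 3, 4: 1, 7: 1}
print(categorize(example))  # {1: 3, 2: 1, 3: 1}
```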

To examine the relationships among these seven measures, three data sets were analyzed in which examinees had taken test forms that shared four common cases. (Although examinees in each group saw the same four cases, the cases contained in each set varied across the three groups). For each data set, internal consistency was estimated for all seven measures and correlations between individual measures were calculated. Correlations and reliability estimates were then averaged across the three data sets. Finally, true-score correlations were estimated by disattenuating for unreliability. Basing the estimates on blocks of common cases was necessary to produce a crossed design for analysis. However, because the results based on four cases are somewhat arbitrary, for reporting purposes the reliabilities were used to estimate the mean intercase correlation within each measure (i.e., the reliability of the measure based on the score from a single case).
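The two adjustments described above can be sketched as follows. This is an illustration under stated assumptions: the text does not name the formulas used, so the standard correction for attenuation and the Spearman-Brown relationship are assumed here, and the numeric inputs are invented for the example rather than taken from the study.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Estimate the true-score correlation between two measures.

    Standard correction for attenuation: the observed correlation divided by
    the square root of the product of the two reliabilities. With noisy
    reliability estimates the result can fall outside [-1, 1].
    """
    if rel_x <= 0 or rel_y <= 0:
        raise ValueError("reliabilities must be positive")
    return r_xy / math.sqrt(rel_x * rel_y)


def single_case_reliability(rel_k, k):
    """Step a k-case reliability down to one case (inverse Spearman-Brown).

    Under the Spearman-Brown assumptions this equals the mean intercase
    correlation for the measure.
    """
    return rel_k / (k - (k - 1) * rel_k)


# Illustrative values only (not from the study).
print(disattenuate(0.30, 0.50, 0.72))    # approximately 0.5
print(single_case_reliability(0.50, 4))  # approximately 0.2
```

Stepping the four-case reliability down to a single case removes the dependence on the arbitrary block length, which is why the single-case value is the one reported.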

To explore the third research question, a team of content and test-development specialists examined the relationships between clinical case characteristics and the numbers and severities of nonindicated actions requested by the examinees. The purpose was to understand the nature of dangerous nonindicated behaviors ordered and to make inferences about case characteristics that might be associated with differing frequencies of intrusive actions ordered by examinees. Particular attention was given to the infrequently ordered but highly-intrusive actions in categories seven through nine.


The proportions of examinees performing at least one nonindicated action ranged from 19% to 70% across the 11 cases, with an average of 45%. Examinees who ordered nonindicated actions performed, on average, 1.5 such actions per case. The maximum numbers of distinct nonindicated actions performed by an examinee on any given case ranged from 4 to 11 depending on the case (mean = 8.5, SD = 0.87).

The average number of nonindicated actions ordered by examinees in each of the nine categories was calculated for each case and then averaged across the 11 cases. This information is provided in Table 1. The column labeled “mean” represents the average number of actions ordered by a single examinee in managing a case. The column labeled “minimum” represents the average for the case with the lowest mean number of actions ordered in that respective category. “Maximum” is the equivalent value for the case with the highest mean. Since the average number of actions ordered was less than one in each of the nine categories, these results are more easily interpreted by translating the results into one action ordered per n examinees. For example, the average number of actions ordered in level-3 (0.10) could be interpreted as one action ordered for every ten examinees, rather than examinees' ordering an average of one-tenth of an action.
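The conversion described above is simply the reciprocal of the mean. A small sketch, with a hypothetical function name, using the level-3 value quoted in the text:

```python
def examinees_per_action(mean_actions):
    """Convert a mean number of actions per examinee into 'one action per n examinees'."""
    if mean_actions <= 0:
        raise ValueError("mean_actions must be positive")
    return 1.0 / mean_actions


# Level-3 mean reported in the text: 0.10 actions per examinee.
print(round(examinees_per_action(0.10)))  # 10
```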

Table 1. Nonindicated Actions Ordered in Computer-based Case Simulations by Individual Examinees According to Level of Intrusiveness (Averaged Across 11 Cases)

Not surprisingly, and fortunately for medical consumers, the actions that are ordered most frequently (level 2) are those that are inappropriate, but relatively nonintrusive and associated with little risk. Those that are most intrusive (category 9) were ordered only 25 times across the more than 76,000 case interactions studied. Similarly, those in category 8 were ordered only 22 times. The number of risky, nonindicated actions was substantially higher in category 7, where there were 671 actions, approximately half of which (354) were ordered for a single case. Among the examinees ordering actions in categories 7, 8, and 9, the proportions who passed Step 3 on this attempt were 55%, 45%, and 36%, and the proportions who had been unsuccessful at passing Step 3 on a previous attempt were 39%, 50%, and 44%, respectively. Three examinees requested actions falling into two of these three categories on a single case. All had taken Step 3 previously and failed on their current attempts. Although the characteristics of examinees in these three groups do not correspond exactly to those taking Step 3 in a given year, it is important to note that the annual Step 3 failure rate is 25%.
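To put the raw counts above on a common scale, the following sketch expresses them as orders per 10,000 case interactions. It assumes a denominator of exactly 76,000; because the text reports "more than 76,000" interactions, the resulting rates are slight overestimates. The variable and function names are illustrative.

```python
# Denominator assumed to be 76,000; the text says "more than 76,000",
# so the rates below are upper bounds.
TOTAL_INTERACTIONS = 76_000

# Counts of orders in the three most intrusive categories, as reported in the text.
ORDERS = {7: 671, 8: 22, 9: 25}


def rate_per_10k(orders, interactions=TOTAL_INTERACTIONS):
    """Orders per 10,000 examinee-case interactions."""
    return 10_000 * orders / interactions


for category, n in ORDERS.items():
    print(f"category {category}: about {rate_per_10k(n):.1f} orders per 10,000 interactions")

# Share of category-7 orders that came from the single case mentioned in the text.
single_case_share = 354 / 671
print(f"single-case share of category 7: {single_case_share:.0%}")
```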

Table 2 presents the true-score correlations for the seven measures on the off-diagonals and the average intercase correlations on the diagonal. In general, measures based on beneficial actions correlate positively with each other, those based on nonindicated actions correlate positively with each other, and those based on nonindicated actions correlate negatively with beneficial actions. The exception is that the measure representing the least intrusive of the nonindicated actions correlates positively with the two measures that represent the less essential beneficial actions.

Table 2. True-score Correlations* among Subcategories in Computer-based Case Simulations with Internal Consistency (Based on Four Cases) on the Diagonal; Estimates of Subcategories in Last Column

In addition to the quantitative analysis, the qualitative analysis revealed that actions ordered falling into categories 7, 8, and 9 can be grouped into five classes: (1) misdiagnosis resulting in overaggressive diagnostic study or treatment, (2) overaggressive diagnosis/treatment, consistent with a correct diagnosis, (3) suspect entry, where examinees most likely misunderstood the medication name or procedure, (4) other mismanagement where the examinee misdiagnosed the problem or severity of the problem, which resulted, for example, in the patient's being sent home at great risk, and (5) unexplained behaviors. Content experts also determined that examinees ordered dangerous actions in a variety of cases representing both chronic and acute problems, cases with indicated and nonindicated surgical interventions, as well as in outpatient and emergency settings.


The results in Table 1 suggest both that the frequency with which intrusive actions are ordered varies substantially across cases and that examinees managing CCSs order nonindicated actions at a nontrivial frequency. On average, examinees performed at least one nonindicated action in more than 40% of cases. Although the frequencies of nontrivial actions varied substantially across cases, the available data do not appear to relate the frequencies of such actions to other characteristics of the case. For some specific cases, examinees ordered actions associated with relatively high risks at an average rate of more than one action for every ten examinees who managed the case. For other cases, actions associated with this level of risk were essentially never ordered, and even actions associated with minimal risk were ordered at a rate of only about one order for every ten examinees.

The results reported in Table 2 provide interesting insights into examinee performance on CCSs. The perfect negative correlation between the three variables representing the most important beneficial actions (“timing,” “essential beneficial actions,” and “important beneficial actions”) and the variable representing the most dangerous of the non-indicated actions suggests that these very dangerous actions are typically performed by individuals who are off track and have completely missed the diagnosis. This theory is also supported by previous research,5 which has suggested that selection of dangerous actions results from a lack of knowledge.

The one unexpected result from Table 2 is the positive correlation between the variable representing the least intrusive nonindicated actions and those representing the two levels of less essential beneficial actions. The fact that less essential beneficial actions correlate positively with this count of nonindicated actions while the variable representing essential actions correlates negatively suggests that this positive relationship is not the result of a shotgun approach to diagnostic testing which occasionally hits the target. Because the relationship exists only for the less essential beneficial actions, a more likely explanation is that the relationship results from a tendency of some examinees to be more thorough in their testing and treatment (or at least more active). Thoroughness could result in a greater proportion of the nonessential beneficial actions' being taken, but because these examinees may lack complete mastery of the content, they are also performing greater numbers of actions that are completely inappropriate in the context of the specific case.

The results of this study suggest that the frequency with which nonindicated actions are ordered can provide important information about an examinee's proficiency. A large proportion of examinees ordering the most dangerous and intrusive actions had difficulty passing Step 3. This finding is not surprising given the negative correlation between dangerous actions and total test score found in previous research.4 Furthermore, the regression-based scoring rubric used for the CCS decreases the potential for examinees who make dangerous management mistakes to perform well on Step 3, because the negative weight associated with the nonindicated-action categories penalizes examinees for ordering nonindicated actions.

Establishing validity is critical for drawing appropriate inferences from test scores. With regard to dangerous actions, previous research5 has not established a relationship between scores based on these behaviors and other measures of clinical reasoning assessed during residency. This lack of concurrent validity could, however, be the result of using fixed-format item types that restrict the number of options available. In addition to concurrent validity, predictive validity is also important to consider. The evidence reported in this paper does not provide a basis for this type of inference, but it does confirm that when examinees are given the opportunity to order inappropriate tests and treatments, some will. Additional research is still needed to determine the extent to which this information provides evidence about examinee behavior in practice.


1. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations from medicine: assessing skills in patient management. In: Tekian A, McGuire CH, McGaghie WC (eds). Innovative Simulations for Assessing Professional Competence. Chicago, IL: University of Illinois, Department of Medical Education, 1999:29–41.
2. Clauser BE, Margolis MJ, Clyman SG, Ross LP. Development of automated scoring algorithms for complex performance assessments: a comparison of two approaches. J Educ Meas. 1997;34:141–61.
3. Clyman SG, Melnick DE, Clauser BE. Computer-based case simulations. In: Mancall EL, Bashook PG (eds). Assessing Clinical Reasoning: The Oral Examination and Alternative Methods. Evanston, IL: American Board of Medical Specialties, 1995:139–49.
4. Grosse ME. Scores based on dangerous responses to multiple-choice items. Eval Health Prof. 1987;9:459–66.
5. Mankin HJ, Lloyd JS, Rovinelli RJ. Pilot study using “dangerous answers” as scoring technique on certifying examinations. J Med Educ. 1987;62:621–4.
6. Slogoff S, Hughes FP. Validity of scoring “dangerous actions” on a written certification examination. J Med Educ. 1987;62:625–31.
7. Kremer BK, Mankin HJ. A follow-up study of “dangerous answers” in four medical specialties. Eval Health Prof. 1990;13:489–503.
© 2002 by the Association of American Medical Colleges