Boulet, John R. PhD*; Murray, David MD†; Kras, Joseph MD†; Woodhouse, Julie RN†
Today, patient safety initiatives demand that physicians are able to recognize and manage specific conditions, especially those where delayed diagnosis and inadequate treatment contribute to adverse patient outcomes.1,2 Ultimately, physicians must be assessed on skills that are directly relevant to patient care.3–5 Historically, evaluations were targeted primarily at medical knowledge and the application of this knowledge to patient care activities. Despite the limited fidelity and sometimes questionable generalizability of these assessments, scoring systems are well developed, and performance standards are relatively easy to establish. In contrast, the complex skill sets required in clinical settings are often difficult to evaluate, and normally require the input of multiple experts to develop meaningful, evidence-based, performance criteria. Recently, with improvements in assessment methodologies and measurement techniques, other patient care domains (eg, doctor-patient communication) are being evaluated. Performance-based assessments designed to measure these domains, typically involving standardized patients (SPs), are now included as part of high-stakes certification and licensure examinations.6 If the measurement rigor that went into the development and validation of these types of assessments could be extended to other simulation-based modalities, including those involving part-task trainers and mannequins, a variety of the more complex and interrelated skills required in advanced practice settings could also be evaluated. More importantly, provided that performance standards and associated cut scores could be derived, meaningful competency judgments could be made.
When simulation-based tasks are incorporated in assessment activities, it is often necessary to set standards. For situations where the evaluation is used for formative training, or simply rank-ordering candidates to identify the lowest and highest scoring individuals, a normative framework can be used. For example, one could set a numeric standard where individuals at the 10th percentile or lower, based on their scores, are assigned a failing status. However, in a summative framework, where valid classifications of examinee ability or competence are needed, standards must be set relative to some criteria or defined level of performance. Here, the performance criteria must be established, usually via expert consensus, and then used in combination with the assessment scores to generate a cut point on the score scale that differentiates individuals who are, and who are not, qualified. For example, an assessment could be constructed to evaluate whether a resident needs supplementary training or, at the other end of the spectrum, whether a physician is ready for additional patient care responsibilities. For these types of evaluations, especially those designed to measure skills needed in practice, there is a definite need to determine the score or scores that separate those who possess adequate skills from those who do not. Although prior studies indicate that valid and reliable decisions about examinee or candidate proficiency can be made for performance-based assessments,7,8 these investigations were based primarily on examinations that used SPs. Compared with SP-based encounters, which are typically modeled on common patient presentations, individual mannequin-based simulation scenarios tend to be more specialty-specific and shaped on less frequently occurring events. In addition, because timely, sequential, patient management strategies are often necessary, they can also be more difficult to score. 
Nevertheless, given the comparable fidelity and structure of both types of simulations, the application of previously validated SP-based standard setting protocols to mannequin-based evaluations, although yet untested, and certainly dependent on the validity of the scores, should, if properly implemented, yield defensible cut scores.
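The distinction between a normative and a criterion-referenced standard can be illustrated with a minimal sketch. The scores, the function names, and the nearest-rank percentile definition below are illustrative assumptions and are not taken from the study.

```python
import math

def normative_cut(scores, percentile=10):
    """Nearest-rank percentile: the smallest observed score such that at
    least `percentile` percent of scores fall at or below it (assumed rule)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]

def classify_normative(scores, percentile=10):
    """Fail everyone at or below the chosen percentile of the group."""
    cut = normative_cut(scores, percentile)
    return ["fail" if s <= cut else "pass" for s in scores]

def classify_criterion(scores, cut_score):
    """Fail everyone below a fixed, externally derived cut score."""
    return ["pass" if s >= cut_score else "fail" for s in scores]

# Hypothetical key action totals for 10 examinees.
scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 6]
print(classify_normative(scores))        # failures depend on the group
print(classify_criterion(scores, 4))     # failures depend on the criterion
```

Under the normative rule the number of failures is fixed by the group composition, whereas under the criterion-referenced rule it depends only on where each score falls relative to the cut point.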
There are a variety of criterion-referenced standard setting procedures that can, and have been, used for performance-based assessments.9–11 These methods, which involve setting a standard relative to some predefined performance level (the criterion), can be broadly classified as either test centered or examinee centered. In general, the test-centered methods require subject matter experts to make judgments concerning the expected performance of minimally competent examinees on select tasks. The Angoff procedure, and associated modifications, can be used to set standards on the checklists or key actions typically used for scoring simulated encounters.12 Here, the panelists are required to make a judgment as to the probability of a minimally qualified examinee performing (correctly) the indicated examination maneuver. As an example, based on a typical myocardial ischemia (MI) scenario, one could ask the standard setting panelists what percentage of minimally qualified examinees would (a) request and review the electrocardiogram, (b) administer nitroglycerin or (c) order a beta blocker. If these key actions, among others, are used as the performance measures, the average of the panelists’ judgments over all items can be used as the cut score. The examinee-centered methods can also be used to establish standards for simulated scenarios. Instead of judging actual test materials (eg, checklist content, individual items), the panelists would view a series of examinee performances, usually via audio-video recording, and make judgments concerning examinee proficiency or competence. The task may involve distinguishing qualified from unqualified examinees or simply identifying “borderline” performances. For the former, known as the contrasting group method, the intersection of the distribution of scores of qualified and unqualified examinees can be used to delimit the cut point. 
For the latter, the mean or some other measure of centrality of the scores for the borderline group would define the cut score.13
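As a rough sketch of two of the procedures just described, the following code computes an Angoff-style cut score (the average of panelists' probability judgments over items) and a borderline-group cut score (a measure of centrality of the borderline scores). All ratings, labels, and function names are hypothetical; the use of the median as the default measure of centrality is an assumption.

```python
from statistics import mean, median

def angoff_cut(ratings_by_panelist):
    """ratings_by_panelist: one list per panelist, each holding the judged
    probability (0-1) that a minimally qualified examinee completes each item.
    Returns the cut as a proportion of items, averaged over panelists and items."""
    item_means = [mean(item) for item in zip(*ratings_by_panelist)]
    return mean(item_means)

def borderline_group_cut(scores, labels, center=median):
    """Cut score = central tendency of scores for performances that
    panelists labeled 'borderline'."""
    borderline = [s for s, lab in zip(scores, labels) if lab == "borderline"]
    if not borderline:
        raise ValueError("no borderline performances were identified")
    return center(borderline)

# Hypothetical judgments for three key actions
# (review ECG, administer nitroglycerin, order beta blocker).
angoff_ratings = [
    [0.9, 0.7, 0.5],   # panelist 1
    [0.8, 0.6, 0.4],   # panelist 2
    [1.0, 0.8, 0.6],   # panelist 3
]
print(round(angoff_cut(angoff_ratings), 2))

# Hypothetical key action scores with panelist labels.
scores = [2, 3, 3, 4, 5]
labels = ["unqualified", "borderline", "borderline", "borderline", "qualified"]
print(borderline_group_cut(scores, labels))
```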
The choice of standard setting method to be used for a simulation-based examination will, to some extent, depend on the purpose of assessment and the availability of resources to conduct the exercise. Although a number of standard setting methodologies have been proposed and implemented for performance-based assessments, test-centered methods, where task- or item-specific judgments regarding expected examinee performance are solicited, have been shown to be awkward and time consuming. For performance-based tasks, it is extremely difficult and cumbersome, even for subject matter experts, to estimate the likely performance of a hypothetical minimally qualified examinee. In contrast, examinee-centered methods, which require expert judgment of the adequacy of select performances, have been implemented successfully for SP assessments, yielding consistent, realistic, and defensible standards.7 Here, it is relatively straightforward to have panelists use their expertise to judge actual performances as opposed to the, at times, debatable content of a case checklist or some other evaluation rubric. For these reasons, at least for performance-based examinations with a minimal number of tasks (exercises), the use of examinee-centered standard setting methods is currently favored.
Although summative performance-based assessments have existed for some time in medicine and are currently used as part of certification and licensure processes, they typically involve SP cases and associated, often computer-based, postencounter exercises. The scoring systems are well developed,14 and within a standard setting framework, the task of categorizing performances as adequate/inadequate or qualified/unqualified is relatively straightforward.15 For other simulation modalities (eg, mannequins, part-task trainers), especially for specialty-based assessment, there has been some work undertaken to develop and validate scoring systems, but the task of delineating minimum performance levels and setting numeric assessment standards has not been accomplished.16–19 Before considering the use of these simulation modalities in summative, high-stakes assessments, more detailed investigations of standard setting techniques need to be conducted. Although the methodologies previously adopted for SP assessments are likely to be of great value, mannequin-based scenarios are generally more complex and often demand time-sensitive patient management techniques and/or sequenced actions.20 As a result, it is not clear whether the expert panelist judgments (eg, qualified/unqualified), necessary to estimate the cut scores and judge their quality, will relate to the actual assessment scores. To the extent that they do, it should be possible to set meaningful, defensible, numeric standards. Moreover, as a byproduct of this process, one can procure evidence to support the validity of the scores and assess the quality of the simulation scenarios.21
The purpose of this study was to investigate whether the use of an examinee-centered standard setting method, using expert judgments of the adequacy of patient care for select performance samples, could be used to establish cut scores on a prototype performance-based multiscenario examination of anesthesiology skills.
The examinee-centered standard setting methodology used here requires performance samples (DVDs) covering the ability continuum for each individual scenario, associated scores, and trained panelists to provide expert judgments about the quality of the patient management activities.
Previously, as part of a formative evaluation exercise, several simulation scenarios were developed and tested with first-, second-, and third-year residents (Table 1). Additional details about the structure of the individual simulation scenarios, the psychometric properties of the scores, and how the assessment was administered are available elsewhere.16,20 A brief overview of the steps taken to develop and score these simulation exercises is presented at the top of Figure 1. In short, scenario content was chosen by a faculty panel, scoring systems were developed, piloted, and validated, residents were assessed under standardized conditions, and a comprehensive bank of audio-video recordings and associated scores was produced. The performance samples (audio-visual recordings) for the standard setting exercise were selected from residents who previously participated in the simulation-based assessment. All of the residents had provided written consent to participate in this institutionally approved assessment and to have their performances scored by faculty.
Although a number of scoring modalities have been proposed for simulation-based exercises, case-based key actions, listed in Table 1, were selected based on ease of scoring and their general ability to discriminate between low- and high-ability trainees. These case-based key action rubrics were previously developed by faculty panels and reflect the important actions that, given the patient’s condition, should be completed in each of the simulated encounters. The resident’s score for a given scenario is simply the sum of the key actions completed in the available time period (5 minutes).
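The scoring rule described here, a count of key actions completed within the 5-minute window, can be sketched as follows. The rubric entries and the representation of observations as completion times in seconds are assumptions made for illustration; the actual rubrics are those listed in Table 1.

```python
TIME_LIMIT_SECONDS = 5 * 60  # scenario length stated in the text

def key_action_score(rubric, observed):
    """rubric: iterable of key action names for one scenario.
    observed: dict mapping an action name to the time (in seconds) at which
    the rater credited it. Actions credited after the limit do not count."""
    return sum(
        1
        for action in rubric
        if action in observed and observed[action] <= TIME_LIMIT_SECONDS
    )

# Hypothetical rubric and one hypothetical performance.
rubric = ["review ECG", "give nitroglycerin", "order beta blocker",
          "increase inspired oxygen"]
observed = {"review ECG": 45, "give nitroglycerin": 130,
            "order beta blocker": 320}
print(key_action_score(rubric, observed))
```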
Nineteen panelists were recruited from 2 Midwestern universities to constitute the standard setting panel. Eleven of the 19 panelists received residency training in the United States and were certified by the American Board of Anesthesiology. The other 8 were visiting faculty with a minimum of 5 years of postgraduate training in anesthesia as well as anesthesia certification from a foreign country; the visiting faculty had clinical practice experience and had supervised residents at the teaching hospital for a period ranging from 6 to 18 months. Twelve of the 19 panelists practiced in a large, tertiary care teaching hospital. Two of these 12 had fellowship training and were primarily engaged in specialty practice (obstetric and cardiovascular anesthesia). Seven of the panelists were practicing pediatric anesthesia in a pediatric teaching hospital. As faculty, some of the panelists had trained the residents whose assessment data and audio-video recordings were being used in the standard setting activities. However, none of these panelists was aware of when, in the training cycle (eg, CA-1, CA-2, CA-3), the performance samples were obtained.
Each panelist filled in a demographic and exit survey. Demographic data collected included age, gender, previous simulation experience, and prior standard setting experience. The mean age of the panelists was 42.6 years (min = 29, max = 52). Six (32%) of the panelists were female. All but 3 had previous simulation experience, either as a part of teaching duties, as a participant in some exercise, or in the capacity of developing scenarios. Only one panelist had had previous standard setting experience.
The exit survey was used to collect information on how the panelists viewed the standard setting process and how comfortable they were with providing qualified judgments for the individual scenarios. The panelists were asked to identify any scenarios they found difficult to make expert judgments about and to provide suggestions for making the standard setting process better. They were also asked to estimate the percentage of graduating residents within current postgraduate training programs in the United States who are ready for independent practice.
A general overview of the standard setting process is presented at the bottom of Figure 1. Once the standard setting method was chosen, the panelists were selected and oriented to the purpose of the assessment and the structure of the individual simulation exercises. Next, based on the competency domain (what would be expected of an anesthesiologist who had completed residency and was ready for independent practice), the experts were provided with descriptions of the qualified and unqualified examinee. These descriptions included aspects of practice such as recognizing and treating intraoperative emergencies, developing appropriate anesthetic plans for surgical procedures, and providing consultations for perioperative patient management. The descriptions were discussed and modified based on a collective opinion concerning the appropriate standard of care for a graduating resident entering clinical practice. The process of describing the minimally qualified graduating resident was undertaken to ensure that the standard setting panelists had a common appreciation of the performance level expected for entry into unsupervised practice, and not some other level (ie, board certification).
The rating process for the first scenario (Bronchospasm) was conducted as a single group exercise (see below). For this scenario, the panelists were allowed to briefly discuss, and defend, their judgments after viewing each performance. This conversation was allowed to provide a final opportunity for panelists to normalize their performance expectations with respect to the minimally qualified graduating resident. For the other scenarios, once the panelists were oriented to the nature of the simulation, the judgments were made independently, without discussion.
The panelists were presented with 8 to 10 audio-video recorded performances in random order for each scenario. As part of the protocol, they watched the entire encounter, up to a maximum of 5 minutes. The recordings to review were selected from the performances of 31 residents (CA-1 to CA-3) across 12 simulated scenarios. For each of the 12 scenarios, the performances were selected based on available key action total scores to cover the ability spectrum; 2 to 3 performance samples were selected from each quartile of the score distribution. This resulted in 8 performance samples for most scenarios. Scenarios 8 (Blocked Endotracheal Tube) and 10 (Loss of pipeline oxygen) incorporated 10 and 9 performance samples, respectively.
Although the panelists were told how the examinees were being evaluated (ie, key actions), they were not provided with the rubrics (Table 1) or the scores. For each viewed encounter, the panelists were told to determine, based on the consensus definition of the qualified examinee presented and modified earlier, whether the performance typified someone who meets these criteria. Here, they were free to embrace and internally weight any criteria they thought relevant to the decision-making process (eg, correct choice of therapy, proper technique, timing, sequencing, efficiency). Each panelist made an independent judgment of the quality of the resident’s performance on a 0 (not qualified)/1 (qualified) scale. This process was repeated for all performances and for each scenario.
Initially, all panelists reviewed scenario 1 (Bronchospasm) and provided qualified/not qualified ratings. Here, 152 judgments were provided (19 panelists × 8 viewed performances). Following the exercise for this scenario, the panelists were divided into 3 subpanels comprising 6, 6, and 7 members, respectively. Each subpanel provided independent judgments for select scenarios (n = 6) (Table 2). Excluding the first scenario, where all the panelists provided judgments, 4 scenarios were reviewed by 2 of the subpanels. The overlap, by subpanel, in materials viewed was incorporated so that data could be secured to judge the consistency of the standards for independent groups of experts.
For each scenario, the panelists’ ratings (1 = qualified, 0 = not qualified) were averaged for each of the performance samples that were evaluated. For example, if 3 panelists indicated the performance sample represented someone who was qualified, and 3 suggested that it represented someone who was “unqualified,” the average value would be 50% (or 0.5, expressed as a proportion). Overall, the 19 panelists made a total of 933 judgments. For each scenario, a minimum of 8 performance samples were reviewed. The summarized scenario data (average of the panelists’ ratings for each audio-visual performance sample) consisted of 97 observations (8 for most scenarios, 10 for scenario 8, and 9 for scenario 10).
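The aggregation step can be sketched in a few lines. The sample identifiers and ratings below are hypothetical; only the 0/1 coding and the averaging rule come from the text.

```python
def proportion_qualified(ratings):
    """ratings: dict mapping a performance sample id to the list of 0/1
    judgments (1 = qualified) made by the panelists who viewed it."""
    return {sample: sum(votes) / len(votes) for sample, votes in ratings.items()}

# Hypothetical judgments from a 6-member subpanel.
ratings = {
    "sample_A": [1, 1, 1, 0, 0, 0],   # evenly split panel
    "sample_B": [1, 1, 1, 1, 1, 0],
    "sample_C": [0, 0, 0, 0, 0, 0],
}
summary = proportion_qualified(ratings)
print(summary)
```

Each value in `summary` corresponds to one observation in the scenario-level data set, so a scenario with 8 viewed performances contributes 8 proportions regardless of how many panelists rated it.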
For each scenario, these proportion values (8–10 per scenario) were plotted against the key action scores. Regression techniques were then used to establish the best linear fit between the expert judgments (proportion of panelists indicating the performance represented someone who was qualified) and achievement (key action score associated with the individual performances). For each scenario, the magnitude of this relationship was summarized via the R2 statistic (variance explained). Based on the regression models, a reasonable pass/fail standard for a given exercise is the point on the score scale where there is maximum disagreement among the panelists (ie, where 50% of the panelists indicate that the performance is indicative of someone who is qualified). The point of maximum disagreement lies at the intersection of the two key action score distributions that could be generated based on the panelists’ qualified/unqualified judgments.* For those scenarios rated by multiple panels, the SEM of the standard was calculated. The SEM of the standard provides a measure of the consistency of the cut scores. The standard, plus or minus one SEM, yields a 68% confidence interval for the “true” cut score. Descriptive statistics were used to summarize the panelists’ exit survey responses.
Based on a regression of the aggregate judgments (proportion adequate) on the scenario key action scores, the overall R2 (variance explained) was 0.48, indicating a moderately strong and positive relationship between the expert panelists’ qualified judgments and the actual scores the residents had obtained in the assessment. Because the simulation scenarios are content specific and of varying difficulty, it was necessary to set a standard for each separately. Therefore, the regression of panelist judgments on obtained scores was completed for each scenario. Here, degrees of freedom varied as a function of the number of videotapes reviewed (n = 8–10). The results of the regression analyses, including the estimated standards, are presented in Table 3. The estimated standard was derived by substituting 0.5 (point where panelists are evenly split regarding adequacy of performance) in the regression equation and calculating the corresponding key action score.
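The derivation of a scenario standard can be sketched as an ordinary least squares fit of the proportion of qualified judgments on the key action score, followed by solving the fitted line for a proportion of 0.5. The data points are hypothetical, and refusing to return a cut score when the slope is not positive is an assumed safeguard rather than a rule stated in the study.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y on x. Returns (slope, intercept, r2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("key action scores do not vary; no line can be fitted")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r2 = (sxy ** 2) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2

def cut_score(slope, intercept, target=0.5):
    """Key action score at which the fitted proportion equals `target`."""
    if slope <= 0:
        raise ValueError("no positive relationship; cut score not meaningful")
    return (target - intercept) / slope

# Hypothetical scenario: 8 performance samples.
key_action_scores = [1, 2, 2, 3, 3, 4, 4, 5]
prop_qualified = [0.0, 0.2, 0.3, 0.5, 0.6, 0.8, 0.9, 1.0]

slope, intercept, r2 = fit_line(key_action_scores, prop_qualified)
print(round(r2, 2), round(cut_score(slope, intercept), 2))
```

Because only 8 to 10 points enter each fit, a single atypical performance sample can shift both the R2 value and the resulting cut score noticeably.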
For scenarios 1 (bronchospasm), 2 (anaphylaxis) and 3 (unstable ventricular tachycardia), as evidenced by the low R2 values, the regression-based modeling yielded a poor fit (Table 3). For these 3 scenarios, there was little or no relationship between the summary panel judgments and the simulation scores. To illustrate this, a scatterplot of the aggregate adequacy judgments versus the key action scores for the unstable ventricular tachycardia scenario is provided in Figure 2. For the bronchospasm scenario, based on 152 judgments (19 panelists × 8 videotapes), the mean key action score for those performances judged to be qualified was 3.72 (of 4). For performances judged to be unqualified, the mean key action score was 4.0. For both the anaphylaxis and unstable ventricular tachycardia scenarios, similar patterns were found. Here, there were small differences in mean key action scores as a function of qualified/not qualified judgments. Moreover, the mean for the not qualified judgments exceeded that for the qualified ones. Because the fit of the model for scenarios 1, 2, and 3 was so poor, meaningful cut scores could not be calculated.
For the remaining 9 scenarios (4 through 12), there were reasonably strong relationships between the summary adequacy judgments and the key action scores. Based on shared variance between the judgments and the scores, the relationship was strongest for the loss of pipeline oxygen (R2 = 0.94) and hyperkalemia (R2 = 0.92) scenarios and weakest for the acute hemorrhage (R2 = 0.55) and right bronchial intubation (R2 = 0.45) scenarios. A scatterplot of the aggregate adequacy judgments versus the key action scores for the myocardial ischemia (MI) scenario is provided in Figure 3. Here, in contrast to the unstable ventricular tachycardia scenario (Fig. 2), as the key action score increases, the proportion of judges providing qualified judgments also increases. The regression line, representing the best linear fit between aggregate panelists’ judgments of the MI performances and resident scores, is also provided on Figure 3. Visually, to determine the cut score, one can simply draw a horizontal line from the point on the y axis representing 50% panelist agreement to the intersection of the regression line. The point on the x axis, directly below, is the cut score for the MI scenario.
Detailed Analysis of Scenario Scores
To investigate why the regression-based modeling yielded a poor fit for 3 of the 12 scenarios, we performed a detailed analysis of item and resident performance for each of the chosen samples. For the bronchospasm scenario, it was clear that the distribution of performances was highly negatively skewed; 6 of the residents had perfect scores (4 of 4 key actions credited) and individual item performance (P values) for the 4 key actions ranged from 0.88 to 1.0 (eg, all candidates received credit for listening to the chest). The mean performance across the 8 samples was 3.75 (93.8%) with a standard deviation of 0.70.
For the anaphylaxis scenario there was only minimal spread in the summed key action scores across the 8 viewed encounters (mean = 4.13, min = 3, max = 5). In terms of the specific items, auscultate lungs and administer epinephrine were always credited. None of the residents was credited for stopping cefazolin. Interestingly, for 2 of the 3 remaining items, the key action scores were negatively correlated with the aggregate judgments of adequacy (r = −0.23 for increase inspired oxygen saturation, r = −0.10 for check blood pressure). This indicates that residents who received credit for these particular items were less likely to be judged to be qualified by the experts.
The score spread for the unstable ventricular tachycardia scenario was also restricted (mean = 3.12, min = 2, max = 4). For all 8 performances, each resident was credited for stating the diagnosis and delivering shock. None of these residents was credited for delivering synchronized cardioversion. Of the remaining key action items, increase inspired oxygen concentration was positively associated with the aggregate adequacy judgment (r = 0.28) and give/request antiarrhythmic was negatively associated with the aggregate adequacy judgment (r = −0.37).
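Item-level associations of the kind reported for the anaphylaxis and ventricular tachycardia scenarios can be sketched as a Pearson correlation between a 0/1 item credit and the aggregate adequacy judgment. The data are hypothetical, and returning `None` for items with no variance is an assumed convention; it mirrors the fact that no correlation can be computed for key actions that were always or never credited.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation; returns None when either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        return None
    return sxy / sqrt(sxx * syy)

# Hypothetical data for 8 performance samples.
aggregate_judgment = [0.2, 0.3, 0.5, 0.5, 0.6, 0.7, 0.8, 0.9]
item_credit = {
    "always credited item": [1, 1, 1, 1, 1, 1, 1, 1],
    "discriminating item": [0, 0, 0, 1, 1, 1, 1, 1],
    "inversely related item": [1, 1, 1, 0, 1, 0, 0, 0],
}
for name, credit in item_credit.items():
    print(name, pearson_r(credit, aggregate_judgment))
```

An item with a negative correlation is a candidate for review, since crediting it lowers, rather than raises, the likelihood that experts regard the performance as qualified.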
Comparison of Standards by Panelist Group
For 4 of the scenarios, excluding scenario 1, multiple panels independently provided qualified/unqualified judgments (Table 2). Therefore, it was possible to calculate, and compare, standards for individual scenarios by subpanel. Because all panelists completed the rating process together for scenario 1, and the combined panel ratings for scenario 2 yielded a poor fit, they were not included in the individual subpanel analyses.
For scenario 4 (MI), based on the number of key items attained, the standard derived from the subpanel 2 judgments was 2.72 (R2 = 0.80). The standard derived from the subpanel 3 judgments was 3.10, a difference of 0.38. The SEM for the standard was 0.19. For scenario 6 (tension pneumothorax), the individual subpanel standards were 3.78 and 3.88, yielding an SEM for this standard of 0.05. For scenario 12 (acute hemorrhage), the individual subpanel standards were 2.32 and 2.66, resulting in an SEM of the standard of 0.17.
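The reported SEM values are consistent with taking the standard deviation of the subpanel cut scores and dividing by the square root of the number of subpanels. That formula is an assumption, since it is not stated explicitly, but the sketch below reproduces the three values given above from the subpanel standards.

```python
from math import sqrt
from statistics import stdev

def sem_of_standard(subpanel_cuts):
    """Standard error of the mean cut score across independent subpanels
    (sample SD divided by the square root of the number of subpanels)."""
    return stdev(subpanel_cuts) / sqrt(len(subpanel_cuts))

def interval_68(standard, sem):
    """Standard plus or minus one SEM."""
    return standard - sem, standard + sem

subpanel_standards = {
    "myocardial ischemia": [2.72, 3.10],
    "tension pneumothorax": [3.78, 3.88],
    "acute hemorrhage": [2.32, 2.66],
}
for scenario, cuts in subpanel_standards.items():
    print(scenario, round(sem_of_standard(cuts), 2))
# myocardial ischemia 0.19
# tension pneumothorax 0.05
# acute hemorrhage 0.17
```

With only 2 subpanels, the SEM reduces to half the absolute difference between the two cut scores, so it should be read as a coarse indicator of consistency.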
On a scale from 1 (very uncomfortable) to 5 (very comfortable), the panelists, on average, felt secure with their abilities to make qualified judgments based on their review (mean = 4.4, SD = 0.62, min = 3, max = 5). For some scenarios, including total spinal (5 of 7 panelists expressed concern) and hyperkalemia (5 of 6 expressed concern), the panelists indicated that they found it difficult to make expert judgments. To improve the standard setting process, the panelists suggested that more information about the simulation scenarios be provided and that judgments should be made on an individual resident across several scenarios, rather than for different residents within each specific scenario. In addition, to familiarize themselves with the logistics and stress of the simulation-based assessment, the panelists suggested that they be required to manage some of the scenarios before performing the videotape reviews and rating.
Based on our review, this is the first study that describes the steps necessary to establish performance standards for important anesthesia practice skills that, given prior research, can be reliably and validly measured with mannequin-based scenarios in a simulated practice environment.22 Using a previously validated examinee-centered approach,8 where qualified panelists discuss standards of practice, review performance samples, and make adequacy (qualified) judgments, we were able to derive performance standards for many specialized practice domains in anesthesia. For most of the scenarios, there was a moderately strong relationship between the proportion of panelists who thought the performance represented someone who was qualified and the key action scores that were associated with the performances. Based on the regression analyses, meaningful cut scores, representing the minimum performance needed to be classified as qualified, could be established. Moreover, for those scenarios where 2 subpanels made independent judgments of the same performances, the derived cut scores were relatively close. This consistency between panels provides evidence to support the appropriateness of the performance standards.23
Simulation programs can certainly help with training residents or providing continuing medical education opportunities for experienced physicians, especially in the management of rare events.24 Nevertheless, although formative assessment is extremely valuable, there are many situations, especially during postgraduate education, where specific competency decisions must be made. Here, one must be able to relate evaluation measures to specific skill thresholds. For examinations and assessments that incorporate selected responses (eg, multiple choice items), there are numerous validated standard setting techniques.11,25 For performance assessments, the methodologies are not as well developed, although several techniques applicable to SP assessments have been reported in the literature. Because mannequin-based simulations share many commonalities with SP cases, it is reasonable to think that common standard setting methods will be applicable to both. Nevertheless, the multifaceted nature of the skill sets typically measured via mannequin-based scenarios combined with the inherent complexities of developing scenario-specific scoring systems that discriminate along the ability continuum suggest that the evaluation of standard setting methodologies specific to this particular simulation modality is still required.
Because examinee-centered standard setting activities involve the review of numerous performance samples, they often yield information on potential flaws in the assessment process. For 3 of the scenarios (bronchospasm, anaphylaxis, unstable ventricular tachycardia), these inadequacies, combined with other protocol-related factors, may have played some role in our inability to derive a valid performance standard. For the bronchospasm scenario, the majority of participants completed all of the scoring actions. Although efforts were made to select performance samples that included trainees who missed key actions, there was little room, based on the negatively skewed score distribution, to discriminate between adequate and inadequate patient management. The simple solution would be to increase the scenario difficulty or, alternatively, embrace some other standard setting methodology that may or may not be effective. For the anaphylaxis and ventricular tachycardia scenarios, the relatively simplistic scoring systems (sum of key actions) may not have aligned with the criteria that the experts were using to make their judgments. For the anaphylaxis scenario, a diagnostic finding (hives) was withheld until the third minute. After this, the time taken to administer epinephrine, not simply whether it was administered or not (key action), may have influenced the expert judgments. Likewise, for the ventricular tachycardia scenario, management sequence, not captured by the available scoring system, may have confounded the results. Residents who offered both first line (shock) and second line (antiarrhythmic) therapies, but in different orders, could accrue the same number of key action points. If the panelists took the sequence into account, the relationships between their judgments and the scores could certainly be impacted. To address these issues, additional psychometric work pertaining to the scoring systems is warranted. 
For some scenarios, it may be necessary to weight certain tasks, define acceptable action sequences, and time critical actions. Alternatively, provided that a large number of performance samples are available, it may be prudent to develop a more sophisticated scoring system by modeling the judgments of experts.26
In the context of a multiscenario, simulation-based, summative, performance assessment, one can combine the individual scenario standards to form an overall test-based standard. This is commonly done as part of medical performance-based credentialing and licensure examinations.14 Although a conjunctive framework (ie, demanding adequate performance on all scenarios) would ensure that candidates could manage all patient conditions, allowing examinees to compensate for subpar performance in one area with more proficient performance in another will yield a more reliable, defensible, overall determination of who is, and is not, qualified. As long as the chosen scenarios cover the practice domain and are used simply as vehicles to measure the relevant procedural and management skills, adopting a compensatory scoring system, and complementary compensatory standard setting framework, is preferred.21
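The difference between the two decision frameworks can be sketched as follows. The scenario names, scores, and cut scores are hypothetical, and forming the overall compensatory standard as the sum of the scenario cut scores is one possible way of combining them, assumed here for illustration.

```python
def conjunctive_pass(scores, cuts):
    """Qualified only if every scenario score meets its own cut score."""
    return all(scores[name] >= cut for name, cut in cuts.items())

def compensatory_pass(scores, cuts):
    """Qualified if the total score meets the sum of the scenario cut scores,
    so strong performances can offset weak ones."""
    total_score = sum(scores[name] for name in cuts)
    return total_score >= sum(cuts.values())

# Hypothetical cut scores and one hypothetical examinee.
cuts = {"scenario_A": 3.0, "scenario_B": 4.0, "scenario_C": 3.5}
scores = {"scenario_A": 2, "scenario_B": 5, "scenario_C": 4}

print(conjunctive_pass(scores, cuts))    # below the cut on scenario_A
print(compensatory_pass(scores, cuts))   # total of 11 against a standard of 10.5
```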
With respect to the standard setting protocol, some of our panelists indicated that they would have preferred to see a series of performances from a single resident and then make an overall judgment of their qualification. This strategy, although professionally appealing, can be problematic, especially if the assessment is to be used for large numbers of candidates. First, each panelist will likely adopt slightly different compensation rules regarding the number of scenarios that must be managed adequately. For some panelists, poor performance in one scenario, regardless of performance in all the others, could result in an overall unqualified judgment. Second, if the unit of judgment is the person across multiple scenarios, then multiple scenarios for each examinee must be reviewed. Depending on the size of the examinee sample, this could be logistically challenging. Third, from a test administration perspective, generating scenario-specific standards is efficient. As long as the ability spectrum for a given task is covered adequately, relatively few performance samples need to be reviewed. In addition, if the relationship between scenario difficulties and cut scores is established from a representative sample of scenarios, it will be possible to impute cut scores for new scenarios without convening any additional panelists.27
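As a rough illustration of how cut scores might be imputed, the sketch below fits a straight line relating scenario difficulty to cut score and applies it to a new scenario. It assumes that difficulty is expressed as the mean scenario score and that the relationship is approximately linear. Both are assumptions made for the example, and the data are invented.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical scenarios with panel-derived cut scores.
mean_scores = [2.0, 3.0, 4.0, 5.0]  # difficulty: mean key actions completed
cut_scores = [1.5, 2.0, 2.5, 3.0]   # cut scores set by the panel

slope, intercept = fit_line(mean_scores, cut_scores)

# Impute a cut score for a new scenario with a mean score of 4.5.
imputed_cut = slope * 4.5 + intercept
print(round(imputed_cut, 2))  # 2.75
```

In practice, the adequacy of any such model would need to be checked against a representative set of scenarios before imputed cut scores were used for decisions.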
Although we were able to establish numeric standards for most of the scenarios, this standard setting study was not without limitations. As with all standard setting activities, the resultant cut scores will be somewhat dependent on the choice and number of panelists and the selection of performance samples. Even though the panel was relatively large, and the individual panelists were highly experienced, one could still question its overall representativeness. More importantly, for some of the scenarios, the sampling of audio-video recordings may not have been adequate. Based on our selection criteria, few, or none, of the performance samples may have truly reflected an unqualified performance, making it impossible to generate meaningful cut scores. In addition, the panelists may have been using other decision-making and technical performance criteria that are not currently reflected in the key actions. If this is true, the validity of the scores for these scenarios, at least with respect to procedural skills, could be questioned. From a practical perspective, although panelists could be provided with the key-action lists, their task of separating adequate from inadequate performance would likely regress to counting the number of key actions completed and determining how many, or even which ones, were required given the ability level of interest. Simply inspecting the rubric and making judgments concerning the number of key actions needed, without viewing the performances, would amount to a test-centered standard setting protocol, a strategy that has previously been shown to be confusing for the panelists, at least for performance-based assessments. Even though the panelist training was extensive, and they were told how the residents were evaluated (key actions), they may also have incorporated other traits (eg, communication), less relevant to this particular assessment, in deciding who was, and was not, qualified.
Moreover, because some of the panelists were familiar with the trainees in the audio-video recordings, the potential for bias exists, although it is limited by having panelists rate only a single encounter for each resident. Finally, although not a specific limitation of this investigation, it is important to note that the standard, or standards, apply to the specific skill, or skills, one is attempting to measure. To the extent that these skills are not clearly defined, not measured well, or inadequately incorporated in the panelists’ judgments, the resultant cut scores may not yield reliable decisions concerning adequacy or competence.
In addition to the quantitative evidence gathered to support the validity of the scenario standards, we also solicited feedback from the panelists. Interestingly, they found the judgment exercise difficult for 2 of the scenarios (total spine, hyperkalemia) where there was a strong relationship between panelists’ judgments and residents’ scores. Unfortunately, other than identifying which scenarios were challenging, the panelists did not provide reasons as to why. Gathering this type of information as part of future investigations would be valuable. On the postexercise questionnaire, the panelists also indicated that, based on their experience, only between 65% and 97% (mean = 87%) of all residents are ready for independent practice. Even if the group of panelists does not perfectly reflect the profession, this still suggests that skill acquisition during residency training is not adequate. To test this, it would be necessary to administer a multistation simulation assessment, composed of scenarios where performance standards are available, to a representative group of graduating residents. Their readiness to enter unsupervised practice could be ascertained by looking at their performance in relation to the standards. Similarly, if the goal is to evaluate the curriculum or to identify specific training deficiencies, less-experienced residents could be assessed.22 In the end, for a simulation-based performance assessment to be used to delineate specific examinee abilities or to judge overall proficiency or competence, appropriate performance standards must be set. This demands not only defensible standard setting protocols, but also valid and reliable assessment scores.
© 2008 Lippincott Williams & Wilkins, Inc.