Score validity is of central concern to any organization or school involved in high-stakes testing.1 Validation research entails clearly identifying the purpose for which test scores are to be used so that appropriate empirical evidence can be gathered to substantiate the intended score-based inferences.2 The validity of these score-based interpretations can be weakened by several test-related phenomena, including breaches of the security of the testing environment. The impacts of various forms of test-security breach need to be clearly addressed to determine the extent to which a priori knowledge of materials might provide an undue advantage to subgroups of examinees. Such evidence also helps minimize misinterpretation of scores on the part of the user. This task is especially crucial with performance-based tests such as standardized patient (SP) examinations, given the typically limited size of case banks, the long exposure of items and cases, and the high costs associated with developing these types of assessments.3
Impact of Security Breaches on Test Performance
The literature devoted to assessing the impacts of various forms of security breaches on the performances of students completing SP tests has reported mixed findings. Most investigations undertaken in this area have been aimed at determining whether mean scores on SP tests vary significantly when cases are administered throughout an extended interval, ranging from as little as several weeks4 to as much as an academic year.5 The authors of these studies have reported that mean station or case scores generally remain stable and that the reuse of identical cases, consequently, appears to have only a minimal impact on the scores of students taking the examination at different times throughout the administration cycle.4,6,7,8 However, other research suggests that the reuse of identical cases can yield an increase in overall mean score, prompting a suggestion that the number of common cases be kept at a minimum across forms.5,9,10 Swartz, Colliver, Cohen, and Barrows11,12 examined in a more systematic fashion whether collusion among students affected overall SP test scores, by encouraging students who took the examination in the early stages of administration to share as much information as possible about the cases with students scheduled to be tested at a later date. The authors found little evidence that information-sharing among students affected performance.
It is important to underscore that those studies restricted their view of a test security breach to various degrees of (presumed) information-sharing among examinees. It can be argued that complicity among students, although a common form of a test-security breach, is probably one of its most benign manifestations. This is especially likely with low- to moderate-stakes SP examinations, where students' motivation to engage in information sharing is low. In a high-stakes context (e.g., in licensure and certification testing), dishonest coaching organizations and examinees might employ a host of illicit means to obtain and disseminate actual test materials. A study undertaken by De Champlain et al.13 modeled the impact of additional, more severe forms of test-security breaches on examinees' performances, such as those that would result from students' having access to formal materials prior to taking the examination. The authors reported that disclosing test materials, whether directly to a subgroup of examinees or via a dishonest coaching course, led to significant checklist performance gains for a sample of United States medical graduates (USMGs). However, the impact of disclosure on interpersonal skills (IPS) scores was nil. Although informative, these findings were based on a small sample that was homogeneous with respect to examinees' medical education and clinical skill levels. As such, there is a need for this type of research to be replicated with a more varied sample of examinees, to obtain an estimate of disclosure effects that might generalize to a more heterogeneous population of medical students.
The purpose of the present study was to model the impact of disclosing test materials on SP examination scores with a sample of international medical graduates. Furthermore, it is hoped that ensuing findings will provide a practical estimate of expected effect size within the context of this type of security breach and with this population.
Method
Examination. In this investigation, the SP test assessed the clinical (history taking, physical examination, communication) skills and IPS of physicians about to enter supervised practice. SPs are laypeople trained to portray one of a variety of clinical scenarios. Test candidates rotate through these scenarios (or cases) and encounter patients in a setting intended to reflect an ambulatory care clinic. Case-specific checklists are used to assess examinees' clinical skills. These checklists are composed of dichotomously scored items, each of which represents a single action that is expected to be done by the student. A percent-correct score, corresponding to the number of actions done by the student out of the total number of behaviors listed in a given checklist, is computed for all encounters. IPS are assessed with the Patient Perception Questionnaire (PPQ), a case-independent inventory that is composed of six five-point Likert scale items. A percent-correct PPQ score is also computed and reported to each student for all encounters. Both measurement instruments are completed by the SP following each 15-minute encounter with the student. The same ten cases (chosen from the available pool) were administered to all examinees. The cases were selected to reflect the majority of cells contained in the test blueprint with regard to both skill and content domains.
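The two scoring rules described above can be sketched as follows. This is an illustrative sketch only: the item responses are hypothetical, and the linear rescaling of the summed PPQ Likert ratings is an assumption, since the exact PPQ conversion is not specified in the text.

```python
# Illustrative sketch of the two scores described above.
# Item responses below are hypothetical, not actual examination data.

def checklist_percent_correct(item_responses):
    """Percent-correct checklist score: the share of expected actions
    (dichotomously scored items, 1 = done, 0 = not done) performed."""
    return 100.0 * sum(item_responses) / len(item_responses)

def ppq_percent_correct(likert_ratings, scale_max=5):
    """Percent-correct PPQ score from six five-point Likert items,
    assuming a simple linear rescaling of the summed ratings."""
    return 100.0 * sum(likert_ratings) / (scale_max * len(likert_ratings))

# A hypothetical 12-item checklist for one encounter.
responses = [1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0]
print(round(checklist_percent_correct(responses), 1))  # 66.7

# Six hypothetical PPQ ratings on a 1-5 scale.
ratings = [4, 5, 3, 4, 4, 5]
print(round(ppq_percent_correct(ratings), 1))  # 83.3
```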
Scoring Procedure. In this examination, two SPs were trained to portray each case. For any given case, the performing SP portrayed the actual clinical scenario with the examinee, whereas the monitoring SP observed the encounter as it proceeded on a video screen in a separate room. Each student's final percent-correct checklist score reflected the consensus reached by the performing and monitoring SPs as to what constituted the appropriate response to each item. Videotape review was instituted to arrive at a consensus if two or more discrepancies per checklist were noted in any given encounter. Of the 9,625 checklist item responses recorded (77 students × 125 checklist items across the ten cases), videotape review was necessary for 202 (2.10%). The PPQ percent-correct score was derived from the performing SP.
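The videotape-review figures above can be checked directly from the counts given in the text:

```python
# Verifying the review-rate arithmetic reported above.
students = 77
checklist_items = 125   # total checklist items across the ten cases
reviewed = 202          # item responses that required videotape review

total_responses = students * checklist_items
review_rate = 100.0 * reviewed / total_responses

print(total_responses)           # 9625
print(f"{review_rate:.2f}")      # 2.10
```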
Examinees. Seventy-seven international medical graduates (IMGs), recruited from the Los Angeles metropolitan area, participated in this study and were blinded to its purpose. All examinees were certified by the Educational Commission for Foreign Medical Graduates, i.e., they had successfully passed the following examinations: Step 1 and Step 2 of the United States Medical Licensing Examination and a test of English-language proficiency. The examinees were paid for their participation and randomly assigned to one of two testing conditions: control or security breach (SB). The testing environment for examinees assigned to the control condition (n = 32) was representative of a “normal” assessment situation (i.e., participants received routine prior information about the test but no materials from the examination). In the SB condition, we attempted to model a situation in which actual case materials were disclosed. Examinees in the SB condition (n = 45) were directly provided with the checklists for five of the ten cases to be seen (referred to as the exposed cases) as well as the PPQ, and were given one to two hours to review these materials prior to completing the test. Information pertaining to the five non-exposed cases was not disclosed to any of the examinees participating in this study. Cases included in the exposed and non-exposed sets were matched with respect to the main areas of this SP test's blueprint.
Analyses. Two separate analyses of covariance (ANCOVAs) were undertaken to compare the performances of the two groups on the five exposed cases. In both models, the condition factor (control or SB) was treated as the independent variable. In the first ANCOVA, the mean percent-correct checklist score on the five non-exposed cases served as the covariate, and the mean percent-correct checklist score on the five exposed cases was the dependent variable (DV). In the second analysis, the mean percent-correct PPQ score on the five non-exposed cases served as the covariate, and the mean percent-correct PPQ score on the five exposed cases was the DV.
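The structure of each ANCOVA can be illustrated with a short sketch. The data below are synthetic (the true scores are not reproduced here), and the group main effect is computed as a nested-model F-test, which stands in for a standard ANCOVA routine; note that the error degrees of freedom come out to 74 for n = 77, matching the F(1,74) tests reported in the Results.

```python
import numpy as np

def ancova_group_f(y, covariate, group):
    """Group main effect in a one-way ANCOVA, computed as a nested-model
    F-test: full model y ~ 1 + group + covariate versus the reduced
    model y ~ 1 + covariate."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X_full = np.column_stack([np.ones(n), group, covariate])
    X_reduced = np.column_stack([np.ones(n), covariate])

    def rss(X):
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        return float(np.sum((y - X @ beta) ** 2))

    rss_full, rss_reduced = rss(X_full), rss(X_reduced)
    df1, df2 = 1, n - X_full.shape[1]   # F(1, n - 3); here F(1, 74)
    f = ((rss_reduced - rss_full) / df1) / (rss_full / df2)
    return f, df1, df2

# Synthetic data mimicking the design: 32 control (0) and 45 SB (1) examinees.
rng = np.random.default_rng(0)
group = np.array([0] * 32 + [1] * 45)
covariate = rng.normal(55, 8, size=77)              # non-exposed-case scores
y = 10 + 0.7 * covariate + 5 * group + rng.normal(0, 5, size=77)

f_stat, df1, df2 = ancova_group_f(y, covariate, group)
print(df1, df2)   # 1 74
```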
Results
Mean scores and standard errors on the five exposed cases for examinees assigned to each of the two conditions, adjusted for initial differences in ability between groups, were as follows:
For examinees assigned to the control condition,
- the adjusted mean percent-correct checklist score was 54.53 (SE = 1.48), and
- the adjusted mean percent-correct Patient Perception Questionnaire score was 60.87 (SE = 1.18).
For examinees assigned to the security breach condition,
- the adjusted mean percent-correct checklist score was 59.95 (SE = 1.24), and
- the adjusted mean percent-correct Patient Perception Questionnaire score was 67.03 (SE = 0.99).
A significant group main effect was obtained in the first ANCOVA, F(1,74) = 7.66, p = .0071. For the exposed cases, the SB group (adjusted M = 59.95%) significantly outperformed the control group (adjusted M = 54.53%) on the checklist. Similarly, the mean PPQ score for examinees assigned to the SB condition (adjusted M = 67.03%) was significantly higher than that estimated for the control group (adjusted M = 60.87%), F(1,74) = 15.84, p = .0002.
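The group differences can be recovered directly from the adjusted means listed above:

```python
# Adjusted mean percent-correct scores on the five exposed cases,
# as reported above.
control = {"checklist": 54.53, "ppq": 60.87}
breach = {"checklist": 59.95, "ppq": 67.03}

checklist_gain = breach["checklist"] - control["checklist"]
ppq_gain = breach["ppq"] - control["ppq"]

print(f"{checklist_gain:.2f}")  # 5.42
print(f"{ppq_gain:.2f}")        # 6.16
```

These values round to the 5.4% checklist gain and 6.2% PPQ gain discussed in the Conclusions.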
Conclusions
Results obtained in the present study with a sample of international medical graduates mirror those reported in previous research with USMGs.13 Disclosing checklist items led to significant performance gains for the examinees assigned to the SB condition. The gain noted in this investigation (5.4%) was, however, slightly lower than that obtained with a sample of USMGs. This is probably attributable to the larger number of cases administered in the test form (ten, as opposed to six in the past USMG study). The challenge posed to the IMGs was therefore slightly more daunting, as they had to sift through ten cases to identify the clinical scenarios for which they possessed disclosed materials and apply this information accordingly. Nonetheless, the gain noted would concretely translate into an advantage of 4.4 checklist items over five cases (slightly less than one item per case). This advantage might be inconsequential for most USMGs, who typically perform well above the cut score on this type of examination.14 However, it could significantly affect decision consistency for IMGs, whose scores tend to cluster in larger proportions in the vicinity of the pass/fail standard. The control and SB groups also differed significantly with respect to their mean PPQ scores, a result that was not found with USMGs.13 Interestingly, the difference between the two groups (6.2%) was actually larger than the one resulting from disclosing checklist items. This could reflect a culturally based difference in interaction styles. Disclosing simple indicators of IPS (such as the Likert-scale items found on the PPQ) to SB group examinees yielded a mean score similar to that typically encountered with U.S. medical students. It is also worth noting that the type of case most susceptible to the effects of disclosure appears to be population-dependent. For U.S. medical students, prior research suggested that cases involving largely mechanical physical examination maneuvers were the easiest to memorize and consequently reflected the highest performance gains for examinees with prior knowledge of materials. For our sample of IMGs, divulging materials for cases that primarily require communication and IPS in the interaction with the patient proved the most beneficial. Again, these findings appear to be indicative of differences in the way our sample of IMGs interacted with the SPs. These results suggest that providing a clear description of the examination and its goal to all examinees prior to the administration (in some form of information bulletin, for example) is necessary to ensure a common understanding of expected behavior on the part of students.
In summary, the results presented in this study provide further evidence that the secure handling of test materials is essential for all examinations, whether they be traditional in format or performance-based. Although the security breach modeled in this investigation was severe (half of the test materials were directly exposed to students), steps can nonetheless be undertaken to minimize the likelihood of materials being disclosed. This, in turn, might lessen the impact of a security breach should checklists or other pertinent information fall into the hands of dishonest individuals.
One obvious strategy that should be adopted with all SP tests is to clearly lay out the flow of materials and restrict access solely to the staff members directly concerned, so that these individuals can be held accountable for the receipt and safekeeping of this information. Delivering the measurement instruments via a computer network also seems advisable, given the greater control that the latter medium can afford and the virtual elimination of a "paper trail." The results of our study also point out the need to increase test-development efforts to minimize the likelihood of a security breach. Increasing the pool of available cases enables a more frequent rotation of forms within and across test sites, thus limiting the exposure rate for any given set. Finally, the use of modeled or cloned cases also seems desirable to increase the size of the case pool and thwart those individuals who may have mechanically memorized cases and accompanying materials. Modeled cases are defined as those presenting a similar opening scenario but requiring a different work-up on the part of the student. Cloned cases, on the other hand, call for a similar set of actions on the part of the student but present different contexts.
Although informative, our results need to be interpreted in light of several limitations. First, the sample size examined was small, and generalizations should be made with caution. Our sample was also composed of IMGs who were perhaps atypical of the corresponding population, given that they had successfully fulfilled several U.S. medical licensing requirements (passed the USMLE Step 1 and Step 2 and a test of English-language proficiency). Consequently, the effect sizes reported in this study should probably be viewed as lower-bound estimates of what to expect in an operational testing context. Replication of this research with different groups of both IMGs and USMGs seems advisable. This research might also permit us to test the hypothesis that lower-ability students might benefit more from gaining access to materials than would those who are more proficient. From a test-development perspective, pursuing research that focuses on the identification of characteristics that make a case more vulnerable to memorization would also be helpful. Finally, the findings reported in this study underscore the need to develop methods to detect breaches to the security of the testing environment. Research aimed at assessing the usefulness of “tagged” checklist items and other means should be pursued.15
Testing organizations and medical schools should always be vigilant in guarding themselves against dishonest examinees and organizations that may wish to compromise the secure nature of the testing environment. This investigation confirms past findings in that the psychometric properties of the SP examination described appear to be vulnerable to blatant disclosure of testing materials. It is hoped that the results presented in this article will foster future relevant research that will ultimately lead to the implementation of secure SP tests for licensure and other purposes.
References
1. Kane M. The validity of licensure examinations. Am Psychol. 1982;37:911–8.
2. Messick S. Standards of validity and the validity of standards in performance assessment. Educational Measurement: Issues and Practice. 1995;14:5–8.
3. Mehrens WA, Phillips SE, Schram M. Survey of test security practices. Educational Measurement: Issues and Practice. 1993;12:5–19.
4. Colliver JA, Barrows HS, Vu NV, Verhulst SJ, Mast TA, Travis TA. Test security in examinations that use standardized-patient cases at one medical school. Acad Med. 1991;66:279–82.
5. Cohen R, Rothman AI, Ross J, Poldre P. Impact of repeated use of objective structured clinical examination stations. Acad Med. 1993;68(10 suppl):S73–S75.
6. Colliver JA, Travis TA, Robbs RS, Barnhart AJ, Shirar LE, Vu NV. Test security in standardized-patient examinations: analysis with scores on working diagnosis and final diagnosis. Acad Med. 1992;67(10 suppl):S7–S9.
7. Niehaus AH, DaRosa DA, Markwell SJ, Folse R. Is test security a concern when OSCE stations are repeated across clerkship rotations? Acad Med. 1996;71:287–9.
8. Stillman PL, Haley HA, Sutnick AL, et al. Is test security an issue in a multistation clinical assessment? A preliminary study. Acad Med. 1991;66(10 suppl):S25–S27.
9. Jolly BC, Newble DI, Chinner T. The learning effect of re-using stations in an objective structured clinical examination. Teach Learn Med. 1993;5:66–71.
10. Jolly BC, Cohen R, Newble DI, Rothman AI. Possible effects of reusing OSCE stations. Acad Med. 1996;71:1023–4.
11. Swartz MH, Colliver JA, Cohen DS, Barrows HS. The effect of deliberate, excessive violations of test security on performance on a standardized-patient examination. Acad Med. 1993;68:76–8.
12. Swartz MH, Colliver JA, Cohen DS, Barrows HS. The effect of deliberate, excessive violations of test security on a standardized-patient examination: an extended analysis. In: Rothman AI, Cohen R (eds). Proceedings of the Sixth Ottawa Conference on Medical Education. Toronto, Ontario, Canada: University of Toronto Bookstore Publishers, 1994:280–4.
13. De Champlain AF, Macmillan MK, Margolis MJ, et al. Modeling the effects of security breaches on students' performances on a large-scale standardized patient examination. Acad Med. 1999;74(10 suppl):S49–S51.
14. Margolis MJ, De Champlain AF, Klass DJ. Setting examination-level standards for a performance-based assessment of physicians' clinical skills. Acad Med. 1998;73(10 suppl):S114–S116.
15. Macmillan MK, De Champlain AF, Klass DJ. Using tagged items to detect threats to security in a nationally administered standardized patient examination. Acad Med. 1999;74(10 suppl):S55–S57.
Section Description
Research in Medical Education: Proceedings of the Thirty-ninth Annual Conference. October 30 - November 1, 2000. Chair: Beth Dawson. Editor: M. Brownell Anderson. Foreword by Beth Dawson, PhD.