The inclusion of a clinical skills examination in the testing system for medical licensure provides the opportunity to assess history-taking and physical examination skills, communication and interpersonal skills, spoken English proficiency, and documentation of findings in a structured patient note.1 Although the use of standardized patients allows for a high-fidelity simulation of the physician–patient encounter, specific artifacts of the testing situation may nonetheless have a nontrivial impact on scores. One type of evidence that such artifacts exist would be systematic trends in performance across the testing day. If these trends, or sequence effects, resulted from the artificial constraints imposed by the testing conditions, any impact on examinee scores would represent construct-irrelevant variance. It is therefore incumbent on test developers to scrutinize evidence regarding potential threats to the validity of score interpretations; such scrutiny is consistent with Cronbach’s admonition: “A proposition deserves some degree of trust only when it has survived serious attempts to falsify it.”2(p103)
Previous research has yielded inconsistent results regarding the presence of sequence effects in standardized-patient assessments. Lloyd and colleagues3 examined composite scores (based on a standardized-patient-completed checklist and a series of post-encounter written questions about the case) for a five-station medical-school-based examination and reported a significant upward score trend across encounters. McKinley and Boulet4 investigated four separate scores (history taking and physical examination, written communication, interpersonal skills, and spoken English proficiency) from the Educational Commission for Foreign Medical Graduates’ Clinical Skills Assessment and reported a clear upward trend in performance across the testing day. In contrast to these studies, other reports either have failed to show sequence effects or have reported inconsistent or contradictory results across cohorts of examinees.5,6
Given the unlikely scenario that examinee proficiency changes during the course of an examination, it seems reasonable to conclude that the presence of sequence effects is evidence of an influence on scores that is unrelated to the construct of interest. Previous authors have suggested that unfamiliarity with the format may produce anxiety that impacts scores on the initial cases in an examination.3,4 Alternatively, it may be that examinees must acclimate to working within standardized time constraints or other artificial aspects of the encounter. It is also possible that the effects result from changes in patient stringency (e.g., warm-up or fatigue effects) rather than changes in examinee performance. A consistent effect for components scored by the patient that is absent from scores not produced by the patient would suggest that the effect results from a change in patient stringency and not examinee performance.
Previous researchers have expressed concern about the possible causes of these effects, but they generally have considered the impact to be negligible when evaluating test scores rather than case scores. This conclusion is in part based on the fact that all examinees will be affected similarly by any sequence effect. However, in the context of large-scale, high-stakes assessments such as the USMLE Step 2 CS Examination, the problem may be more complicated. Such tests may include unscored pretest cases that are administered randomly throughout the test. If sequence effects exist, an examinee whose first case is not scored may have a systematic advantage over one whose last case is not scored.
The purpose of the present paper was to evaluate the presence of trends in examinee performance across the USMLE Step 2 CS Examination testing day. The presence of these effects was investigated for the four scored components of the test (data gathering, documentation, communication and interpersonal skills [CIS], and spoken English proficiency), and the results were further examined by examinee gender.
Step 2 CS Examination
In each testing session of the Step 2 CS Examination, examinees cycle through 12 rooms and interact with 12 different standardized patients. Each patient is seen by an examinee in each of the 12 sequence positions. The specific set of patients for a given session is selected to meet content constraints which ensure that examinees testing at different times will encounter a balanced and equivalent selection of case challenges.
For each encounter, examinees first read a brief set of instructions that (1) explain the purpose of the patient’s visit, and (2) outline expectations for collecting history information and completing a physical examination. Examinees then have 15 minutes to interact with the patient and 10 minutes after the encounter to complete a structured patient note (notes subsequently are scored by trained physician raters using case-specific algorithms). While examinees complete the patient note outside the room, patients complete several instruments inside the room: (1) a structured checklist designed to evaluate the examinee’s data gathering skills, (2) a set of rating scales designed to evaluate CIS, and (3) a rating scale designed to evaluate spoken English proficiency.
Scores were selected from a sample of approximately 24,000 examinees who tested at the five test centers (Philadelphia, Atlanta, Chicago, Los Angeles, and Houston) between October 2005 and July 2006. Because all participating examinees had given permission for their data to be used in research and all data were collected as part of routine operational testing, no additional IRB review was required. Fifty-seven percent of the examinees were male, 45% were international graduates, and 42% reported having English as a second language.
Two analytic procedures were implemented. For each of the scores, an analysis of covariance (ANCOVA) was conducted: the score was the dependent measure, and case sequence and gender were treated as factors in the analysis. Nominal variables identifying (1) examinees reporting English as a second language, and (2) examinees who graduated from U.S. medical schools were included as covariates. Because each case occurs within each sequence position an approximately equal number of times within each administration, case difficulty is automatically counterbalanced across sequence position. This is only approximate because some sessions have fewer than 12 examinees; for these sessions, not all cases occur in each sequence position. The resulting missing data can be considered missing at random. This will result in a reduction in the precision with which effects can be estimated, but it will not impact the results systematically.
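The counterbalancing argument above can be sketched in a short simulation. The rotation scheme and case labels below are hypothetical, chosen only to illustrate why a full 12-examinee session balances cases across sequence positions while a short session leaves some case–position cells empty.

```python
from collections import Counter

CASES = [f"case_{c}" for c in range(12)]  # hypothetical case labels

def session_assignments(n_examinees):
    """Assume each examinee starts in a different room and cycles
    through all 12 cases, so examinee e sees case (e + p) mod 12
    at sequence position p."""
    return [(p, CASES[(e + p) % 12])
            for e in range(n_examinees) for p in range(12)]

# Full session of 12 examinees: every case occupies every sequence
# position exactly once, so case difficulty is counterbalanced.
full = Counter(session_assignments(12))
assert all(full[(p, c)] == 1 for p in range(12) for c in CASES)

# Short session (here, 9 examinees): some case-position cells are
# never filled, so the counterbalancing is only approximate.
short = Counter(session_assignments(9))
empty_cells = sum(1 for p in range(12) for c in CASES
                  if short[(p, c)] == 0)
print(empty_cells)  # 36 of the 144 case-position cells are empty
```

Under this rotation, each short session loses the same number of cells at every position, which is why the resulting missing data can plausibly be treated as missing at random.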
The second analytic procedure was a hierarchical linear modeling approach. This procedure has the advantage that it appropriately accounts for the fact that individual case scores are nested in examinees. At the first level of the model, case scores were nested in examinees and sequence position was coded as a nominal variable. At level two, nominal variables identifying (1) examinees reporting English as a second language, (2) examinees who graduated from U.S. medical schools, and (3) examinee gender were used to account for score differences at level one. This hierarchical analysis was conducted for each of the four components. The ANCOVA tested for overall variation across sequential positions, and the hierarchical analysis tested for a difference between each position and the final position. Post hoc analyses were completed as appropriate to test for the presence of increases or decreases in performance across pairs of sequential positions.
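The level-one contrast described above (each sequence position versus the final position) can be illustrated on simulated data. Everything below is an assumption for illustration: the trend values, the noise model, and the simple within-examinee differencing, which removes the examinee-level effect much as the nesting in a hierarchical model does, but without the level-two covariates carried in the operational analysis.

```python
import random
import statistics

random.seed(0)

# Simulated data: 500 examinees, 12 case scores each, with a small
# (hypothetical) upward trend over the first few positions.
def simulate_examinee():
    ability = random.gauss(0, 1)  # examinee-level effect
    trend = [-0.30, -0.20, -0.12, -0.08, -0.05, 0, 0, 0, 0, 0, 0, 0]
    return [ability + trend[p] + random.gauss(0, 0.5) for p in range(12)]

scores = [simulate_examinee() for _ in range(500)]

# Contrast each position against the 12th using within-examinee
# differences; differencing cancels the examinee-level effect.
for p in range(12):
    diffs = [s[p] - s[11] for s in scores]
    print(p + 1, round(statistics.mean(diffs), 3))
```

With this setup the early positions show clearly negative contrasts that shrink toward zero, mirroring the pattern a position-versus-final hierarchical contrast would report.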
Overall, the two analytic procedures produced essentially identical results. The information provided in this section will focus on the simpler and more familiar ANCOVA; specific results from the hierarchical modeling analysis will be provided when appropriate. Figure 1 presents graphs which show the estimated marginal means by sequence position for each of the four scores. In each graph, the vertical axis is scaled to represent 1.25 observed-score standard deviations for the respective score scale. Three of the four graphs (CIS, data gathering, and documentation) show a consistent pattern: (1) there is a general upward trend in scores across sequence positions, and (2) females outperform males. In all four analyses, gender differences were statistically significant (P < .001); for CIS, data gathering, and documentation, sequence was significant (P < .001). The hierarchical analysis indicated that scores from early positions in the sequence were significantly different from those from the final position for these same three scores. In the case of CIS, all positions except the 11th were significantly different from the 12th. For data gathering, the first five positions were significantly different from the final position; for documentation, scores from the first four positions were significantly different from those from the final position. For CIS and data gathering, there also was a significant gender × sequence interaction (P < .001). For the three components that show a sequence effect, it is also apparent that the score change by sequence is greatest from positions one through five and then becomes relatively more stable for the latter part of the test; the largest change is generally between positions one and two.
The trends apparent in the CIS graph are particularly interesting. The discrepancy in performance for males and females is greater for sequence position one than for any other position. This suggests that the effect is greater for males than for females, and this interpretation is consistent with the significant interaction effect. In addition, there is a noticeable drop in scores between positions five and six and between positions nine and ten; these drops occur after the two scheduled testing breaks. Post hoc analysis within both analytic procedures shows significant score differences between positions five and six and between nine and ten.
Although the literature suggests some ambiguity regarding expectations for sequence effects in standardized-patient assessments, the results presented in this paper do suggest the presence of such effects in the USMLE Step 2 CS Examination. Given these findings, two main questions arise: (1) are the effects large enough to be of practical concern? and (2) what are the causes?
Previous authors have concluded that sequence effects are ignorable either because the magnitude is small or because the effects act on all examinees similarly.3,4 In the present study, conclusions about the results depend on the nature of the intended interpretation. Expected performance in the early sequence positions is sufficiently lower than it is in the later positions that it would substantively interfere with interpretations about an examinee’s relative performance across cases. That is, a conclusion that an examinee was more able to handle the content presented in case sequence 12 versus sequence one could be influenced substantively by the sequence effect. The understanding that both systematic and random effects of this type are present at the case-score level has guided the decision not to provide content or case-level feedback to examinees. The present results provide an additional caution against interpreting examinee-level score differences across cases.
The sequence effects also are large enough that the impact on total score may be nontrivial. For example, an average examinee with the first case unscored would be expected, all other things being equal, to receive a score about one tenth of a point (0.09) higher on the scale for communication and interpersonal skills than would an examinee with the last case unscored. This is a small effect on the overall score scale (i.e., approximately 6% of an observed score standard deviation) and less than 1% of the entire examinee population falls within one tenth of a point above or below the cut score, but it may be an overstatement to refer to it as insignificant. First, the effect is systematic rather than random at the examinee level; it will average out across but not within examinees. The effect also will have the potential to impact three of the four score scales for an examinee; an examinee whose first case goes unscored will have an expected advantage on all three of those scores.
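The arithmetic behind this comparison is straightforward to reproduce. The position means below are hypothetical values, chosen so that position 1 sits about one scale point below position 12; the resulting advantage, (mean at position 12 − mean at position 1) / 11, comes out near the 0.09 figure discussed above.

```python
# Hypothetical CIS position means for an average examinee, rising
# steeply over the first positions and then flattening.
position_means = [6.0, 6.4, 6.6, 6.7, 6.8, 6.8,
                  6.9, 6.9, 6.9, 6.9, 7.0, 7.0]

drop_first = sum(position_means[1:]) / 11   # first case unscored
drop_last = sum(position_means[:-1]) / 11   # last case unscored
advantage = drop_first - drop_last

# Algebraically, advantage = (mean_12 - mean_1) / 11, so only the
# endpoints of the trend matter for this comparison.
print(round(advantage, 3))  # 0.091
```

The cancellation of the middle positions makes clear why the size of this advantage is driven entirely by how far the first scored position lags the last.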
Finally, results indicate that the sequence effect does not impact all examinees equally for the CIS and data gathering scores. The significant gender × sequence interactions indicate that the nature of the effect is not uniform across gender groups, and additional research may demonstrate similar effects for other examinee groups. It therefore would be an oversimplification to conclude that the effects are the same for all examinees.
The results also may provide some insight into the cause of these effects. For example, one important question is whether the effects are the result of sequential changes in examinee performance or of changes in patient stringency in scoring. Because patient notes are rated after the fact and in a different sequence, the identification of sequence effects in these scores provides suggestive, though not definitive, support for the difference being related to examinee performance. Although this pattern of results rules out the possibility that the sequence effects for documentation are attributable to changes in patient stringency in scoring, the effect could still be attributed to patients if they systematically provide more information to examinees as the day progresses. The nonuniform sequential change in performance for males and females also would seem to argue against the explanation of changes in patient stringency. Again, however, differential changes in stringency by gender are possible. Finally, the lack of a sequence effect for spoken English proficiency also would seem to argue against the presence of a generalized stringency effect on the part of the patients. This again is only suggestive; the lack of a stringency effect for spoken English does not prove that there could be no effect for the other components.
It should be acknowledged that the interpretation of the effects detected in this investigation is somewhat limited, and that these results suggest the need for further investigation of possible causes at the examinee level and at the rater level. Individual session characteristics (e.g., the specific collection of patient presentations for that examination day) and site characteristics (e.g., ways in which the testing sites might vary) should also be considered. In addition, it might be helpful to investigate the relationship of sequence effects to examinee characteristics other than gender (e.g., type or location of medical school, or English speaking skills).
It is clear that more research is needed to fully understand these effects; a more complete understanding of the effects may lead to strategies to eliminate them. For example, if evidence suggests that time constraints impact performance on cases early in the sequence, additional time could be allocated to those cases. Alternatively, if evidence suggests that artificial aspects of the testing format (such as how patients interact with or respond to examinees) are a contributing factor, changes in patient training may be appropriate. Previous authors have suggested that it may be possible to make adjustments for these effects.5 Unfortunately, the fact that they are not uniform across examinee groups complicates this approach. One possible strategy would be to consider the first case as practice and allow it to be unscored. This strategy is limited by the fact that it would lower the reliability of the test and only partially eliminate the identified effect. Given the modest size of the effect, implementing such a strategy without careful study of the causes of the effect would be premature.
It is hoped that future research will provide additional insight into the identified effects. Research currently planned will assess the relationship between sequence effects and the amount of time examinees use in the patient interactions, collect information from examinees about specific aspects of the testing experience that could be related to the present findings, and sequentially review examinee performance.
The process of validation is necessarily both continuous and iterative. Critical review of assessment procedures must be followed by modification of those procedures when empirical evidence supports such change. To ensure a high-quality examination process, it is imperative to critically examine the assessment and to be prepared to address issues that warrant attention.