This commentary points to several measurement issues that arise in assessing medical student performance outcomes and then discusses the challenge of interpreting between-school differences. A problem often encountered in assessing student learning is creating an instrument that is at the right “pay grade.” If it is too easy, ceiling effects compress scores. If it is too difficult, examinee performance can be compressed around chance values, and morale problems can occur. These issues are discussed in the context of a report by Williams and colleagues that measures medical student performance across five institutions on instruments assessing diagnostic pattern recognition and clinical data interpretation. The author of this commentary observes that, when interpreting between-school differences in assessing student learning, what can seem like small differences can have important consequences.
Dr. Albanese is professor emeritus, University of Wisconsin School of Medicine and Public Health, and director of research, National Conference of Bar Examiners, both in Madison, Wisconsin.
Correspondence can be addressed to Dr. Albanese, 610 Walnut St., 1007C, Madison, WI 53726-2397; telephone: (608) 316-3051; fax: (608) 442-7974; e-mail: email@example.com.
Editor's Note: This is a commentary on Williams RG, Klamen DL, White CB, et al. Tracking development of clinical reasoning ability across five medical schools using a progress test. Acad Med. 2011;86:1148–1154.
Williams and colleagues1 are to be congratulated for their evidence-based and systematic approach to developing and validating instruments to assess the clinical reasoning ability of medical students. Developing and pilot testing their instruments at one institution and then enlisting a consortium of four more schools to determine the generalizability of findings is a methodology that should serve as a model for others engaged in this type of endeavor. However, instrument development is not for the faint-hearted, and there are assorted measurement and interpretive challenges that must be faced when developing any assessment instrument. There is also the issue of between-school variability that comes with determining the generalizability of findings. In this commentary, I use the study by Williams and colleagues to illustrate several measurement issues that arise in this type of work and then discuss the challenge of interpreting between-school differences.
Targeting an Assessment to the Right “Pay Grade”
A problem often encountered in assessing student learning is creating an instrument that is at the right “pay grade” for the particular examinee population. If the instrument is too easy, ceiling effects compress scores. Higher-level examinees reach the uppermost score value and have nowhere to go from there. If it is too difficult, examinee performance can be compressed around the chance value, and morale problems can occur.
Williams and colleagues evaluated the possibility of ceiling effects on tests measuring diagnostic pattern recognition (DPR) and clinical data interpretation (CDI) by computing the proportion of the remaining distance that mean scores increased from year to year among first-, second-, third-, and fourth-year medical students at five institutions (Group 0 completed the tests at orientation in the first year before receiving any training, Group 1 at orientation to the second year after students had completed their first year of medical school, etc.). This measurement provides an indicator of whether scores were proportionately reduced as they approach the maximum; however, it provides less of an indicator of whether high scores are truncated at the maximum.
Ceiling effects most likely affected the Group 3 scores on the DPR instrument because both the increase in means from Group 2 to Group 3 and Group 3 SDs declined from previous years. To show how truncation works, assume for a moment that scores above 100% could be obtained. Using School A as an example, suppose that, instead of the 2.64 gain reported, in reality the gain from Group 2 to Group 3 had been the same as that from Group 1 to Group 2 (9.32 points, making the Group 3 mean score 91.96) and that the Group 3 SD had remained the same as Group 2's (8.29). Using a normal distribution approximation, 16.6% of the Group 3 scores would exceed 100%. Truncating these scores to 100% reduces both the mean and SD. However, it is likely that ceiling effects result from both compression and truncation. Compression occurs because some of the questions are so difficult that they push the limit of the examinees' pay grade; only the most clinically adept, whose scores are truncated at the maximum, have the requisite skills to answer them. Those with high ability, but not so high as to cause truncation, answer all but the most difficult questions correctly. Thus, ceiling effects probably operate by both compression from the most difficult questions and truncation of scores for the most clinically adept. Although the proportionate reduction reported by Williams and colleagues indicates more reduction than would be expected by ceiling effects due to compression, truncation was not well modeled by the approach and could account for the observed results.
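The thought experiment above can be checked with a short calculation. Below is a minimal sketch in Python using the hypothetical mean of 91.96 and SD of 8.29 from the example; the simulation portion uses illustrative random draws, not study data, to show that truncating at 100% lowers both the mean and the SD.

```python
import random
from math import erf, sqrt
from statistics import mean, pstdev

def normal_sf(x, mu, sigma):
    """Survival function P(X > x) for a normal distribution, via the error function."""
    z = (x - mu) / sigma
    return 0.5 * (1.0 - erf(z / sqrt(2.0)))

# Hypothetical Group 3 parameters from the thought experiment:
# mean 91.96 (a 9.32-point gain over Group 2) and SD 8.29.
mu, sigma = 91.96, 8.29
print(f"P(score > 100) = {normal_sf(100.0, mu, sigma):.1%}")  # about 16.6%

# Truncating simulated scores at 100% reduces both the mean and the SD.
random.seed(0)
scores = [random.gauss(mu, sigma) for _ in range(100_000)]
clipped = [min(s, 100.0) for s in scores]
print(f"mean: {mean(scores):.2f} -> {mean(clipped):.2f}")
print(f"SD:   {pstdev(scores):.2f} -> {pstdev(clipped):.2f}")
```

The 16.6% figure matches the commentary's normal approximation, and the clipped sample shows the direction of the bias in both moments.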
The CDI scores had no evidence of the type of ceiling effects shown by the DPR, but this could be because the examination was beyond the examinee pay grade. Group 3 means never exceeded 60% on the CDI test. In a competency-based world, this would be worrisome. It could also demoralize examinees to perform so poorly on what is arguably the most clinically relevant measure they may have faced so far in their training. Another potential cause for the low scores might have been the response scale. Examinees were presented with a brief clinical scenario followed by a “what if” situation that included new information predicated on their thinking of a particular diagnostic hypothesis. The examinee was then asked to select one of five response options that described how the new information affected his or her view of the correctness of the diagnostic hypothesis: the hypothesis is almost eliminated; the hypothesis becomes less probable; the information has virtually no effect on the hypothesis; the hypothesis becomes more probable; or the hypothesis is almost certainly correct.
The two extreme values on both ends could be considered a subset of the second and fourth values, respectively (being “almost certain” is essentially the same as being “much, much, much more probable”). Navigating the subtleties of the possible responses may have required more clinical judgment than medical students can be expected to muster. Qualifying the extreme options by “almost” introduces a degree of complexity that can cause items to move beyond a medical student's pay grade. How much more or less probable must something be for it to be almost definitive? It would be interesting to see what the results would be if the questions were scored on a three-point scale by pooling the two extreme values with their neighbors.
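As a purely hypothetical illustration of the suggested three-point rescoring (the option labels below are paraphrased from the commentary, not the instrument's exact wording), each “almost” extreme could simply be pooled with its neighbor:

```python
# Hypothetical pooling of the five CDI response options into three,
# folding each "almost" extreme into its neighboring option.
FIVE_TO_THREE = {
    "hypothesis almost eliminated": "less probable",
    "hypothesis less probable": "less probable",
    "virtually no effect": "no effect",
    "hypothesis more probable": "more probable",
    "hypothesis almost certainly correct": "more probable",
}

def rescore(response):
    """Map a five-option response to the pooled three-point scale."""
    return FIVE_TO_THREE[response]

print(rescore("hypothesis almost certainly correct"))  # more probable
```

Rescoring existing responses this way would require no new data collection, only a remapping before item analysis.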
Reliability Influences in Clinical Assessment
The authors conclude that reliability values found with Group 2 were somewhat higher than in other groups because “student knowledge applicable to these examinations was best integrated in the minds of students at that stage of their training.”1 As a statistician, my first thought was that Group 2 must have had the greatest variability, because higher reliability usually is associated with greater variability. As expected, the SD for all five schools was largest for Group 2. So, what could create the greatest variability in examinees? It may be that the material has the best integration at that point, as the authors say; but, the point after the second year of study may be where students are the most variable in their readiness to go to the next level. Some students may still have been clueless and struggling to make sense of the mass of clinical information presented in the tests, while others were fast making their way to being sophisticated clinical diagnosticians. It may be for these latter students that the material was best integrated after the second year; for in the third year, when many become immersed in different clinics with so many different ways of doing things, learning the mechanics of the clinic might slow down or even interfere with students' continuing diagnostic growth.
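The statistical point here, that reliability rises with group variability when measurement error is held constant, follows from the classical test theory identity reliability = 1 − (error variance / observed variance). A minimal sketch with illustrative variance values (not figures from the study):

```python
def reliability(observed_var, error_var):
    """Classical test theory: reliability = 1 - error variance / observed variance."""
    return 1.0 - error_var / observed_var

# Holding measurement error fixed, the more variable group (larger SD)
# yields the higher reliability coefficient. Values are illustrative only.
error_var = 25.0
print(reliability(12.0 ** 2, error_var))  # SD of 12 -> about 0.83
print(reliability(9.0 ** 2, error_var))   # SD of 9  -> about 0.69
```

This is why the larger Group 2 SDs at all five schools are sufficient, by themselves, to explain the higher Group 2 reliability values.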
Interpreting Differences Between Schools
Finally, when comparing results of an assessment between schools, as Williams and colleagues have done, interpretation of the differences must be handled with care. The authors found that 6.7% of the DPR variance and 3.4% of the CDI variance was related to schools (including both the school and interaction variance), and that performance at all five schools ended up at about the same place among Group 3 students. These percentages of variance attributable to medical school are also consistent with what other investigators have found.2 Williams and colleagues suggest that schools should consider these results “before investing time and effort in curriculum revision.”1 They later qualify this by indicating that curriculum differences may influence other types of student performance not measured by these tests, but “changes in student knowledge and diagnostic ability are likely to be modest”1 as a result of curriculum change.
So, what is a “modest” difference in performance, and when is a modest difference something important? First, what is modest? The school means reported in this study are based on scores from 72 to 190 students. So, even a small difference in a school's mean (or means between schools) means that a large number of students had to do something different to create that difference. It is also important to recall that the interaction between school and year was statistically significant, meaning that the differences in gains between schools were more than random variation. Focusing on the differences in schools in the gain from one year to the next, the DPR showed gains (differences in means) in the first year that ranged from 14.39 to 21.83. The largest gain at a school was 52% higher than the smallest gain ([21.83 − 14.39]/14.39 × 100 = 52%). Table 1 shows analogous comparisons from year to year for Groups 0 and 1, Groups 1 and 2, and Groups 2 and 3 for the DPR and CDI.
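The arithmetic behind that comparison, using the first-year DPR gains reported in the study, is simply a relative difference:

```python
# First-year DPR gains from the study: largest 21.83, smallest 14.39 points.
largest_gain, smallest_gain = 21.83, 14.39
relative_diff = (largest_gain - smallest_gain) / smallest_gain * 100
print(f"{relative_diff:.0f}%")  # 52%
```

The same formula produces the year-to-year percentages discussed below for the other group comparisons.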
The school-to-school differences in DPR score gain increased at an accelerating rate at the three comparison points, with the largest differential being almost 300% for the final comparison between Groups 2 and 3. The pattern for the CDI differed substantially. The first comparison (Groups 0 and 1) showed almost a 200% difference in gain between the highest and lowest school, declining to only 21% difference at the second comparison (Groups 1 and 2) and then rebounding to over 300% for the third comparison (Groups 2 and 3). Expressing the school differences as percentage gains may magnify some relatively small differences, particularly when comparing Groups 2 and 3, but the point is to focus on the relative differences in the schools. Among the individual schools, School A has the strangest pattern because something occurring during year 2 washes out the gains in CDI scores from the first year. For the CDI, School A's early massive gain between Groups 0 and 1 (15%) petered out to the smallest gain between Groups 1 and 2 (5.2%) and to the only loss suffered by any school on either measure between Groups 2 and 3 (−1.4%). It appears that the schools all had different trajectories, but by year 3 they differed by at most 3% on the DPR and 2% on the CDI—approximately two items. Does it really matter whether the schools take different paths if they end up at about the same place? If some of the paths are like a walk in the park while others are like scaling Mt. Everest, it can make quite a difference. Without knowing more about students' experiences in their third year, it is impossible to know whether the curricular differences were significantly different, or just a walk in a different park.
So, do the differences found by Williams and colleagues truly make a difference? For starters, keep in mind that the exercise was low-stakes for all participating students. Not only were the stakes low, they were nonexistent because all participants were volunteers. How students would perform if these tests were to become part of student promotion requirements cannot be known.
Assuming that the observed differences would be maintained in a high-stakes use, is accounting for 3.4% to 6.7% of the variance a difference that makes a difference? For starters, ceiling effects could be compressing or truncating the highest scores, which would make the variance due to curriculum actually larger than these values. But, even assuming that these variance values are accurate, medical schools are complex systems. The more we learn about complex systems, the more we find that change can be unpredictable,3,4 and the difference between a good (or not so good) medical school and a great medical school can be nuanced. So, 3.4% to 6.7% of the variance may not seem like much—but only 4% of our DNA keeps you and me from being a chimpanzee.5 Does that make a difference? You make the call.
The author's comments are strictly his own and reflect neither the National Conference of Bar Examiners nor the University of Wisconsin.
1 Williams RG, Klamen DL, White CB, et al. Tracking development of clinical reasoning ability across five medical schools using a progress test. Acad Med. 2011;86:1148–1154.
2 Hecker K, Violato C. How much do differences in medical schools influence student performance? A longitudinal study employing hierarchical linear modeling. Teach Learn Med. 2008;20:104–113.
3 Plsek PE. Redesigning health care with insights from the science of complex adaptive systems. In: Crossing the Quality Chasm: A New Health System for the 21st Century. Washington, DC: National Academies Press; 2001:309–323.
5 Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature. 2005;437:69–87.