Psychometrics of High-Stakes Examinations

A Multivariate Generalizability Analysis of History-Taking and Physical Examination Scores From the USMLE Step 2 Clinical Skills Examination

Clauser, Brian E.; Balog, Kevin; Harik, Polina; Mee, Janet; Kahraman, Nilufer

Editor(s): Szauter, Karen MD; Blackmore, David PhD

doi: 10.1097/ACM.0b013e3181b36fda


Standardized-patient-based assessments provide a complex and potentially realistic context for assessing physicians' clinical skills. Individual patient encounters often result in multiple scores. In some assessments, these scores are aggregated to produce a single composite measure, and in others the examinee may receive a profile representing performance across multiple distinct proficiencies. For example, it is not unusual for these assessments to provide separate scores for communication skills and skills related to data collection. The Educational Commission for Foreign Medical Graduates (ECFMG) Clinical Skills Assessment produced a score for communications and interpersonal skills, a score for data gathering, and a documentation score. These latter two scores were combined to produce a composite score.1 The United States Medical Licensing Examination (USMLE) Step 2 Clinical Skills (CS) Examination produces four separate scores: (1) communication and interpersonal skills, (2) spoken English proficiency, (3) data gathering, and (4) documentation. Again, these latter two are combined to produce a single composite score.2

Because these assessments produce multiple scores, and because examinee performance on an individual encounter may contribute to each of those scores, classical test theory models may be inappropriate for evaluating the precision of the scores and assessing the relationships among the multiple proficiencies. Multivariate generalizability theory was developed specifically to overcome the limitations of classical test theory in these contexts.3,4 Surprisingly, there has been relatively little research using multivariate generalizability theory to evaluate scores from clinical skills assessments. Margolis and colleagues1 used multivariate generalizability theory to examine the precision of scores for the ECFMG's clinical skills assessment. The use of this framework allowed them to examine the relationships between the proficiencies measured by the multiple scores. The results supported the practice of combining scores from data gathering and documentation to create a composite; the correlation between these two proficiencies was relatively high compared with the correlation between either of these proficiencies and the communication proficiency.

Clauser and colleagues2 examined the same relationships in the context of the USMLE Step 2 CS Examination. Again, the results supported combining the data-gathering and documentation scores to produce a composite. Both of these papers also examined the extent to which measurement error was correlated across the multiple scoring components—that is, the extent to which an examinee who does unexpectedly well on one component of a particular encounter is likely to do unexpectedly well on another component. These papers make a contribution to the understanding of scores from clinical skills assessments, but much remains unexamined.

Although items representing history taking and physical examination are commonly combined to form a composite score, relatively little attention has been given to the relationship between these proficiencies. By tradition, physical examination and history taking are combined as though it were self-evident that these are, if not identical, very closely related skills. This view has been accepted largely in the absence of empirical evidence.

The present research continues the work begun by Clauser and colleagues.2 That paper treated the combination of checklist items relating to history taking and physical examination as a given. The results showed a reasonably strong relationship between the combined data-gathering score and the documentation score; this result supported combining these two scores to produce a composite. The present research examines the relationship between the component history-taking and physical examination scores and the relationship of these component scores with the documentation score.


This study examines data from the USMLE Step 2 CS Examination. Because all data used in this study were collected as part of normal operational testing and all examinees included in the data set had given approval for their responses to be used in research, IRB approval was not obtained. To ensure examinee anonymity, only group-level results are reported. Examinee identifiers were removed from the data before completing the analysis.

In the CS examination, each examinee interacts with 12 standardized patients. The examinee has up to 15 minutes to collect a patient history and complete a focused physical examination. The examinee then documents findings in a structured patient note. While the examinee is writing the note, the standardized patient completes three assessment tools. The focus of the present research is on the dichotomously scored checklist used to record whether the examinee asked about important aspects of the patient's history and performed identified physical examination maneuvers. The patient also evaluates the examinee's spoken English proficiency and communication and interpersonal skills using rating scales. The patient notes are subsequently rated by trained physicians.

The score for data gathering represents the proportion of points available within each case for which the examinee received credit, averaged across cases. Operationally, only a single data-gathering score is produced for each case. For this research, separate scores were produced based on the history items and physical examination items. The scores were based on the proportion of available points for which the examinee received credit within a category within the case. A small proportion of checklists do not have physical examination items; these cases were dropped from the analysis.
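The scoring rule described above can be illustrated with a minimal sketch. The data layout (a list of `(category, points, credited)` tuples per case) and the function names are assumptions made for illustration; they are not the operational scoring system.

```python
from collections import defaultdict


def category_scores(items):
    """Proportion of available points credited, per category, for one case.

    `items` is a hypothetical layout: (category, points_available, credited)
    tuples, one per dichotomously scored checklist item.
    """
    earned = defaultdict(float)
    available = defaultdict(float)
    for category, points, credited in items:
        available[category] += points
        if credited:
            earned[category] += points
    return {cat: earned[cat] / available[cat] for cat in available}


def usable_case(items, required=("history", "physical")):
    """True when the checklist has at least one item in every required category."""
    categories = {category for category, _, _ in items}
    return all(name in categories for name in required)


# Hypothetical checklist for a single encounter.
example_case = [
    ("history", 1, True),
    ("history", 1, False),
    ("physical", 1, True),
]
example_scores = category_scores(example_case)
```

A case for which `usable_case` returns `False` corresponds to a checklist without physical examination items, which the study dropped from the analysis.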

As noted previously, a multivariate generalizability theory framework was used for the analysis. In this framework, analysis of variance procedures are used to estimate variance components representing the contribution of various effects to the total score variance. This framework also provides a basis for predicting how the measurement precision would change as a function of test length. Multivariate generalizability theory extends the framework to include covariance components as well as variance components. This allows the researcher to examine the precision of individual scores, as well as the relationship between the proficiencies measured by different scores, the precision of composite scores produced through the weighted combination of individual scores, and the extent to which sources of error are correlated across scores.

The analyses were implemented using the mGENOVA software.4 Within each test administration session, the 12 examinees were each scored on 10 encounters. Ten encounters were selected to ensure that all encounters had a score for both history taking and physical examination. A total of 153 test sessions that met these criteria took place during the selected three-month period at the end of 2007 and beginning of 2008; this resulted in a total sample of 1,836 examinees. Within each session, there was a crossed design (i.e., each examinee in the session completed the same 10 cases). Similarly, the same 10 cases were scored for each examinee across each of the three proficiencies of interest: history taking, physical examination, and documentation. To take advantage of this crossed design, the analyses were run within sessions, and the resulting variance components were averaged across sessions.
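As a rough illustration of the within-session analysis, the sketch below estimates variance and covariance components for a balanced examinee-by-case crossed design from mean squares and mean products, then averages the estimates across sessions. It is not the mGENOVA implementation; the function names and array layout are assumptions, and it presumes one score per examinee per case with no missing data.

```python
import numpy as np


def covariance_components(x, y):
    """Estimate examinee (p), case (c), and residual (pc,e) components.

    x, y: arrays of shape (n_examinees, n_cases) holding two scores for the
    same examinees and cases. Passing the same array twice yields the
    ordinary variance components for that score.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n_p, n_c = x.shape

    # Deviations of examinee means and case means from the grand mean.
    px, py = x.mean(axis=1) - x.mean(), y.mean(axis=1) - y.mean()
    cx, cy = x.mean(axis=0) - x.mean(), y.mean(axis=0) - y.mean()

    # Sums of products (sums of squares when x is y).
    sp_p = n_c * np.sum(px * py)
    sp_c = n_p * np.sum(cx * cy)
    sp_total = np.sum((x - x.mean()) * (y - y.mean()))
    sp_pc = sp_total - sp_p - sp_c

    # Mean products, then the expected-mean-product solutions.
    mp_p = sp_p / (n_p - 1)
    mp_c = sp_c / (n_c - 1)
    mp_pc = sp_pc / ((n_p - 1) * (n_c - 1))
    return {
        "p": (mp_p - mp_pc) / n_c,
        "c": (mp_c - mp_pc) / n_p,
        "pc,e": mp_pc,
    }


def average_over_sessions(sessions):
    """Average component estimates over a list of (x, y) session arrays."""
    estimates = [covariance_components(x, y) for x, y in sessions]
    return {k: float(np.mean([e[k] for e in estimates])) for k in estimates[0]}
```

In the study's design, each session would contribute a 12-by-10 array per score, and the function would be called for every pair of scores (including each score with itself) before averaging.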

The estimated variance and covariance components were used to calculate generalizability coefficients and phi coefficients. These coefficients were estimated for each of the individual scores and for a composite score representing a weighted average of the history-taking and physical examination scores.
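For a single score in an examinee-by-case design, the two coefficients follow the standard generalizability theory formulas: the generalizability coefficient treats only the examinee-by-case (residual) variance as error, whereas the phi coefficient also counts the case variance. The sketch below assumes those textbook formulas; the numeric inputs are hypothetical and are not taken from Table 1.

```python
def generalizability_coefficient(var_p, var_pc_e, n_cases):
    """Relative-decision coefficient: var_p / (var_p + var_pc_e / n_cases)."""
    return var_p / (var_p + var_pc_e / n_cases)


def phi_coefficient(var_p, var_c, var_pc_e, n_cases):
    """Absolute-decision coefficient: case variance also counts as error."""
    return var_p / (var_p + (var_c + var_pc_e) / n_cases)


# Hypothetical variance components, chosen only to illustrate the arithmetic.
example_g = generalizability_coefficient(var_p=0.002, var_pc_e=0.02, n_cases=10)
example_phi = phi_coefficient(var_p=0.002, var_c=0.004, var_pc_e=0.02, n_cases=10)
```

Because `n_cases` appears only in the error term, the same functions can be used to project how precision would change with a longer or shorter test.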


Table 1 contains the variance and covariance components from the analysis. Because the analysis was completed using a crossed design, it is possible to estimate variance components for examinees, cases, and a term that includes the examinee-by-case interaction and other effects not accounted for in the design. These estimates are displayed on the diagonal within each effect for each of the three scores. The covariances, representing the relationship between score effects across the scores, are presented below the diagonal; the values above the diagonal are the correlations between score effects. For example, 0.00223 is the examinee variance component for the history-taking score, 0.00179 is the covariance of the examinee effect for history taking and that for the physical examination score, and 0.452 is the correlation between the same effects.
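The correlation reported above the diagonal is the covariance component divided by the square root of the product of the two corresponding variance components. The sketch below applies that relationship to the three values quoted in the text. The physical examination variance component is not quoted in this passage, so the code backs out the value implied by the quoted figures; that implied value is a derived illustration, subject to rounding, and not a number reported by the authors.

```python
import math


def effect_correlation(covariance, variance_a, variance_b):
    """Correlation between two score effects from their (co)variance components."""
    return covariance / math.sqrt(variance_a * variance_b)


def implied_variance(covariance, correlation, variance_a):
    """Variance component for the second score implied by the other three values."""
    return (covariance / correlation) ** 2 / variance_a


# Values quoted in the text for the examinee effect.
var_history = 0.00223
cov_history_physical = 0.00179
corr_history_physical = 0.452

# Derived, not reported: the physical examination variance these values imply.
var_physical_implied = implied_variance(
    cov_history_physical, corr_history_physical, var_history
)
```

Taken at face value, the quoted figures imply a physical examination examinee variance of roughly 0.007.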

Table 1:
Variance and Covariance Components, Generalizability, and Phi Coefficients for History Taking, Physical Examination, and Documentation Scores

The variance components presented in Table 1 are identical to the values that would be produced if univariate analyses were run for each of the scores separately. This allows calculation of the generalizability and phi coefficients for each of the scores. These values are also displayed in Table 1. The generalizability coefficient reported for the documentation score is similar to, though somewhat higher than, that reported in previous analyses of data from this examination.2 Because the component history-taking and physical examination scores are each based on a subset of the items, it is not surprising that the generalizability coefficients for these scores are lower than corresponding values reported previously for the composite data-gathering score. The phi and generalizability coefficients indicate that the physical examination score is less generalizable than the score for history taking.

The correlations for the examinee effect can be interpreted similarly to true-score correlations. That is, they represent the strength of relationship between the proficiencies measured by the scores after removing the effects of measurement error. These results suggest a reasonably strong relationship between history taking and documentation and a weaker relationship between history and physical examination or physical examination and documentation.

The correlations for the case effect reflect the tendency for a case that is relatively difficult with respect to one of the proficiencies to be similarly difficult with respect to another proficiency. The results show a modest correlation between history and physical examination and between history and documentation, but almost no relationship between the relative difficulty of the physical examination challenge on a case and the corresponding challenge for documentation. The correlations for the examinee-by-case effect reflect the extent to which an examinee who does unexpectedly well on a given score for one case might be expected to do similarly well on another score on that same case. There is a modest effect for the relationship between history taking and documentation and little if any relationship between the other two pairs of scores.

Figure 1 shows the generalizability and phi coefficients for a composite score based on the weighted average of the history-taking and physical examination scores. The generalizability coefficient in this design could be viewed as equivalent to coefficient alpha. That is, it provides an estimate of the correlation between scores on equivalent forms of the examination. This provides an appropriate index of reliability when comparisons are to be made between examinees and all examinees have completed the same forms of the examination. The phi coefficient provides an estimate of reliability that is appropriate when domain-referenced score interpretations are to be made or when comparisons are to be made between examinees who have completed different forms of the examination. The figure shows how these coefficients change as a function of the proportion of weight assigned to the history-taking score in the composite. When all of the weight is assigned to history taking, the coefficients are the same as those that would be produced in a univariate analysis of the history-taking score. When no weight is assigned to history taking, the result is that for the univariate analysis for the physical examination score. Figure 1 demonstrates that the choice of weights makes a substantial difference in the precision of the scores. Because the history-taking score is more generalizable than that for physical examination, the composite reaches maximum generalizability when more weight is assigned to history taking. Although the generalizability coefficient is maximized when the proportional weight for history taking is around 0.80, weights ranging from around 0.70 to 0.85 produced similar results.
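The dependence of composite precision on the weights can be sketched as follows. For a weight vector w, the composite universe-score variance is w'Σp w, and the error variances are the corresponding quadratic forms in the case and residual matrices divided by the number of cases. The matrices below are hypothetical placeholders rather than the Table 1 estimates, so the resulting curve illustrates the method only and does not reproduce Figure 1.

```python
import numpy as np

# Hypothetical 2 x 2 (co)variance component matrices; order: history, physical.
SIGMA_P = np.array([[0.0020, 0.0010], [0.0010, 0.0060]])   # examinee effect
SIGMA_C = np.array([[0.0040, 0.0010], [0.0010, 0.0100]])   # case effect
SIGMA_PC = np.array([[0.0100, 0.0010], [0.0010, 0.0600]])  # residual effect
N_CASES = 10


def composite_coefficients(w_history, sigma_p, sigma_c, sigma_pc, n_cases):
    """Return (generalizability, phi) for a two-score weighted composite."""
    w = np.array([w_history, 1.0 - w_history])
    universe = w @ sigma_p @ w
    relative_error = (w @ sigma_pc @ w) / n_cases
    absolute_error = (w @ (sigma_c + sigma_pc) @ w) / n_cases
    g = universe / (universe + relative_error)
    phi = universe / (universe + absolute_error)
    return float(g), float(phi)


# Sweep the proportion of weight assigned to history taking.
weights = np.linspace(0.0, 1.0, 21)
g_values = [
    composite_coefficients(w, SIGMA_P, SIGMA_C, SIGMA_PC, N_CASES)[0]
    for w in weights
]
best_weight = float(weights[int(np.argmax(g_values))])
```

At a weight of 1 the function returns the univariate coefficients for history taking, and at a weight of 0 those for physical examination, which matches the end points described for the figure.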

Figure 1:
This graph shows the phi and generalizability coefficients for a weighted composite combining physical examination and history-taking scores. The results vary as a function of the proportion of weight assigned to each of the scores.


This paper examined the relationship between scores associated with history-taking and physical examination items on checklists from USMLE Step 2 CS. Two main results warrant discussion. First, this analysis suggests that the proficiencies represented by these two scores have only a modest to moderate relationship. Given the common practice of combining scores from these two areas, this result may be surprising because, typically, composites are only formed from scores that have a relatively strong relationship. Despite these results, it is understandable why items from these areas are commonly combined; a physician will likely require both sources of data to form a diagnosis, and in some cases there may even be a level of redundancy between these two sources of information.

The relatively low correlation between these proficiencies does not automatically make it inappropriate to combine the scores. The rationale regarding how the scores are to be interpreted may be more important in making this decision than statistical considerations. The low correlation does suggest, however, that the composite score may provide a relatively limited basis for drawing conclusions about an examinee's performance in the separate areas. It would be inappropriate to assume that an examinee with an average data-gathering score is of average proficiency in both history taking and physical examination. This suggests that when sufficiently reliable subscores can be produced, separate scores for these two areas may be useful.

A second interesting result from this analysis is that the history-taking score is more reliable than the physical examination score. This may suggest that the specific skills required for completing a physical examination are less generalizable across cases than those for history taking, but it should be remembered that there are typically fewer physical examination items than history-taking items on each checklist. Additional analysis is needed to understand whether this result is a function of checklist length or a characteristic of the proficiencies.

Although the results reported in this paper provide important information about the practice of combining history-taking and physical examination items to produce a composite score, the reader should be aware that the results are limited to the context in which the data were collected. It may be that with alternative procedures for developing checklists, the relationships between these proficiencies would change. It is also possible that examinees may perform differently in a high-stakes assessment (such as the Step 2 CS Examination) than they would in a training or practice setting.


1 Margolis MJ, Clauser BE, Swanson DB, Boulet JR. Analysis of the relationship between score components on a standardized patient clinical skills examination. Acad Med. 2003;78:S68–S71.
2 Clauser BE, Harik P, Margolis MJ. A multivariate generalizability analysis of data from a performance assessment of physicians' clinical skills. J Educ Meas. 2006;43:173–191.
3 Cronbach LJ, Gleser GC, Nanda H, Rajaratnam N. The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles. New York, NY: John Wiley & Sons, Inc; 1972.
4 Brennan RL. Generalizability Theory. New York, NY: Springer-Verlag; 2001.
© 2009 Association of American Medical Colleges