Simulation technology is currently being explored as one method for assessing physical examination clinical competence in high-stakes examinations, such as the Royal College of Physicians and Surgeons of Canada’s Comprehensive Examination in Internal Medicine (RCPSC IM examination)1 and the United States Medical Licensing Examination.2 One widely studied arena for simulated physical abnormalities is the cardiac physical examination, with available cardiac simulations ranging from audiotaped recordings of real patient findings3 to cardiac patient simulator (CPS) mannequins.4
Integral to the use of simulators in assessment is the development and validation of tools to measure the educational outcomes. In a recent study comparing internal medicine physicians’ physical examination competence on a CPS with their competence with real cardiac patients, the limitations of the currently available outcome assessment measures became apparent.5 Of the two subjective outcome measures (physical examination technique and overall competence), physical examination technique was not a stand-alone measure of competence. Despite an orientation for assessors, the components contributing to the overall rating of clinical competence varied among the real patient, standardized patient (SP) with associated audio-video cardiac abnormalities, and CPS modalities.6 Of the two objective outcome measures (diagnostic accuracy and clinical findings), the examiners’ recording of clinical findings was sufficiently problematic that data analysis was not possible: the interrater reliability of interpreting and scoring these transcripts was very low.
Before additional research into the use of simulators and real patients for assessment purposes can be completed, more reliable and valid outcome measures must be developed.5 The ideal method for rating a CPS physical examination station is unknown. The extensive literature on assessment using SPs supports the use of global rating scales in the hands of expert, well-trained assessors.7 When raters are outside their range of expertise or less well trained, checklists with SPs may provide a more reliable assessment format, although checklists can be insensitive to increasing levels of expertise.8 Most of this SP work is based on items of history rather than abnormal physical findings, and it is not known whether a checklist, a global rating, or a combination of the two is most suitable for a simulator-based assessment of physical examination competence.
In the current report, we describe the development and validation, during a high-stakes national specialty examination, of an approach to scoring using a cardiac findings checklist that may be used as one outcome measure for simulator-based assessments.
Development of the cardiac findings checklist followed rigorous instrument-development procedures.9 The checklist comprised all the potential findings the CPS could present for the diagnoses included in the study, with an “other” column for recording any unanticipated findings a participant stated. The list of correct cardiac findings for each diagnosis included, but was not limited to, the most accurate cardiac findings for that diagnosis based on relevant literature such as JAMA’s Rational Clinical Examination series.10 The checklist was organized according to the standard presentation of cardiac findings (Appendix). The items and organization were reviewed by the CPS developers and the outcome assessors.
The checklist was implemented, for investigational purposes only (ie, not contributing to the final examination score), during the 2007 RCPSC IM examination. On each examination day, one station assessed the candidate’s cardiac physical examination competence using a CPS mannequin. The candidate entered the examination room, was read the same nonleading clinical scenario, and was asked to perform a relevant and focused cardiac examination on the CPS. As in all bedside stations at this examination, candidates were expected to “think aloud,” describing the expected or observed cardiac findings as they examined the patient. At the end of the physical examination, the candidate verbally provided a final cardiac diagnosis for the patient. Each examination day, one of five cardiac diseases was programmed into the CPS. The five diseases were chosen by the Examination Board, independent of the CPS developers, on the basis of disorders that are common and/or important for general internists.11
Two trained raters (one RCPSC examiner and one observer) were present at the CPS stations of 251 candidates. The pairs of raters varied by examination day. The observer completed the cardiac findings checklist during the station, recording the individual cardiac findings each candidate verbally reported while examining the CPS. Having an observer record the findings, rather than having the candidate complete a checklist, was chosen to intrude as little as possible on the usual examination process. The RCPSC examiner recorded the candidate’s official station score, based on ratings of physical examination technique and accuracy of the final diagnosis. Because the examiner’s primary focus was on determining the candidate’s official station score, the examiner was not asked to complete the cardiac findings checklist in this initial investigation. For the purposes of our study, both raters independently assigned a global rating of cardiac physical examination competence, considering the candidate’s physical examination technique, accuracy of reported cardiac findings, and accuracy of the final diagnosis.
Ethics approval for the study was provided by the Ottawa Hospital Research Ethics Board.
Validation of a Checklist Approach to Scoring
The same two investigators independently scored all completed checklists. The answer key corresponded to the cardiac findings determined by the CPS computer program for each diagnosis used during the 2007 RCPSC IM examination, with a minimum of one correct item for each response category (ie, palpate pulses, jugular venous pressure height, jugular venous pressure waveform, etc; Appendix). Incorrect findings were findings verbalized by the candidate that did not match the programmed diagnosis; a response category in which the candidate verbalized nothing was scored as neither correct nor incorrect. Candidate responses were scored for the number of correct findings and the number of incorrect findings, and each candidate’s final scores for correct and incorrect findings were based on a consensus of the two investigators.
Interrater reliability for scoring the number of correct and incorrect cardiac findings was calculated using an intraclass correlation, for the dataset as a whole and by cardiac diagnosis. As the content of the examination is confidential, we are unable to name the specific cardiac diagnoses, and have labeled them diagnosis A, diagnosis B, etc. For data analysis, the CF%accurate score was calculated as:
CF%accurate = (no. of correct findings − no. of incorrect findings) / (total no. of possible correct findings) × 100%.
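As a concrete sketch of this calculation (the function name and the example counts below are ours, chosen for illustration, and are not part of the examination’s scoring procedures), the score can be computed as:

```python
def cf_percent_accurate(n_correct: int, n_incorrect: int, n_possible: int) -> float:
    """CF%accurate = (correct - incorrect) / total possible correct x 100%.

    The score can be negative when a candidate reports more incorrect
    findings than correct ones, penalizing indiscriminate reporting.
    """
    return (n_correct - n_incorrect) / n_possible * 100.0

# Hypothetical candidate: 6 correct and 1 incorrect finding reported,
# with 8 possible correct findings for the programmed diagnosis
score = cf_percent_accurate(6, 1, 8)  # 62.5
```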
To address the relationship between correct and incorrect findings, a Pearson correlation coefficient was calculated between candidates’ mean number of correct and mean number of incorrect findings. Pearson correlation coefficients were used to compare CF%accurate scores with the raters’ global rating of cardiac physical examination competence on the CPS station and overall performance on the RCPSC IM oral examination.
The interrater reliability for the global rating of cardiac physical examination competence was 0.93 (range, 0.90–0.96). Overall, the interrater reliability for scoring correct cardiac findings was 0.95 (95% confidence interval: 0.94–0.96) and for scoring incorrect cardiac findings was 0.72 (95% confidence interval: 0.65–0.77). The interrater reliabilities for each cardiac diagnosis are shown in Table 1. The candidates’ mean number of correct and incorrect findings by diagnosis and the mean CF%accurate scores are shown in Table 1. The Pearson correlation between mean correct and incorrect findings was −0.40.
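For readers wishing to reproduce this type of reliability analysis, a two-rater intraclass correlation can be sketched as follows. This is a generic Shrout–Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater) implementation with invented ratings; it is not the study’s analysis code, and the study does not specify which ICC form was used:

```python
def icc_2_1(ratings):
    """Shrout-Fleiss ICC(2,1). `ratings` is a list of rows, one row per
    subject, each row holding one score per rater."""
    n = len(ratings)     # number of subjects
    k = len(ratings[0])  # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)  # subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)  # raters
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-subjects mean square
    msc = ss_cols / (k - 1)                 # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical counts of correct findings recorded by two raters
# for five candidates (invented data, near-perfect agreement):
scores = [[6, 6], [5, 5], [4, 5], [7, 7], [3, 3]]
icc = icc_2_1(scores)  # close to 1, indicating high agreement
```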
The Pearson correlation coefficient between CF%accurate and the global rating of cardiac physical examination competence was 0.60 for the examiner and 0.64 for the observer. The correlations between CF%accurate and the global rating for each diagnosis are shown in Table 1. The Pearson correlation coefficient between CF%accurate and the final RCPSC IM oral examination score was 0.31.
The current study describes an initial investigation into the use of a cardiac findings checklist as one measure of cardiac clinical competence as assessed using a CPS. The validity of this approach to scoring has been examined from multiple aspects, based on Downing’s framework for validity evidence.12
The CF checklist encompassed all of the cardiac findings for the cardiac diagnoses assessed using the CPS during the RCPSC IM examination. For this source of validity evidence, the response format was a straightforward checklist, with a write-in option for unexpected candidate responses.12 The frequency of write-ins varied with the observer and with the cardiac diagnosis, but write-ins generally occurred at least once per candidate. The interrater reliability for scoring the correct findings on the checklist was high, providing further evidence of the validity of the checklist in this setting.12
Another source of validity evidence is the relationship of the assessment tool to other variables.12 The CF%accurate score had a modest correlation with the candidate’s global rating of cardiac clinical competence and a lower correlation with their overall examination performance. The CF%accurate scores for each diagnosis varied in their correlation with the global ratings (Table 1), but there was no consistent pattern to this variability (ie, the most difficult diagnoses had neither the highest nor the lowest correlations). These overall modest correlations are not unexpected, as other components of performance, such as physical examination technique and final cardiac diagnosis, also contributed to the overall global rating of cardiac clinical competence. These results are consistent with a previous study that demonstrated variability in the weighting of physical examination technique and accurate cardiac diagnosis in the global rating of performance, depending on the simulation modality.6 It is likely that examiners similarly weight cardiac findings differently in the global rating depending on the cardiac diagnosis. However, the variability in the correlations is an important issue for examination planning, as it highlights the importance of testing across a number of diseases.
As most of the other stations of the RCPSC IM examination do not test the candidate’s ability to detect abnormal clinical findings (they use SPs without physical abnormalities), the lower correlation between the CF%accurate score and the overall examination score may reflect the measurement of different aspects of competence. The negative correlation between correct and incorrect findings suggests that candidates were not using a “shot-gun” approach to reporting cardiac findings; rather, the most accurate candidates reported the fewest incorrect findings, and vice versa.
The current study is not without limitations. A single observer recorded the candidate’s verbal description of the cardiac findings, and the reliability of this recording process is unknown. When scoring the CF checklist, we found a lower interrater reliability for the scoring of incorrect findings, with variability in these reliabilities by cardiac diagnosis (Table 1). The main source of disagreement appeared to be the write-in options on the checklist, as the diagnoses with the lowest interrater reliabilities for incorrect findings also appeared to have the most write-ins. This is consistent with the results of our previous study,5 in which one observer attempted to transcribe verbatim the participant’s verbal report of cardiac findings for real patient and simulation-based modalities; when two investigators subsequently and independently scored these written reports, the interrater reliability was very poor. Write-in responses require interpretation by the individual scoring the checklist, unlike the preset response options. Finally, the current study does not address how many stations would be necessary for a valid CPS examination, irrespective of the outcome measure, as the CPS comprised only a single station in our internal medicine specialty examination.
The modest correlations of the checklist score with other measures of clinical performance indicate that the checklist is not a stand-alone measure, but it does provide one additional objective outcome measure for use in assessment with a CPS. Our previous research demonstrated modest correlations between cardiac physical examination performance using the simulator and real patients but did not include a cardiac findings checklist in the scoring rubric.5 Based on the current study, inclusion of the checklist as one objective measure may allow fairer comparison between assessment modalities in future research. Future research should address how best to incorporate a checklist approach into a scoring rubric for a CPS. In addition, future studies should compare performance between novices and experts using the checklist to further assess its construct validity. To the extent that the CPS is typical of simulator-based assessments of physical examination competence, other such modalities may also benefit from considering a checklist-based approach to scoring. The current study provides reasonable validity evidence for a cardiac findings checklist as one approach to assessing cardiac clinical performance with a CPS.
1. Hatala R, Kassen BO, Nishikawa J, Cole G, Issenberg SB. Incorporating simulation technology in a Canadian national specialty examination: a descriptive report. Acad Med
2. Dillon GF, Boulet JR, Hawkins RE, Swanson DB. Simulations in the United States Medical Licensing Examination. Qual Saf Health Care
3. Vukanovic-Criley JM, Criley S, Warde CM, et al. Competency in cardiac examination skills in medical students, trainees, physicians and faculty. Arch Intern Med
4. Issenberg SB, McGaghie WC, Hart IR, et al. Simulation technology for health care professional skills training and assessment. JAMA
5. Hatala R, Issenberg SB, Kassen B, Cole G, Bacchus CM, Scalese RJ. Assessing cardiac physical examination skills using simulation technology and real patients: a comparison study. Med Educ
6. Hatala R, Issenberg SB, Kassen BO, Cole G, Bacchus CM, Scalese RJ. Assessing the relationship between cardiac physical examination technique and accurate bedside diagnosis during an OSCE. Acad Med
7. Norcini J, Boulet J. Methodological issues in the use of standardized patients for assessment. Teach Learn Med
8. Hodges B, Regehr G, McNaughton N, Tiberius R, Hanson M. OSCE checklists do not capture increasing levels of expertise. Acad Med
10. Sackett DL. The rational clinical examination series. A primer on the precision and accuracy of the clinical examination. JAMA
11. Mangione S, Duffy FD. The teaching of chest auscultation during primary care training: has anything changed in the 1990s? Chest
12. Downing SM. Validity: on the meaningful interpretation of assessment data. Med Educ