For medical students, objective structured clinical examinations and standardized patient (SP)-based examinations have become the method of choice for assessing clinical skills.1–5 SPs are not only used for clinical skills assessments in schools4–7 and residency training programs8–10 but also used in high-stakes medical licensing examinations in Canada and the United States: Medical Council of Canada's Qualifying Examination Part II (MCCQE Part II), United States Medical Licensing Examination Step 2 test of Clinical Skills, and the Comprehensive Osteopathic Medical Licensing Examination Level 2 Performance Evaluation (COMLEX Level 2-PE).1,2 SPs are individuals who are trained to portray clinical scenarios.2,11 A limited amount of data has been published regarding the characteristics of SPs used in high-stakes testing. For the United States Medical Licensing Examination Step 2 test of Clinical Skills, SPs are actors, housewives, teachers, and retirees from diverse cultural backgrounds between the ages of 18 and 75.12 In addition, it has been reported that many of these SPs demonstrated personality traits such as extroversion and openness to new experiences.12
Numerous studies report that SP portrayals by laypersons are realistic13,14 and accurate.13–16 Tamblyn et al15 found that portrayal accuracy varied according to several factors including prior simulation or acting experience, a reported understanding of the simulated condition, and prior experience with a similar health problem. Studies have also shown that SPs are accurate in their recording of candidate performance for summative assessments and clinical skills examinations,12–14,16–19 and in some cases, they are more accurate than physician examiners.18 Previous studies have demonstrated that checklist recording accuracy is correlated with checklist clarity, rater training, checklist length, and time allotted for scoring.19
Despite the extensive research addressing portrayal and checklist scoring accuracy, the effect of SP characteristics on scoring accuracy has not been well documented. One such characteristic is acting or performing arts experience. Experienced actors are able to memorize lines more accurately than novice actors,20 and performing artists may have received additional training to improve memory and recall.21 Therefore, do SPs with performing arts experience display better recall and checklist item scoring accuracy? Although some SPs have theatrical performance backgrounds or acting experience, it is not a requirement of the position. In fact, the National Board of Osteopathic Medical Examiners' experience with the COMLEX Level 2-PE has been that some SPs have little or no theatrical experience. SPs with performance backgrounds may have gained this experience through any number of means, such as high school theater, community theater, professional theater, college course work, performing arts degree, television, film, or comedy clubs. To our knowledge, no study has investigated whether the performance background of SPs affects the accuracy of recording history taking and physical examination checklist items for a high-stakes clinical skills medical licensing examination.
The purpose of this study is to investigate whether SPs who have identified themselves as performing artists and SPs who have identified themselves as nonperforming artists complete checklists with the same level of recording accuracy. It is hypothesized that there is no difference in the mean recording accuracy on overall data gathering, history taking, and physical examination items between SPs who identified themselves as performing artists (SPperforming artists) and SPs who identified themselves as nonperforming artists (SPnonperforming artists). This knowledge could inform future recruitment and training efforts for large-scale SP testing programs.
The COMLEX Level 2-PE is an SP-based clinical skills examination for licensing osteopathic physicians.1–3 Candidates, typically third- or fourth-year osteopathic medical students, rotate through 12 stations interacting with SPs trained to simulate a variety of typical medical complaints. Based on the analysis of national practitioner databanks and expert opinion, the examination blueprint has been constructed so that cases cover a variety of content categories (respiratory, cardiovascular, neuromusculoskeletal, gastrointestinal, and other symptoms) and are designed to vary in age, gender, and race/ethnicity.22 Cases also vary in clinical complaints, which could be acute or chronic or could provide opportunities for health promotion or disease prevention. Each of the 12 stations includes a 14-minute doctor-patient encounter followed by 9 minutes allotted to candidates for the completion of a written patient note (subjective, objective, assessment, plan; SOAP note) documenting their findings and formulating a differential diagnosis and treatment plan for the patient.
Candidates must pass two domains (Humanistic Domain and Biomedical/Biomechanical Domain) to successfully complete the examination. Failure in either domain results in failure for the entire examination. Several clinical skills component scores are used to measure performance in each of the two domains.
The Humanistic Domain is scored using the Global Patient Assessment Tool (GPA Tool) that consists of assessments for each of the following humanistic clinical skills: listening skills, respectfulness, empathy, professionalism, ability to elicit information, and ability to provide information. After the candidate completes the encounter, the SP records the candidate's performance on each of these six clinical skill areas using a Likert-type scale.
The Biomedical/Biomechanical Domain comprises three weighted component scores that reflect osteopathic medical knowledge across all scored clinical encounters: (1) data gathering (DG) score, which reflects the candidate's ability to obtain a medical history and perform a physical examination; (2) written patient note (SOAP note) score, which reflects the candidate's written communication skills and ability to synthesize information, develop a differential diagnosis, and formulate a diagnostic and treatment plan; and (3) osteopathic manipulative treatment (OMT) score, which reflects the candidate's ability to integrate osteopathic principles and perform OMT. For data gathering, case-specific checklist items are developed and reviewed using evidence-based medicine and best practices by expert consensus of osteopathic physicians with expertise in case development. The SP records data gathering performance after each encounter using a checklist containing dichotomously scored items that reflect questions asked about the medical history of the patient and physical examination maneuvers performed. The encounter-level data gathering score is then computed as the percentage of history taking and physical examination checklist items recorded as done.
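The encounter-level data gathering computation described above can be illustrated with a minimal sketch. It assumes that the percentage is taken over all history taking and physical examination items on the case checklist; the function name and the example checklist are hypothetical and are not taken from the examination.

```python
def data_gathering_score(items_done):
    """Percentage of dichotomous checklist items recorded as done (1) in one encounter.

    Assumption: the denominator is the total number of checklist items for the case.
    """
    if not items_done:
        raise ValueError("checklist must contain at least one item")
    return 100.0 * sum(items_done) / len(items_done)


# Hypothetical 20-item checklist: 12 history taking items and 8 physical examination items.
history = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
physical = [1, 0, 1, 1, 1, 1, 0, 1]

score = data_gathering_score(history + physical)
print(score)  # 16 of 20 items done -> 80.0
```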
For each clinical skill component (GPA, DG, SOAP note, and OMT), SP and rater encounter level scores are averaged across encounters to compute the candidate's mean clinical skill component score. Mean clinical skill component scores are calibrated to adjust for SP and rater leniency and stringency. The Humanistic Domain scale consists of the calibrated GPA mean. The Biomedical/Biomechanical Domain scale is the weighted sum of the calibrated DG, SOAP note, and OMT means. Pass/fail standards are determined using examinee-centered standard setting methods that focus on deriving a cutscore that appropriately reflects minimal competency for entry into graduate medical education. The validity and reliability of these measures have been reported in previous studies.1,2,3,5
The number of employed SPs varies throughout the testing cycle, depending on testing volume and case utilization. Of the 51 SPs available during the study period (July 2009), 49 SPs volunteered to participate in the study. SPs assigned to more than one case were excluded to ensure the independence of the sample. The sample of 40 SPs that was retained for our study was comparable with the entire pool of SPs used for the examination and was subjected to the same rigorous training and quality assurance protocols. Training of SPs is a competency-based process coordinated and monitored by SP trainers, physician trainers, experts in doctor-patient communication skills, and psychometricians; SPs must demonstrate proficiency in portrayal and scoring before participating in live examinations. In addition to age, gender, and educational background, SPs who volunteered to participate were asked to identify themselves as being performing artists or nonperforming artists. Examples of performing artists included actor, improviser, singer, comedian, dancer, and other. SP participants all signed an informed consent agreement, and Institutional Review Board approval was received to collect and analyze these data. SPs were grouped by those who identified themselves as performing artists (N = 30) and those who identified themselves as nonperforming artists (N = 10).
Standard and Quality Assurance Clinical Encounter Checklist Ratings
Case-specific history taking and physical examination checklist ratings (standard ratings) were documented by SPs immediately after the clinical encounter during the testing day. Quality Assurance (QA) data were collected throughout the 2008–2009 testing cycle to provide a second set of scores via review of videotaped recordings. All SPs received the same level of supervised training and quality assurance monitoring for both methods of scoring. SPs scoring these live encounters were limited to the 9 minutes allotted for the patient note task after the patient encounter was completed. Conversely, these time limits were not imposed for SPs completing the second set of scoring (QA) via video review. SPs scoring via video were also afforded the opportunity of reviewing the encounter from two different camera angles, minimizing the number of obstructed views.
Agreement between standard and QA ratings was used as the measure of checklist accuracy, coded dichotomously as 1 = agree and 0 = disagree. The proportion of agreement was computed by averaging agreement across checklist items and examinees, ranging from total disagreement (0) to perfect agreement (1). To ensure the independence of the two ratings, only encounters where the live rater and the QA rater were two different raters were included for analysis. The final sample for analysis included 1972 encounters from the 2008–2009 testing cycle over 40 cases among 40 SPs with checklist length ranging from 16 to 22 items per case. Both uncorrected and corrected measures of agreement were computed for history taking, physical examination, and overall checklist accuracy. For the corrected measures of agreement, Cohen's Kappa (κ) coefficient was computed to assess the proportion of agreement above and beyond chance agreement.23 Mean overall data gathering accuracy ratings for history taking (DGHx), physical examination (DGPE), and overall data gathering (DGoverall) were computed for each SP in the study.
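For readers less familiar with the two agreement measures, the following minimal sketch computes the uncorrected proportion of agreement and Cohen's Kappa for a pair of dichotomous rating vectors. The ratings shown are hypothetical and serve only to illustrate the arithmetic; they are not study data, and the function names are illustrative.

```python
def proportion_agreement(live, qa):
    """Share of paired ratings on which the live and QA ratings agree."""
    if len(live) != len(qa) or not live:
        raise ValueError("rating vectors must be non-empty and of equal length")
    return sum(a == b for a, b in zip(live, qa)) / len(live)


def cohen_kappa(live, qa):
    """Cohen's Kappa for two dichotomous (0/1) rating vectors."""
    n = len(live)
    p_observed = proportion_agreement(live, qa)
    p_live_done = sum(live) / n
    p_qa_done = sum(qa) / n
    # Agreement expected by chance if the two raters scored independently.
    p_chance = p_live_done * p_qa_done + (1 - p_live_done) * (1 - p_qa_done)
    if p_chance == 1.0:
        return float("nan")  # Kappa is undefined when chance agreement is total.
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical ratings of one checklist item across 10 encounters (1 = done, 0 = not done).
live_ratings = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
qa_ratings = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]

agreement = proportion_agreement(live_ratings, qa_ratings)  # 8 of 10 pairs agree
kappa = cohen_kappa(live_ratings, qa_ratings)
print(agreement, round(kappa, 3))
```

In this example the raters agree on 80% of encounters, but because both record "done" 70% of the time, chance agreement is 0.58 and Kappa is considerably lower than the raw proportion.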
Survey data and SP accuracy data were organized and analyzed by SP performing artist group (SPperforming artist or SPnonperforming artist).
Because of the relatively small numbers of SPs in each group with unknown population distributions, the assumption of normality for the mean percentage of agreement might not be met. Therefore, the nonparametric Wilcoxon-Mann-Whitney exact test for independent samples was used to test whether distributions of agreement differed between SPperforming artist and SPnonperforming artist for history taking, physical examination, and total test items.24 This test allows analysis to determine whether there is a significant difference in the distribution of agreement (ie, median, variance, skewness, and kurtosis) between the two performing artist groups. A nominal Type I error rate of 0.05 was used for all three analyses.
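The logic of an exact Wilcoxon-Mann-Whitney test can be sketched as follows. This is an illustration only and not the SAS procedure cited above: it assumes that there are no tied values, counts every possible assignment of ranks to the smaller group by dynamic programming, and reports the two-sided probability of a rank sum at least as far from its null expectation as the one observed. The accuracy values are hypothetical.

```python
from math import comb


def rank_sum_null_counts(n_total, n_group):
    """counts[s] = number of ways to pick n_group distinct ranks from 1..n_total summing to s."""
    max_sum = sum(range(n_total - n_group + 1, n_total + 1))
    ways = [[0] * (max_sum + 1) for _ in range(n_group + 1)]
    ways[0][0] = 1
    for rank in range(1, n_total + 1):
        # Descend over group size so that each rank is used at most once.
        for k in range(min(rank, n_group), 0, -1):
            for s in range(max_sum, rank - 1, -1):
                ways[k][s] += ways[k - 1][s - rank]
    return ways[n_group]


def exact_wmw_two_sided(x, y):
    """Return (rank sum of x, exact two-sided P). Assumes no ties in the pooled data."""
    pooled = sorted(x + y)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes no tied values")
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    w_observed = sum(rank[value] for value in x)
    n, m = len(pooled), len(x)
    counts = rank_sum_null_counts(n, m)
    twice_mean = m * (n + 1)  # twice the null mean, kept as an integer
    deviation = abs(2 * w_observed - twice_mean)
    extreme = sum(c for s, c in enumerate(counts) if abs(2 * s - twice_mean) >= deviation)
    return w_observed, extreme / comb(n, m)


# Hypothetical mean accuracy proportions for two small groups of SPs.
group_a = [0.91, 0.93, 0.95]
group_b = [0.92, 0.94, 0.96, 0.97, 0.98]

w, p_value = exact_wmw_two_sided(group_a, group_b)
print(w, p_value)  # rank sum 9; 14 of 56 assignments are as extreme -> 0.25
```

Real accuracy proportions frequently contain ties, in which case the null distribution must be built from midranks rather than from the integer ranks used here.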
As shown in Table 1, among the sample of SPs, 30 (75%) identified themselves as performing artists and 10 (25%) identified themselves as nonperforming artists. Of study respondents, 25 (62.5%) were male and 15 (37.5%) were female. No significant difference was shown between performing artist groups by gender. Of the female SPs, 10 of 30 were in the performing artist group, and 5 of 10 were in the nonperforming artist group (P = 0.457, two-tailed Fisher exact test). The mean reported age was 45 (range, 23–78 years). There was no significant difference with regard to age on average between performing artists (mean = 42.8, standard deviation = 14.2) and nonperforming artists (mean = 52.5, standard deviation = 16.8), t(38) = 1.788, P = 0.082. A wide range of educational backgrounds was observed within both groups. Level of degree attained was not significantly different between the performing artist group and the nonperforming artist group, χ2(3, N = 40) = 2.94, P = 0.401.
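The gender and age comparisons can be re-derived from the counts and summary statistics reported above. The sketch below assumes the usual two-sided Fisher exact definition (summing the probabilities of all tables no more likely than the observed one) and a pooled-variance t statistic, which is consistent with the reported 38 degrees of freedom; the function names are illustrative.

```python
from math import comb, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact P for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denominator = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k counts in the top-left cell.
        return comb(row1, k) * comb(row2, col1 - k) / denominator

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance so the observed table itself is always counted.
    return sum(prob(k) for k in range(lowest, highest + 1)
               if prob(k) <= p_observed * (1 + 1e-9))


def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student t statistic with pooled variance, from summary statistics."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Rows: performing, nonperforming artists; columns: female, male (counts from the text).
p_gender = fisher_exact_two_sided(10, 20, 5, 5)

# Age: nonperforming artists (n = 10) versus performing artists (n = 30).
t_age = pooled_t(52.5, 16.8, 10, 42.8, 14.2, 30)

print(round(p_gender, 3), round(t_age, 3))
```

Under these assumptions the sketch returns values that round to the reported Fisher exact P value of 0.457 and t statistic of 1.788.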
Cohen's Kappa (κ) coefficient was computed for each checklist item to determine agreement above and beyond chance agreement between the live and QA ratings. Mean and standard deviation Kappa coefficient values for history taking, physical examination, and overall checklist accuracy ratings are presented in Table 2. Mean Kappa coefficient values were 0.794 for history taking items, 0.789 for physical examination items, and 0.792 for overall data gathering. Using the guidelines of Landis and Koch25 for magnitude, the results are indicative of substantial agreement between live and QA raters above and beyond chance for all three measures.
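The Landis and Koch benchmarks referred to above can be written as a simple lookup. The cut points below follow the commonly quoted bands from that guideline; the function name is illustrative.

```python
def landis_koch_label(kappa):
    """Map a Kappa value to the descriptive band of Landis and Koch."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Mean Kappa values reported in the text.
reported = {"history taking": 0.794, "physical examination": 0.789, "overall": 0.792}
labels = {name: landis_koch_label(value) for name, value in reported.items()}
print(labels)
```

All three reported means fall in the 0.61 to 0.80 band, which the guideline labels as substantial agreement.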
Wilcoxon-Mann-Whitney tests of equal locations were carried out on the mean percentages of checklist accuracy between SPperforming artists and SPnonperforming artists for each of the following scores: DGHx, DGPE, and DGoverall. There was no statistically significant difference between the two groups with respect to any of the mean accuracy measures: history taking (z = −0.422, P = 0.678), physical examination (z = −1.453, P = 0.072), and overall data gathering (z = −0.812, P = 0.417) checklist items.
Consistent with previous studies,12 SPs who participated in the study varied in age, gender, and educational background. In addition, SPs reported a wide range of educational degrees: 10% did not have an educational degree, 5% had an associate degree, 62.5% had a bachelor's degree, and 22.5% had a graduate degree.
Data gathering checklist accuracy between SPperforming artists and SPnonperforming artists did not differ for history taking, physical examination, or overall data gathering. Both groups demonstrated high levels of precision for a clinical skills examination with accuracy measures all above 92.5%, meeting or exceeding those found in previous studies.12–14,16–19,26 For instance, De Champlain et al26 reported average proportion of agreement rates of 0.90.
Given the extensive training and quality assurance protocols required for all SPs used for high-stakes clinical skills testing,27 it is not surprising that SPs completed checklists with the same level of precision regardless of performing arts background. The study sample included only those SPs who completed the competency-based training program and were eligible to score live encounters. Checklist accuracy as measured throughout the training period was not investigated in this study. With regard to scoring, would SPs with performing arts experience require less training than SPs without experience? Because previous studies have demonstrated that SPs with acting experience have superior portrayal accuracy15 and performing artists may have received additional training to improve memory skills,21 it is plausible to hypothesize that SPs with performing arts backgrounds are advantaged during the training period. Studying how SP characteristics affect scoring during the training period would be of interest for future study. In this study, SPs from both groups were subjected to the same rigorous competency-based training and continual quality assurance. Therefore, any baseline differences between the groups were reduced so that SPs who completed the training program demonstrated high levels of scoring precision, regardless of performing arts background.
Various quality assurance methods have been used to assess SP accuracy and reliability.19 For this study, the difference between live scores and scores obtained during videotape review was used to calculate mean accuracy ratings. Videotaped review is considered more accurate, because SPs are not limited by the examination timing and have the opportunity to rewind and replay candidate performance. As summarized by Heine et al,19 a majority of studies that address SP accuracy use inter-rater reliability between SPs and/or other observers; SP ratings may be compared with ratings of other SPs, expert ratings, faculty, or combinations of the above. For the purposes of this study, SPlive-SPvideotape review checklist scoring comparisons were chosen as the measure of inter-rater reliability and accuracy.
Though encouraging, this study has a few limitations. First, although the sample includes a large group of SPs encountered by osteopathic medical students, it may not be representative of all SPs trained for use in medical licensure or education. To further evaluate the accuracy of SPperforming artists versus SPnonperforming artists and the generalizability of the results, the study should be replicated at other institutions, perhaps both in training and high-stakes testing institutions. Second, analysis was based on SPs' self-identification as performing artist or nonperforming artist. Further analysis could involve correlating checklist item accuracy with other detailed SP characteristics, such as type of performing artist, amount of performance experience, or number of venues performed. Third, the number of history and physical examination checklist items varied between cases. The relationship between checklist length and recording accuracy has been reported19; this relationship was not addressed in the current study but could be a topic for future research. In addition, the focus of this study was to investigate the accuracy of recording checklist items and not the accuracy in portrayal or in recording global assessments related to communication or interpersonal skills. Investigating portrayal accuracy, as it relates to performing arts background among SPs, is of interest and may be a topic for future study.
SPs are often stereotyped as “actors” with theatrical performance experience. However, results of the study show that 25% of respondent SPs did not self-identify as an actor or other performing artist. Data gathering checklist accuracy between performing artists and nonperforming artists did not differ in a statistically significant fashion. Results show high levels of precision when recording history taking and physical examination checklist items for both groups. Therefore, results suggest that individuals with or without performance backgrounds can be recruited and used for high-stakes clinical skills licensing examinations such as the COMLEX Level 2-PE, without sacrificing examination integrity or scoring accuracy.
The authors thank Kristie Lang for her critical review of the manuscript and all the SPs who participated in the study for their dedication and commitment to the National Board of Osteopathic Medical Examiners and COMLEX Level 2-PE.
1. Langenau E, Dyer C, Roberts WL, Wilson CD, Gimpel JR. Five year summary of COMLEX-USA Level 2-PE examinee performance and survey data. J Am Osteopath Assoc
2. Boulet JR, Smee SM, Dillon GF, Gimpel JR. The use of standardized patient assessments for certification and licensure decisions. Simul Healthc
3. Gimpel JR, Boulet JR, Errichetti AM. Evaluating the clinical skills of osteopathic medical students. J Am Osteopath Assoc
4. Barrows HS. An overview of the uses of standardized patients for teaching and evaluating clinical skills. Acad Med 1993;68:443–451; discussion 451–453.
5. Errichetti AM, Gimpel JR, Boulet JR. State of the art in standardized patient programs: a survey of osteopathic medical schools. J Am Osteopath Assoc
6. Townsend AH, McIlvenny S, Miller CJ, Dunn EV. The use of an objective structured clinical examination (OSCE) for formative and summative assessment in a general practice clinical attachment and its relationship to final medical school examination performance. Med Educ
7. Prislin MD, Fitzpatrick CF, Lie D, Giglio M, Radecki S, Lewis E. Use of an objective structured clinical examination in evaluating student performance. Fam Med
8. Altshuler L, Kachur E, Krinshpun S, Sullivan D. Genetics objective structured clinical exams at the Maimonides Infants & Children's Hospital of Brooklyn, New York. Acad Med
9. Kligler B, Koithan M, Maizes V, et al. Competency-based evaluation tools for integrative medicine training in family medicine residency: a pilot study. BMC Med Educ
10. Cohen R, Reznick RK, Taylor BR, Provan J, Rothman A. Reliability and validity of the objective structured clinical examination in assessing surgical residents. Am J Surg
11. van Zanten M, Boulet JR, McKinley D. Using standardized patients to assess the interpersonal skills of physicians: six years' experience with a high-stakes certification examination. Health Commun
12. Furman GE. The role of standardized patient and trainer training in quality assurance for a high-stakes clinical skills examination. Kaohsiung J Med Sci
13. Williams RG. Have standardized patient examinations stood the test of time and experience? Teach Learn Med
14. Colliver JA, Reed WG. Session two: technical issues: test application. Acad Med
15. Tamblyn RM, Klass DK, Schanbl GK, Kopelow ML. Factors associated with the accuracy of standardized patient presentation. Acad Med
16. De Champlain AF, Margolis MJ, King A, Klass DJ. Standardized patients' accuracy in recording examinees' behaviors using checklists. Acad Med
17. Pangaro LN, Worth-Dickstein H, MacMillan MK, Klass DJ, Shatzer JH. Performance of “standardized examinees” in a standardized-patient examination of clinical skills. Acad Med
18. Elliot DL, Hickam DH. Evaluation of physical examination skills: reliability of faculty observers and patient instructors. JAMA
19. Heine N, Garman K, Wallace P, Bartos R, Richards A. An analysis of standardized patient checklist errors and their effects on student scores. Med Educ
20. Noice H. Effects of rote versus gist strategy on the verbatim retention of theatrical script. Appl Cogn Psychol
21. Noice H, Noice T. What studies of actors and acting can tell us about memory and cognitive functioning. Curr Dir Psychol Sci
22. Boulet JR, Gimpel JR, Errichetti AM, Meoli FG. Using National Medical Care Survey data to validate examination content on a performance-based clinical skills assessment for osteopathic physicians. J Am Osteopath Assoc
23. Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas
24. Narayanan A, Watts D. Exact methods in the NPAR1WAY Procedures. SUGI Proceedings, 1996. Available at: http://support.sas.com/rnd/app/papers/exact.pdf. Accessed October 28, 2010.
25. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics
26. De Champlain AF, Macmillan MK, Margolis MJ, King AM, Klass DJ. Do discrepancies in standardized patients' checklist recording affect case and examination mastery-level decisions? Acad Med
27. Furman GE, Smee SM, Wilson CD. Quality assurance best practices for simulation-based examinations. Simul Healthc