The need for reliable and valid assessment of the many facets of clinical competence has resulted in the development of a variety of assessment instruments. In 1999, Pangaro1 designed a new instrument, the Reporter-Interpreter-Manager-Educator (RIME) model, for in-training evaluation of medical students. Since then, the RIME model has gained widespread popularity and acceptance in many different institutions and specialties.2–7 The RIME framework is popular among students and faculty5,8 and is considered reliable for in-training assessment purposes.9 The model possesses the potential to reflect progression in clinical competence from novice to expert10 through four stages: the level of gathering information (Reporter), analyzing and prioritizing patient problems (Interpreter), managing a plan for the patient (Manager), and demonstrating reflection and education of others (Educator).11 The popularity of the RIME framework is in part explained by the ease of relating it to Bloom’s taxonomy of learning,12 as the four elements represent a developmental framework of progressively higher cognitive skills achieved by the learner—that is, from data gathering to analysis, synthesis, and evaluation.3
However, although the RIME elements are assumed to support desirable professional development of clinical reasoning processes,1,11 the empirical evidence of this kind of construct validity is sparse. One study demonstrated that, in a group of third-year students, two-thirds were at the Manager/Educator level of the RIME framework, and their RIME scores had clear associations with their end-of-clerkship examination scores.2 This provides some evidence of construct and concurrent validity of the RIME framework. However, whether the RIME elements possess construct validity, in terms of reflecting progress in clinical competence, has never been demonstrated.
The RIME model was designed for in-training assessment purposes using a panel of assessors for evaluating students’ clinical performance during clerkships.1,13 However, the four RIME elements may also serve as a structure for end-of-clerkship oral exams of students’ patient encounter skills.14 Adding a bit of structure in oral exams can be of help to clinician examiners and improve reliability of the assessment.15 The RIME framework could well serve as this structure. Whether a RIME-structured scoring form used for end-of-clerkship oral exams is reliable and feasible in the hands of clinician examiners is not known.
The aim of this study, carried out in Denmark, was to explore the validity, reliability, and use of a scoring format structured according to the four RIME elements for assessing students’ competence in managing the patient encounter. The aim was pursued in two steps.
- The first step was to investigate the construct validity in an experimental study. The research question was, “Do assessment scores on the four RIME elements progress according to students’ increasing clinical experience?”
- The second step was to investigate the reliability and use of a RIME-structured scoring form for the oral examination of students’ patient encounter skills. The research question was, “How do clinician examiners in surgery and internal medicine use a RIME-structured scoring form in the oral examination of students’ patient encounter skills, and what is the interrater reliability of the scorings?”
We designed a structured scoring form for assessing students’ patient encounter skills according to the four RIME elements. The scoring form had four categories of items, one for each RIME element (see the Appendix for a copy of the form). Each RIME category included two to three items to be assessed using a five-point rating scale (0–4, with 4 the highest score), giving rise to a possible maximum score of 60. In addition, the scoring form included an overall global mark on a scale of 1 to 7. The content of the RIME elements was adapted on a few issues to fit the purpose of assessing patient encounter skills. One item regarding professionalism and psychosocial/ethical patient problems was included in the Interpreter category. This was done according to the growing evidence of the need for addressing these issues.16 The Educator category was adapted to include items of individual learning goals related to specific and general aspects of the patient case. This aimed at enabling the examiners’ evaluation of reflective learning skills and the issue of self-assessment—or “knowing when to look it up”17—which is central to professional development and important from a patient safety point of view. Moreover, this was anticipated to accommodate problems of varying difficulty in cases used for end-of-clerkship oral examinations.
Validity was explored in two steps: an experimental study and an observational study. Both are described below. Ethical approval, in terms of an exemption letter from the regional ethical committee of the Capital Region, Copenhagen, Denmark, was obtained before conducting the study.
The experimental study
The construct validity of the RIME structure was investigated in a randomized, controlled, single-blind trial.
Three groups, consisting of 16 fourth-year medical students, 16 sixth-year medical students, and 16 postgraduate year 1 interns, were included in the experimental study. The participants were recruited by mailing 430 fourth- and sixth-year medical students at the University of Copenhagen as well as 72 postgraduate year 1 interns in Eastern Denmark. Participants were recruited on a first-come, first-served basis and were paid a minor honorarium for their participation.
Four simulated patients (i.e., trained actors) were used to portray two different cases—one in surgery and one in internal medicine. The cases were common in nature (respiratory distress and acute abdomen). Each study participant completed both cases.
Four associate professors (two surgeons and two internists) were used as assessors. The associate professors were instructed to use the RIME scoring format and rate students’ performance according to the standards of performance expected of year 1 interns. The patient encounters were videotaped to allow a second independent judgment of each examinee. Thus, a live assessment and a video assessment were performed by two different associate professors. The assessors were blind to the experience levels of the participants. The cohorts of medical students at the University of Copenhagen were large, and the chance of familiarity between assessors and students or interns was considered minimal. (In fact, none of the associate professors recognized any of the participating students or interns.) The two surgeons rated only surgical cases, and the two internists rated only cases in internal medicine. A pilot study including four participants was performed to test the experimental setup and to let the assessors become acquainted with the RIME scoring format. Assessors were instructed to formulate each RIME element into open-ended questions that would lead the students through the framework.
The participants were instructed to decide how to manage each of the two patient encounters and answer the questions asked by the assessors. The participants had 25 minutes for each encounter.
Results were processed using SPSS 18 software. Missing values were replaced by means from within the category of items (e.g., if an item in the Reporter category was missing, it was replaced by the mean of the remaining items within the Reporter category). In total, 43 of 2,112 items (2.0%) were missing. Scorings were transformed into percentages of maximum scores for further analysis. Means and standard deviations (SDs) were calculated for the total RIME scores and for each of the four RIME categories. The three groups were compared on these variables using ANOVA with Bonferroni post hoc corrections for multiple comparisons. Effect sizes of differences were calculated using Cohen’s d. The association between total RIME scores and overall grades was examined using Pearson correlation. Finally, interrater and intercase reliability were estimated by calculating intraclass correlation coefficients (ICCs). Mean scores across raters were used for estimating intercase reliability.
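The imputation and rescaling steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS procedure actually used: the function names, the use of `None` for a missing rating, and the number of items per category in the example form are all hypothetical (the scoring form is described only as having two to three items per category, each rated 0 to 4).

```python
from statistics import mean

MAX_ITEM_SCORE = 4  # each item is rated on a 0-4 scale


def impute_category(items):
    """Replace missing ratings (None) with the mean of the observed
    ratings in the same RIME category."""
    observed = [x for x in items if x is not None]
    if not observed:
        raise ValueError("category has no observed ratings to impute from")
    fill = mean(observed)
    return [fill if x is None else x for x in items]


def percent_of_max(form):
    """Total score of one scoring form as a percentage of the maximum.

    `form` maps a RIME category name to its list of item ratings.
    """
    imputed = {cat: impute_category(items) for cat, items in form.items()}
    total = sum(sum(items) for items in imputed.values())
    n_items = sum(len(items) for items in imputed.values())
    return 100.0 * total / (n_items * MAX_ITEM_SCORE)


# Hypothetical form: item counts and ratings are made up for illustration.
example_form = {
    "Reporter": [3, None, 4],
    "Interpreter": [2, 3, 2],
    "Manager": [2, 2],
    "Educator": [1, None],
}
print(percent_of_max(example_form))
```

Imputing within a category rather than across the whole form keeps a missing Educator item from being filled with a typically higher Reporter rating.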
The observational study
The RIME-structured scoring form was introduced as the framework for the end-of-clerkship oral examination of three consecutive cohorts of fourth-year students at our university from spring 2009 to summer 2010. This form included the possibility of rating an item “not relevant.”
Context of the study.
The medical program at our university, the University of Copenhagen, consists of a six-year traditional curriculum with basic science teaching during the first years and clinical sciences during the last years. Students have their first clerkship in their fourth year, eight weeks in internal medicine and eight weeks in surgery. The standard procedure for the end-of-fourth-year oral exam is based on an assessment of students’ performance on real patients in a summative, 30-minute, oral examination session. In 2009, the Centre for Clinical Education, University of Copenhagen, was asked to provide a scoring form for the exam as it went from a pass/fail to a graded exam without any increase in time. This resulted in the introduction of the RIME-structured scoring format. Students and examiners were informed about the RIME scoring format through presentations, workshops, Web-based communications, and written information. The oral exam focused on the management of a single patient encounter in either general medicine or surgery. The examiners used the RIME-based scoring form during direct observation of student performance (see the Appendix for a copy of the form). The examiners formulated each RIME element as open-ended questions that would lead the students through the framework. However, there was no formal requirement to attend examiner training, which could range from two-hour workshops to 30-minute presentations in the clinical wards, and from written material to instructional e-mails.
The students were assessed in the clinical department that served as their second clerkship site (i.e., internal medicine for half of the students and surgery for the other half). One internal and one external examiner assessed each student. All examiners were clinicians at the academic level of associate professor or professor. The examiners rated each student’s performance according to the RIME scoring form, including giving an overall final grade according to what could be expected of students at this level. Thus, for each student, two separate RIME scoring forms were completed and sent to the research group for data analysis.
A total of 677 fourth-year students were included in three cohorts; these were all the fourth-year students at our university from spring 2009 to summer 2010.
Data were processed using SPSS 18 software. Scorings were calculated into percentages of maximum score for further data analysis. Means and SDs were calculated for the total RIME scores and for each of the four RIME categories. The four RIME category scores were compared across the three cohorts using ANOVA with Bonferroni correction for post hoc comparisons. The distribution of all scores including missing values and the use of the label “not relevant” in each RIME category was analyzed. Finally, ICCs were used for estimating interrater reliability.
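For readers unfamiliar with ICCs, the following Python sketch computes one common variant, the one-way random-effects, single-rater coefficient often written ICC(1,1). The choice of variant is an assumption made for illustration; the ICC model used in the analysis is not specified here, and other models would give somewhat different values. The function name and the toy ratings are hypothetical.

```python
def icc_oneway(ratings):
    """ICC(1,1): one-way random-effects, single-rater intraclass correlation.

    `ratings` holds one row per student; each row contains the scores
    given by the k examiners who rated that student.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 students, each with the same k >= 2 ratings")

    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]

    # Between-student and within-student mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("ICC is undefined when all ratings are identical")
    return (ms_between - ms_within) / denominator


# Made-up percentage scores from an internal and an external examiner.
toy_ratings = [[80, 85], [60, 70], [90, 88], [75, 65]]
print(icc_oneway(toy_ratings))
```

A one-way model fits a design in which each student is rated by a different pair of examiners, which is why it was chosen for this sketch.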
The experimental study
The RIME-structured scoring forms, containing the assessors’ scores of the directly observed and video-observed performances of all 48 participants, were collected from the assessors. There were statistically significant differences in total RIME scores between the three groups: mean 41.7 (SD 11.0) for fourth-year students, 48.2 (SD 10.9) for sixth-year students, and 61.9 (SD 8.5) for interns, ANOVA, P < .0001. A post hoc analysis demonstrated statistically significant differences between sixth-year students and interns (P < .001) and between fourth-year students and interns (P < .0001), but not between fourth-year and sixth-year students (P = .24).
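As a rough plausibility check, the omnibus test can be approximated from the summary statistics alone. The Python sketch below assumes one total score per participant (n = 16 per group) and treats the reported SDs as sample SDs; neither assumption is confirmed by the text, and the function name is hypothetical. It is not a reproduction of the SPSS analysis.

```python
def anova_from_summary(groups):
    """One-way ANOVA F statistic from per-group (n, mean, sd) tuples."""
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n

    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)

    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# (n, mean, SD) for fourth-year students, sixth-year students, and interns.
groups = [(16, 41.7, 11.0), (16, 48.2, 10.9), (16, 61.9, 8.5)]
f_stat, df_b, df_w = anova_from_summary(groups)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

Under these assumptions the statistic is about 16.4 on 2 and 45 degrees of freedom, which is in line with the reported P < .0001.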
The scores of each RIME category progressed with increasing levels of experience (see Figure 1), and the variance of scores on each category decreased with increasing levels of experience. There was a moderate progression of scores related to the Reporter category and a substantial progression in scores on the Manager and Educator categories based on effect sizes (see Table 1). Furthermore, post hoc comparisons showed significant differences between fourth-year students and interns in all RIME elements as well as between sixth-year students and interns in elements of the Manager and Educator categories. There were no statistical differences between fourth-year and sixth-year students in Interpreter and Manager scores, but there was a statistically significant difference regarding Educator scores (see Table 1).
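To illustrate how such effect sizes are obtained, the sketch below computes Cohen’s d with a pooled SD from the total RIME scores reported above (not the per-category values in Table 1, which are not reproduced here). It assumes n = 16 per group and sample SDs; the function name is hypothetical, and the resulting values are illustrative rather than figures reported by the study.

```python
from math import sqrt


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    return (mean_b - mean_a) / sqrt(pooled_var)


# Total RIME scores (percentage of maximum): mean, SD, n.
fourth_year = (41.7, 11.0, 16)
sixth_year = (48.2, 10.9, 16)
interns = (61.9, 8.5, 16)

d_fourth_vs_sixth = cohens_d(*fourth_year, *sixth_year)
d_sixth_vs_interns = cohens_d(*sixth_year, *interns)
d_fourth_vs_interns = cohens_d(*fourth_year, *interns)
print(d_fourth_vs_sixth, d_sixth_vs_interns, d_fourth_vs_interns)
```

By the usual conventions, values around 0.5 are considered moderate and values above 0.8 large.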
The correlation between RIME scores and overall grades was high, Pearson r = 0.95. The interrater reliability was ICC = 0.53. The intercase reliability was only ICC = 0.08.
The observational study
Fully or partially completed RIME examination scoring forms from 547 (80.8%) students were returned. Cases with 50% or more missing values in either internal or external examiners’ scoring forms were excluded, n = 94 (17.2%). The scoring forms from the remaining 453 students were used for the final data analysis.
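The sample flow can be checked with simple arithmetic. The short Python sketch below uses the counts given in the text; the variable names are illustrative. Note that the exclusion percentage is relative to the 547 returned forms, not to the 677 enrolled students.

```python
students_enrolled = 677  # all fourth-year students in the three cohorts
forms_returned = 547     # students with fully or partially completed forms
excluded = 94            # 50% or more missing values in either examiner's form

analysed = forms_returned - excluded
return_rate = 100 * forms_returned / students_enrolled
exclusion_rate = 100 * excluded / forms_returned

print(f"returned: {return_rate:.1f}% of enrolled")
print(f"excluded: {exclusion_rate:.1f}% of returned")
print(f"analysed: {analysed} students")
```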
The mean RIME score for the three consecutive cohorts of fourth-year students was 83.8 (SD 15.5). The distribution of scores across the RIME categories evened out over the three consecutive terms. There were significant differences in Educator scores between the three cohorts (ANOVA, P = .033), as the students scored higher on the Educator category in the second and third terms (see Figure 2). The interrater reliability was 0.74 for the RIME scores and 0.96 for the overall grades. The distribution of scores in each RIME category, including missing values and “not relevant,” is listed in Table 2. The ethical/psychosocial item, as well as all items in the Educator category, had many missing values and were more frequently labeled “not relevant.”
This study explored the construct validity of a scoring form structured according to the RIME elements in an experimental study and the use of the scoring form for oral examination of students’ patient encounter skills in an observational study.
The experimental study showed that the RIME-structured scoring form discriminated significantly between three groups whose members had increasing levels of clinical experience. The study also demonstrated a progression across the four elements of the RIME framework according to the participants’ increasing levels of clinical competence. These results support the construct validity of the RIME structure as indicated in previous empirical studies and in accordance with theoretical anticipations of development of competence through the RIME framework.1,5–7 The study shows that some levels of expertise, such as the Manager and Educator ones, are not obtained until the final years in medical school and during the first postgraduate year. This is supported by previous studies1,2,11 and corresponds well to anticipations of students’ development of competence.11,18 In the experimental study, there was a span of six years of experience in the three groups, and, hence, we do not know whether the instrument can discriminate progression of clinical experience across a single year of study. This problem has been described previously using the mini-CEX to monitor progress in a cohort of residents across one year.19
The observational study supported the findings from the experimental study regarding the distribution of the RIME scores in the first cohort of fourth-year students. However, the scores evened out over the subsequent two cohorts. It is possible that students over time became increasingly aware of what was being assessed—in other words, “students respect what we inspect.”20 It might also be explained by the use of the framework by the associate professors.
The acceptability of the framework in the observational study can be questioned, as the RIME-structured scoring forms were returned with many missing values or many items ticked off as “not relevant”—in particular, the items under psychosocial problems and all items in the Educator category. Although we encouraged written comments, these were very sparse and did not reveal any explanations concerning this particular problem. Previous studies demonstrate that psychosocial skills are not sufficiently prioritized by junior medical students, although there seems to be consensus on the importance of these skills within the literature.16,21 Thus, there is a discrepancy between what the literature suggests is important to assess and the skills that are assessed by clinician examiners. However, it is possible that nonbiomedical issues might simply not have been relevant for the cases selected for the examination of these fourth-year students. As pointed out above, our findings suggest that most fourth-year students are not at the Manager and Educator levels. Hence, clinicians might find it futile to assess competencies that are not yet developed or expected from the students. Yet, clinician examiners might also regard some of the nonbiomedical issues as less important than other issues of medical expertise.22 Furthermore, the use of problem-based learning is not common at our university, and therefore the participating students might consider that formulating learning goals would be a less important issue of competence. A pilot study of examiners’ use of the framework and a post hoc study of examiners’ explanations of the missing values and “not relevant” responses could have clarified these issues. Finally, formal training of examiners could have improved the use of the scoring form. However, although extensive programs were offered, attendance was not mandatory.
The observational study indicates that a RIME-structured scoring format is a reliable and feasible framework for end-of-clerkship examinations. The reliability coefficients in the observational study were higher than in the experimental study, which might reflect the lack of blinding between the two assessors. However, recent literature supports the finding of increased reliability coefficients using real patients rather than simulated patients.23
The mean scores for fourth-year students were higher in the observational study than in the experimental study (83.8 versus 41.7). There are several possible explanations for that. First, the gold standard in the experimental study was the clinical competence level of interns, whereas the standard for the global overall rating in the observational study was the clinical competence level that could be expected of end-of-fourth-year students. Apparently, the participating clinician assessors also applied this standard to the RIME scoring. The inflation of the scorings over the three cohorts indicates that examiners became increasingly aware of reasonable expectations. Another explanation of the high scorings in the observational study is the fact that the clinician assessors were also the students’ tutors throughout the clerkships. This conflict of role as tutor and assessor may have led to higher scores in general.24 Finally, although the cases used in the observational study might have been more complicated than those in the experimental study, previous studies demonstrate that examiners tend to overcompensate for difficult cases.19
Recent literature suggests that expert examiners tend to interpret student performance in a more holistic and broad context of the assessment task, making examiners less focused on specific checklist aspects. It has been proposed that very elaborate and detailed rating forms disrupt the pattern-recognition processes used by expert examiners.25 Another recent study points out that assessors may agree over the performance of the trainee but disagree over the interpretation of the meaning of the response format. Hence, construct alignment of scales is recommended to improve the use of rating scales.26 Ginsburg27 argues that the scoring frameworks do not always fit with the way assessors conceptualize trainee performance. The RIME-based scoring form is aligned well to the constructs suggested by Crossley et al26 in that it provides a descriptive framework for progressively higher levels of competence. Yet, clinician examiners may perceive the content of clinical competence differently from how it is outlined in a theoretical construct-aligned rating form. For instance, clinicians might find some issues to be of little importance. Hence, a mismatch between the outline of the framework and the way trainee performance is perceived, or a disagreement on concepts of competence, might affect the use of rating forms. This should be considered when evaluating assessment instruments in the future.
It is important to note that the assessment of far more cases than one or two is needed to assess students’ overall patient encounter skills, as indicated by the low intercase reliability. The two cases selected for the experimental study were different in content, and the skills needed to manage each case did not necessarily overlap. This confirms the existing literature on this subject, stating that broad sampling across a larger number of cases is needed for reliable assessment.28,29
Moreover, the use of the RIME framework in this study differs from how it was used in previous studies. This study focused only on the cognitive aspects of managing a single patient encounter in end-of-clerkship oral exams and did not use the RIME framework to evaluate participants’ attitudes, dedication, and sense of responsibility over time in the clinical setting. Hence, direct comparison with results from other settings should be made with caution.1–8
The practical implications of the experimental study are several. First, the results support that the RIME framework is a reliable and valid assessment instrument. Second, knowledge of how students progress through the RIME framework enables preceptors to make criterion-based judgments with reasonable expectations of when students become proficient Reporters, Interpreters, Managers, and Educators. This may help clinicians to identify problem students who are falling behind on particular aspects of competence and indicate areas for remediation.
In summary, this study demonstrates that, in an experimental setup, the RIME structure possessed construct validity in terms of reflecting progress in competence in managing patient encounters when assessed according to an advanced criterion. However, clinician examiners might tacitly score the elements according to what can be expected at a certain level of student experience.
Acknowledgments: The authors wish to thank participants in the experimental and the observational study as well as participating associate professors at the Faculty of Health Sciences, University of Copenhagen. The authors would also like to thank the students working at the Centre for Clinical Education, Rigshospitalet, for assisting the authors in conducting the experimental study.
Funding/Support: This study was funded by the Centre for Clinical Education, Rigshospitalet.
Other disclosures: None.
Ethical approval: Ethical approval in terms of an exemption letter from the regional ethical committee of the Capital Region, Copenhagen, Denmark, was obtained before conducting the study.
1. Pangaro LN. A new vocabulary and other innovations for improving descriptive in-training evaluations. Acad Med. 1999;74:1203–1207.
2. Griffith CH III, Wilson JF. The association of student examination performance with faculty and resident ratings using a modified RIME process. J Gen Intern Med. 2008;23:1020–1023.
3. Pangaro LN. A shared professional framework for anatomy and clinical clerkships. Clin Anat. 2006;19:419–428.
4. Hemmer PA, Papp KK, Mechaber AJ, Durning SJ. Evaluation, grading, and use of the RIME vocabulary on internal medicine clerkships: Results of a national survey and comparison to other clinical clerkships. Teach Learn Med. 2008;20:118–126.
5. Battistone MJ, Milne C, Sande MA, Pangaro LN, Hemmer PA, Shomaker TS. The feasibility and acceptability of implementing formal evaluation sessions and using descriptive vocabulary to assess student performance on a clinical clerkship. Teach Learn Med. 2002;14:5–10.
6. Durning SJ, Pangaro LN, Denton GD, et al. Intersite consistency as a measurement of programmatic evaluation in a medicine clerkship with multiple, geographically separated sites. Acad Med. 2003;78(10 suppl):S36–S38.
7. Ogburn T, Espey E. The R-I-M-E method for evaluation of medical students on an obstetrics and gynecology clerkship. Am J Obstet Gynecol. 2003;189:666–669.
8. Espey E, Nuthalapaty F, Cox S, et al. To the point: Medical education review of the RIME method for the evaluation of medical student clinical performance. Am J Obstet Gynecol. 2007;197:123–133.
9. Roop S, Pangaro LN. Measuring the impact of clinical teaching on student performance during a third-year medicine clerkship. Am J Med. 2001;110:205–209.
10. Dreyfus HL, Dreyfus SE, Athanasiou T. Mind Over Machine: The Power of Human Intuition and Expertise in the Era of the Computer. New York, NY: Free Press.
11. Sepdham D, Julka M, Hofmann L, Dobbie A. Using the RIME model for learner assessment and feedback. Fam Med. 2007;39:161–163.
12. Krathwohl DR. A revision of Bloom’s taxonomy: An overview. Theory Into Practice. 2002;41:212–218.
13. Dewitt D, Carline J, Paauw D, Pangaro L. Pilot study of a “RIME”-based tool for giving feedback in a multi-specialty longitudinal clerkship. Med Educ. 2008;42:1205–1209.
14. Metheny PW, Espey EL, Bienstock J, et al. To the point: Medical education reviews evaluation in context: Assessing learners, teachers, and training programs. Am J Obstet Gynecol. 2005;192:34–37.
15. Davis MH, Karunathilake I. The place of the oral examination in today’s assessment systems. Med Teach. 2005;27:294–297.
16. Sanson-Fisher RW, Rolfe IE, Jones P, Ringland C, Agrez M. Trialling a new way to learn clinical skills: Systematic clinical appraisal and learning. Med Educ. 2002;36:1028–1034.
17. Eva K, Regehr G. Knowing when to look it up: A new conception of self-assessment ability. Acad Med. 2007;82:81–84.
18. Stephens MB, Gimbel RW, Pangaro L. The RIME/EMR scheme: An educational approach to clinical documentation in electronic medical records. Acad Med. 2011;86:11–14.
19. Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: A method for assessing clinical skills. Ann Intern Med. 2003;138:476–481.
20. Johnson DW, Johnson RT. Learning Together and Alone: Cooperative, Competitive, and Individualistic Learning. 5th ed. Boston, Mass: Allyn and Bacon.
21. Rolfe IE, Sanson-Fisher RW. Translating learning principles into practice: A new strategy for learning clinical skills. Med Educ. 2002;36:345–352.
22. Norgaard K, Ringsted C, Dolmans D. Validation of a checklist to assess ward round performance in internal medicine. Med Educ. 2004;38:700–707.
23. Reinders ME, Blankenstein AH, van Marwijk HWJ, et al. Reliability of consultation skills assessments using standardised versus real patients. Med Educ. 2011;45:578–584.
24. Streiner DL, Norman GR. Health Measurement Scales: A Practical Guide to Their Development and Use. 4th ed. Oxford, UK: Oxford University Press.
25. Govaerts MJB, Schuwirth WT, Van der Vleuten CPM, Muijtjens AMM. Workplace-based assessment: Effects of rater expertise. Adv Health Sci Educ. 2011;16:151–165.
26. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: Construct alignment improves the performance of workplace-based assessment scales. Med Educ. 2011;45:560–569.
27. Ginsburg S. Respecting the expertise of clinician assessors: Construct alignment is one good answer. Med Educ. 2011;45:546–548.
28. Sloan DA, Donnelly MB, Schwartz RW, Felts JL, Blue AV, Strodel WE. The use of the objective structured clinical examination (OSCE) for evaluation and instruction in graduate medical education. J Surg Res. 1996;63:225–230.
29. Wass V, Van der Vleuten C, Shatzer J, Jones R. Assessment of clinical competence. Lancet. 2001;357:945–949.